Seldon Core Features

Explore the enterprise MLOps capabilities of Core 2 for production ML deployment, featuring model serving, pipeline orchestration, and intelligent resource management for scalable ML systems.

After models are deployed, Core 2 enables monitoring of and experimentation on those systems in production. With support for a wide range of model types, and design patterns to build around those models, you can standardize ML deployment across a range of use cases on the cloud or on-premise serving infrastructure of your choice.

Model Deployment

Seldon Core 2 orchestrates and scales machine learning components running as production-grade microservices. These components can be deployed locally or in enterprise-scale Kubernetes clusters. The components of your ML system - such as models, processing steps, custom logic, or monitoring methods - are deployed as Models, leveraging serving solutions compatible with Core 2 such as MLServer, Alibi, LLM Module, or Triton Inference Server. These serving solutions package the required dependencies and standardize inference using the Open Inference Protocol. This ensures that, regardless of your model types and use-cases, all requests and responses follow a unified format. After models are deployed, they can process REST or gRPC requests for real-time inference.
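For illustration, a minimal Model resource might look like the following sketch (the model name and artifact location are hypothetical placeholders; the requirements tag tells Core 2 which type of server can run the artifact):

# Sketch: a minimal Model resource (hypothetical name and storageUri)
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-sklearn-model
spec:
  storageUri: "gs://my-bucket/models/my-sklearn-model"
  requirements:
  - sklearn

Applying this resource (for example with kubectl apply) registers the model with Core 2, which schedules it onto a compatible server and exposes it for REST or gRPC inference.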

Complex Applications & Orchestration

Machine learning applications are increasingly complex. They've evolved from individual models deployed as services, to complex applications that can consist of multiple models, processing steps, custom logic, and asynchronous monitoring components. With Core you can build Pipelines that connect any of these components to make data-centric applications. Core 2 handles orchestration and scaling of the underlying components of such an application, and exposes the data streamed through the application in real time using Kafka.
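As a sketch of what this looks like in practice, a Pipeline resource can chain two already-deployed Models so that the output of the first feeds the second (the model names below are hypothetical placeholders):

# Sketch: a two-step Pipeline chaining hypothetical models
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: preprocess-and-predict
spec:
  steps:
  - name: preprocess
  - name: predict
    inputs:
    - preprocess
  output:
    steps:
    - predict

Each step references a Model that has been deployed separately, and the data flowing between steps is streamed over Kafka.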

Data-centricity is an approach that places the management, integrity, and flow of data at the core of the machine learning deployment framework.

This approach to MLOps, influenced by our position paper Desiderata for next generation of ML model serving, enables real-time observability, insight, and control over the behavior and performance of your ML systems.

Lastly, Core 2 provides Experiments as part of its orchestration capabilities, enabling users to implement routing logic such as A/B tests or Canary deployments to models or pipelines in production. After experiments are run, you can promote new models and/or pipelines, or launch new experiments, so that you can continuously improve the performance of your ML applications.

Resource Management

In Seldon Core 2 your models are deployed on inference servers, which are software that manage the packaging and execution of ML workloads. As part of its design, Core 2 separates out Servers and Models as separate resources. This approach enables flexible allocation of models to servers, aligned with the requirements of your models and with the underlying infrastructure that you want your servers to run on. Core 2 also provides functionality to autoscale your models and servers up and down as needed based on your workload requirements or user-defined metrics.

With the modular design of Core 2, users are able to implement cutting-edge methods to minimize hardware costs:

  • Multi-Model serving consolidates multiple models onto shared inference servers to optimize resource utilization and decrease the number of servers required.

  • Over-commit allows you to provision more models than the available memory would normally allow by dynamically loading and unloading models from memory to disk based on demand.

End-to-End MLOps with Core 2

Core 2 demonstrates the power of a standardized, data-centric approach to MLOps at scale, ensuring that data observability and management are prioritized across every layer of machine learning operations. Furthermore, Core 2 integrates seamlessly into end-to-end MLOps workflows, whether that means CI/CD, managing traffic with the service mesh of your choice, alerting, data visualization, or authentication and authorization.

This modular, flexible architecture not only supports diverse deployment patterns but also ensures compatibility with the latest AI innovations. By embedding data-centricity and adaptability into its foundation, Core 2 equips organizations to scale and improve their machine learning systems effectively, to capture value from increasingly complex AI systems.

Next Steps

  • Explore our Tutorials and install Seldon Core 2

  • Join our Slack Community for updates or for answers to any questions

Production-ready ML Serving Framework

Discover Seldon Core 2, a Kubernetes-native MLOps framework for deploying ML and LLM systems at scale. Features flexible architecture, standardized workflows, and enhanced observability.

Seldon Core 2 is a Kubernetes-native framework for deploying and managing machine learning (ML) and Large Language Model (LLM) systems at scale. Its data-centric approach and modular architecture enable seamless handling of everything from simple models to complex ML applications across on-premise, hybrid, and multi-cloud environments while ensuring flexibility, standardization, observability, and cost efficiency.

Seldon Core 2 Key Differentiators

Flexibility: Real-time, your way

Seldon Core 2 offers a platform-agnostic, flexible framework for seamless deployment of different types of ML models across on-premise, cloud, and hybrid environments. Its adaptive architecture enables customizable applications, future-proofing MLOps or LLMOps by scaling deployments as data and applications evolve. The modular design enhances resource-efficiency, allowing dynamic scaling, component reuse, and optimized resource allocation. This ensures long-term scalability, operational efficiency, and adaptability to changing demands.

Standardization: Consistency across workflows

Seldon Core 2 enforces best practices for ML deployment, ensuring consistency, reliability, and efficiency across the entire lifecycle. By automating key deployment steps, it removes operational bottlenecks, enabling faster rollouts and allowing teams to focus on high-value tasks.

With a "learn once, deploy anywhere" approach, Seldon Core 2 standardizes model deployment across on-premise, cloud, and hybrid environments, reducing risk and improving productivity. Its unified execution framework supports conventional, foundational, and LLM models, streamlining deployment and enabling seamless scalability. It also enhances collaboration between MLOps Engineers, Data Scientists, and Software Engineers by providing a customizable framework that fosters knowledge sharing, innovation, and the adoption of new data science capabilities.

Enhanced Observability

Observability in Seldon Core 2 enables real-time monitoring, analysis, and performance tracking of ML systems, covering data pipelines, models, and deployment environments. Its customizable framework combines operational and data science monitoring, ensuring teams have the key metrics needed for maintenance and decision-making.

Seldon simplifies operational monitoring, allowing real-time ML or LLM deployments to expand across organizations while supporting complex, mission-critical use cases. A data-centric approach ensures all prediction data is auditable, maintaining explainability, compliance, and trust in AI-driven decisions.

Optimization: Modularity to maximize resource efficiency

Seldon Core 2 is built for scalability, efficiency, and cost-effective ML operations, enabling you to deploy only the necessary components while maintaining agility and high performance. Its modular architecture ensures that resources are optimized, infrastructure is consolidated, and deployments remain adaptable to evolving business needs.

Scaling for Efficiency: scales infrastructure dynamically based on demand, auto-scaling for real-time use cases while scaling to zero for on-demand workloads and preserving deployment state for seamless reactivation. By eliminating redundancy and optimizing deployments, it balances cost efficiency and performance, ensuring reliable inference at any scale.

Consolidated Serving Infrastructure: maximizes resource utilization with multi-model serving (MMS) and overcommit, reducing compute overhead while ensuring efficient, reliable inference.

Extendability & Modular Adaptation: integrates with LLMs, Alibi, and other modules, enabling on-demand ML expansion. Its modular design ensures scalable AI, maximizing value extraction, agility, and cost efficiency.

Reusability for Cost Optimization: provides predictable, fixed pricing, enabling cost-effective scaling and innovation while ensuring financial transparency and flexibility.

Next Steps

  • Explore our other Solutions

  • Learn about the features of Seldon Core 2

  • Join our Slack Community for updates or for answers to any questions


Concepts

Learn how Seldon Core differs from a centralized orchestrator by adopting a data flow architecture, enabling more efficient ML model deployment through stream processing and improved data handling.

In the context of machine learning and Seldon Core 2, concepts provide a framework for understanding key functionalities, architectures, and workflows within the system. Some of the key concepts in Seldon Core 2 are:

  • Data-Centric MLOps

  • Open Inference Protocol

  • Components

Data-Centric MLOps

Data-centricity is an approach that puts the management, integrity, and flow of data at the core of machine learning deployment. Rather than focusing solely on models, this approach ensures that data quality, consistency, and adaptability drive successful ML operations. In Seldon Core 2, data-centricity is embedded in every stage of the inference workflow.

Why Data-Centricity Matters

By adopting a data-centric approach, Seldon Core 2 enables:

  • More reliable and high-quality predictions by ensuring clean, well-structured data.

  • Scalable and future-proof ML deployments through standardized data management.

  • Efficient monitoring and maintenance, reducing risks related to model drift and inconsistencies.

With data-centricity as a core principle, Seldon Core 2 ensures end-to-end control over ML workflows, enabling you to maximize model performance and reliability in production environments.

Open Inference Protocol in Seldon Core 2

The Open Inference Protocol (OIP) defines a standard way for inference servers and clients to communicate in Seldon Core 2. Its goal is to enable interoperability, flexibility, and consistency across different model-serving runtimes. It exposes the Health API, Metadata API, and the Inference API.

Some of the features of the Open Inference Protocol include:

  • Transport Agnostic: Servers can implement HTTP/REST or gRPC protocols.

  • Runtime Awareness: Use the protocolVersion field in your runtime YAML or consult the supported runtimes table to verify compatibility.

By adopting OIP, Seldon Core 2 promotes a consistent experience across a diverse set of model deployments.

Components in Seldon Core 2

Components are the building blocks of an inference graph, processing data at various stages of the ML inference pipeline. They provide reusable, standardized interfaces, making it easier to maintain and update workflows without disrupting the entire system. Components include ML models, data processors, routers, and supplementary services.

Types of Components

By modularizing inference workflows, components allow you to scale, experiment, and optimize ML deployments efficiently while ensuring data consistency and reliability.

Pipelines

In a model serving platform like Seldon Core 2, a pipeline refers to an orchestrated sequence of models and components that work together to serve more complex AI applications. These pipelines allow you to connect models, routers, transformers, and other inference components, with Kafka used to stream data between them. Each component in the pipeline is modular and independently managed—meaning it can be scaled, configured, and updated separately based on its specific input and output requirements.

This use of "pipeline" is distinct from how the term is used in MLOps for CI/CD pipelines, which automate workflows for building, testing, and deploying models. In contrast, Core 2 pipelines operate at runtime and focus on the live composition and orchestration of inference systems in production.

Servers

In Core 2, servers are responsible for hosting and serving machine learning models, handling inference requests, and ensuring scalability, efficiency, and observability in production. Core 2 supports multiple inference servers, including MLServer and NVIDIA Triton Inference Server, enabling flexible and optimized model deployments.

MLServer: A lightweight, extensible inference server designed to work with multiple ML frameworks, including scikit-learn, XGBoost, TensorFlow, and PyTorch. It supports custom Python models and integrates well with MLOps workflows. It is built for general-purpose model serving, custom model wrappers, and multi-framework support.
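As a sketch, a Server resource that provisions an MLServer-based inference server might look like the following (the server name is a hypothetical placeholder, and the spec assumes the default mlserver ServerConfig installed with Core 2):

# Sketch: a Server resource backed by the default MLServer ServerConfig
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-extra
spec:
  serverConfig: mlserver
  replicas: 1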

Experiments

In Seldon Core 2, experiments enable controlled A/B testing, model comparisons, and performance evaluations by defining an HTTP traffic split between different models or inference pipelines. This allows organizations to test multiple versions of a model in production while managing risk and ensuring continuous improvements.

Some of the advantages of using Experiments:

  • A/B Testing & Model comparison: Compare different models under real-world conditions without full deployment.

  • Risk-Free model validation: Test a new model or pipeline in parallel without affecting live predictions.

  • Performance monitoring: Assess latency, accuracy, and reliability before a full rollout.

  • Continuous improvement: Make data-driven deployment decisions based on real-time model performance.

Architecture

Learn about the Seldon Core 2 microservice architecture, control plane and data plane components, and how these services work to provide a scalable, fault-tolerant ML model serving and management.

Seldon Core 2 uses a microservice architecture where each service has limited and well-defined responsibilities working together to orchestrate scalable and fault-tolerant ML serving and management. These components communicate internally using gRPC and they can be scaled independently. Seldon Core 2 services can be split into two categories:

  • Control Plane services are responsible for managing the operations and configurations of your ML models and workflows. This includes functionality to instantiate new inference servers, load models, update new versions of models, configure model experiments and pipelines, and expose endpoints that may receive inference requests. The main control plane component is the Scheduler that is responsible for managing the loading and unloading of resources (models, pipelines, experiments) onto the respective components.

  • Data Plane services are responsible for managing the flow of data between components or models. Core 2 supports REST and gRPC payloads that follow the Open Inference Protocol (OIP). The main data plane service is Envoy, which acts as a single ingress for all data plane load and routes data to the relevant servers internally (e.g. Seldon MLServer or NVIDIA Triton pods).

Note: Because Core 2 architecture separates control plane and data plane responsibilities, when control plane services are down (e.g. the Scheduler), data plane inference can still be served. In this manner the system is more resilient to failures. For example, an outage of control plane services does not impact the ability of the system to respond to end user traffic. Core 2 can be provisioned to be highly available on the data plane path.

The current set of services used in Seldon Core 2 is shown below. Following the diagram, we will describe each control plane and data plane service.

Control Plane

Scheduler

This service manages the loading and unloading of Models, Pipelines and Experiments on the relevant micro services. It is also responsible for matching Models with available Servers in a way that optimises infrastructure use. In the current design we can only have one instance of the Scheduler as its internal state is persisted on disk.

When the Scheduler (re)starts there is a synchronisation flow to coordinate the startup process and to attempt to wait for expected Model Servers to connect before proceeding with control plane operations. This is important so that ongoing data plane operations are not interrupted. This introduces a delay on any control plane operations until the process has finished (including control plane resource status updates). This synchronisation process has a timeout, which has a default of 10 minutes. It can be changed by setting the scheduler.schedulerReadyTimeoutSeconds value in the seldon-core-v2-components Helm chart.

Agent

This service manages the loading and unloading of models on a server and access to the server over REST/gRPC. It acts as a reverse proxy to connect end users with the actual Model Servers. In this way the system collects stats and metrics about data plane inferences that helps with observability and scaling.

Controller

We also provide a Kubernetes Operator to allow Kubernetes usage. This is implemented in the Controller Manager microservice, which manages CRD reconciliation with Scheduler. Currently Core 2 supports one instance of the Controller.

Note: All services besides the Controller are Kubernetes agnostic and can run locally, e.g. on Docker Compose.

Data Plane

Pipeline Gateway

This service handles REST/gRPC calls to Pipelines. It translates synchronous requests into Kafka operations, producing a message on the relevant input topic for a Pipeline and consuming from the output topic to return inference results back to the users.

Model Gateway

This service handles the flow of data from models to inference requests on servers and passes on the responses via Kafka.

Dataflow Engine

This service handles the flow of data between components in a pipeline, using Kafka Streams. It enables Core 2 to chain and join Models together to provide complex Pipelines.

Envoy

Envoy acts as the request proxy for Seldon Core 2, routing traffic to the appropriate inference servers and handling load balancing across available replicas. Envoy configuration of Seldon Core 2 uses weighted least-request load balancing, which dynamically distributes traffic based on both replica weights and current load, helping ensure efficient and stable request routing.

Dataflow Architecture and Pipelines

To support the movement towards data-centric machine learning, Seldon Core 2 follows a dataflow paradigm. By taking a decentralized route that focuses on the flow of data, users can have more flexibility and insight as they build and manage complex AI applications in production. This contrasts with more centralized orchestration approaches where data is secondary.

Kafka

Kafka is used as the backbone for Pipelines allowing decentralized, synchronous and asynchronous usage. This enables Models to be connected together into arbitrary directed acyclic graphs. Models can be reused in different Pipelines. The flow of data between models is handled by the dataflow engine using Kafka Streams.

By focusing on the data we allow users to join various flows together using stream joining concepts as shown below.

We support several types of joins:

  • inner joins, where all inputs need to be present for a transaction to join the tensors passed through the Pipeline;

  • outer joins, where only a subset needs to be available during the join window;

  • triggers, in which data flows need to wait until records on one or more trigger data flows appear. The data in these triggers is not passed onwards from the join.

These techniques allow users to create complex pipeline flows of data between machine learning components.

More discussion on the data flow view of machine learning and its effect on v2 design can be found here.

Key principles:

  • Flexible Workflows: Core 2 supports adaptable and scalable data pathways, accommodating various use cases and experiments. This ensures ML pipelines remain agile, allowing you to evolve the inference logic as requirements change.

  • Real-Time Data Streaming: Integrated data streaming capabilities allow you to view, store, manage, and process data in real time. This enhances responsiveness and decision-making, ensuring models work with the most up-to-date data for accurate predictions.

  • Standardized Processing: Core 2 promotes reusable and consistent data transformation and routing mechanisms. Standardized processing ensures data integrity and uniformity across applications, reducing errors and inconsistencies.

  • Comprehensive Monitoring: Detailed metrics and logs provide real-time visibility into data integrity, transformations, and flow. This enables effective oversight and maintenance, allowing teams to detect anomalies, drifts, or inefficiencies early.

Component types:

  • Data Processors: Transform, filter, or aggregate data to ensure consistent, repeatable pre-processing.

  • Data Routers: Dynamically route data to different paths based on predefined rules for A/B testing, experimentation, or conditional logic.

  • Models: Perform inference tasks, including classification, regression, and Large Language Models (LLMs), hosted internally or via external APIs.

  • Supplementary Data Services: External services like vector databases that enable models to access embeddings and extended functionality.

  • Drift/Outlier Detectors & Explainers: Monitor model predictions for drift, anomalies, and explainability insights, ensuring transparency and performance tracking.

Experiment types:

  • Traffic Splitting: Distributes inference requests across different models or pipelines based on predefined percentage splits. This enables A/B testing and comparison of multiple model versions. For example, Canary deployment of the models.

  • Mirror Testing: Sends a percentage of the traffic to a mirror model or pipeline without affecting the response returned to users. This allows evaluation of new models without impacting production workflows. For example, Shadow deployment of the models.


rClone

Learn how to configure and use Rclone for model artifact storage in Seldon Core, including cloud storage integration and authentication.

We utilize Rclone to copy model artifacts from a storage location to the model servers. This allows users to take advantage of Rclone's support for over 40 cloud storage backends, including Amazon S3, Google Storage, and many others.

For authorization needed for cloud storage when running on Kubernetes see here.

Advanced Configurations

Data Science Monitoring

Installation Overview

Learn how to install Seldon Core 2 in various environments - from local development with Docker Compose to production-grade Kubernetes clusters.

Seldon Core 2 can be installed in various setups to suit different stages of the development lifecycle. The most common modes include:

Local Environment

Ideal for development and testing purposes, a local setup allows for quick iteration and experimentation with minimal overhead. Common tools include:

  • Docker Compose: Simplifies deployment by orchestrating Seldon Core components and dependencies in Docker containers. Suitable for environments without Kubernetes, providing a lightweight alternative.

  • Kind (Kubernetes IN Docker): Runs a Kubernetes cluster inside Docker, offering a realistic testing environment. Ideal for experimenting with Kubernetes-native features.

Production Environment

Designed for high-availability and scalable deployments, a production setup ensures security, reliability, and resource efficiency. Typical tools and setups include:

  • Managed Kubernetes Clusters: Platforms like GKE (Google Kubernetes Engine), EKS (Amazon Elastic Kubernetes Service), and AKS (Azure Kubernetes Service) provide managed Kubernetes solutions. Suitable for enterprises requiring scalability and cloud integration.

  • On-Premises Kubernetes Clusters: For organizations with strict compliance or data sovereignty requirements. Can be deployed on platforms like OpenShift or custom Kubernetes setups.

By selecting the appropriate installation mode—whether it's Docker Compose for simplicity, Kind for local Kubernetes experimentation, or production-grade Kubernetes for scalability—you can effectively leverage Seldon Core 2 to meet your specific needs.

Helm Charts

For more information, see the published Helm charts.

For the description of (some) values that can be configured for these charts, see the Helm parameters section.

  • seldon-core-v2-crds: Cluster-wide installation of custom resources.

  • seldon-core-v2-setup: Installation of the manager to manage resources in the namespace or cluster-wide. This also installs default SeldonConfig and ServerConfig resources, allowing Runtimes and Servers to be installed on demand.

  • seldon-core-v2-runtime: Installs a SeldonRuntime custom resource that creates the core components in a namespace.

  • seldon-core-v2-servers: Installs Server custom resources providing example core servers to load models.

Seldon Core 2 Dependencies

Here is a list of components that Seldon Core 2 requires, along with the minimum and maximum supported versions:

  • Kubernetes: minimum 1.27, maximum 1.33.0 (Required)

  • Envoy*: minimum 1.32.2, maximum 1.32.2 (Required; included in Core 2 installation)

  • Rclone*: minimum 1.68.2, maximum 1.69.0 (Required; included in Core 2 installation)

  • Kafka: minimum 3.4, maximum 3.8 (Recommended; only required for operating Seldon Core 2 dataflow Pipelines)

  • Prometheus: minimum 2.0, maximum 2.x (Optional)

  • Grafana: minimum 10.0, maximum *** (Optional)

  • Prometheus-adapter: minimum 0.12, maximum 0.12 (Optional)

  • OpenTelemetry Collector: minimum 0.68, maximum *** (Optional)

Notes:

  • Envoy and Rclone: These components are included as part of the Seldon Core 2 Docker images. You are not required to install them separately but must be aware of the configuration options supported by these versions.

  • Kafka: Only required for operating Seldon Core 2 dataflow Pipelines. If not needed, you should avoid installing seldon-modelgateway, seldon-pipelinegateway, and seldon-dataflow-engine.

  • Maximum Versions: a maximum version marked with *** indicates that there is no hard limit on the version that can be used.

Get started

Kubernetes Resources

For Kubernetes usage we provide a set of custom resources for interacting with Seldon.

  • SeldonRuntime - for installing Seldon in a particular namespace.

  • Servers - for deploying sets of replicas of core inference servers (MLServer or Triton).

  • Models - for deploying single machine learning models, custom transformation logic, drift detectors, outlier detectors and explainers.

Seldon Core Autoscaling

Seldon Core provides native autoscaling features for both Models and Servers, enabling automatic scaling based on inference load. The diagram below depicts an autoscaling implementation that uses both Model and Server autoscaling features native to Seldon Core (i.e. this implementation doesn't leverage HPA for autoscaling, an approach we cover separately).

Model Autoscaling

Models can automatically scale their replicas based on load. Enable it by setting minReplicas or maxReplicas in your model spec, as in the sketch below. For more detail, see the documentation on model autoscaling.
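A minimal sketch, reusing the sample iris model shown elsewhere in these docs and assuming native model autoscaling is enabled in your installation:

# Sketch: autoscaling bounds on the sample iris model
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  minReplicas: 1
  maxReplicas: 3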

Experiments - for testing new versions of models.

  • Pipelines - for connecting together flows of data between models.

  • Advanced Customization

    SeldonConfig and ServerConfig define the core installation configuration and machine learning inference server configuration for Seldon. Normally, you would not need to customize these but this may be required for your particular custom installation within your organisation.

    • ServerConfigs - for defining new types of inference server that can be referenced by a Server resource.

    • SeldonConfig - for defining how Seldon is installed.



    • Learning environment: Install Seldon Core 2 in Docker Compose, or Kind.

    • Production environment: Install Seldon Core 2 in a Managed Kubernetes cluster, or On-Premises Kubernetes cluster.

    Server Autoscaling

    Server autoscaling automatically scales Servers based on Model needs. This implementation supports scaling in a Multi-Model Serving setup where multiple models are hosted on shared inference servers. For more detail on this setup see here


    Scaling Seldon Services

    This page provides guidance about scaling Seldon Core 2 services

    Seldon Core 2 runs with several control and dataplane components. The scaling of these resources is discussed below:

    • Pipeline gateway: The pipeline gateway handles REST and gRPC synchronous requests to Pipelines. It is stateless and can be scaled based on traffic demand.

    • Model gateway: This component pulls model requests from Kafka and sends them to inference servers. It can be scaled up to the partition factor of your Kafka topics. At present we set a uniform partition factor for all topics in one installation of Seldon.

    • Dataflow engine: The dataflow engine runs KStream topologies to manage Pipelines. It can run as multiple replicas and the scheduler will balance Pipelines to run across it with a consistent hashing load balancer. Each Pipeline is managed up to the partition factor of Kafka (presently hardwired to one). We recommend using as many replicas of dataflow-engine as you have Kafka partitions in order to leverage the balanced distribution of inference traffic using hashing.

    • Scheduler: The scheduler manages the control plane operations. It is presently required to be one replica as it maintains internal state within a BadgerDB held on local persistent storage (stateful set in Kubernetes). Performance tests have shown this not to be a bottleneck at present.

    • Kubernetes Controller: The Kubernetes controller manages resource updates on the cluster, which it passes on to the Scheduler. It is by default one replica but has the ability to scale.

    • Envoy: Envoy replicas get their state from the scheduler for routing information and can be scaled as needed.


    Learning Environment

    Install Seldon Core 2 in a local learning environment.

    You can install Seldon Core 2 on your local computer that is running a Kubernetes cluster using kind.

    Seldon publishes the Helm charts that are required to install Seldon Core 2. For more information about the Helm charts and the related dependencies, see Helm charts and Dependencies.

    Note: These instructions guide you through installing Seldon Core 2 on a local Kubernetes cluster, focusing on ease of learning. Ensure your kind cluster is running on hardware with at least 32GB of RAM and a load balancer such as MetalLB is configured.

    Prerequisites

    • Install a Kubernetes cluster that is running version 1.27 or later.

    • Install kubectl, the Kubernetes command-line tool.

    • Install Helm, the package manager for Kubernetes, or Ansible, the automation tool used for provisioning, configuration management, and application deployment.

    Note: Ansible automates provisioning, configuration management, and handles all dependencies required for Seldon Core 2. With Helm, you need to configure and manage the dependencies yourself.

    Installing Seldon Core 2

    1. Create a namespace to contain the main components of Seldon Core 2. For example, create the seldon-mesh namespace.

    2. Add and update the Helm charts, seldon-charts, to the repository.

    3. Install Custom resource definitions for Seldon Core 2.

    Next Steps

    If you installed Seldon Core 2 using Helm, you need to complete the installation of other components in the following order:

    1. Integrating with Kafka

    2. Installing a Service mesh

    Multi-Model Serving

    Learn how to configure multi-model serving in Seldon Core, including resource optimization and model co-location.

    Multi-model Serving

    Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. This means that, within a single instance of the server, you can serve multiple models under different paths. This is a feature provided out of the box by Nvidia Triton and Seldon MLServer, currently the two inference servers that are integrated in Seldon Core 2.

    This deployment pattern allows the system to handle a large number of deployed models by letting them share hardware resources allocated to inference servers (e.g. GPUs). For example, if a single model inference server is deployed on a one-GPU node, the models loaded on this inference server instance are able to effectively share this GPU. This is in contrast to a single-model-per-server deployment pattern, where only one model can use the allocated GPU.

    Models

    Models provide the atomic building blocks of Seldon. They represent machine learning models, drift detectors, outlier detectors, explainers, feature transformations, and more complex routing models such as multi-armed bandits.

    • Seldon can handle a wide range of inference artifacts.

    • Artifacts can be stored on any of the 40 or more supported cloud storage technologies, as well as in a local (mounted) folder, as discussed here.

    Parameterized Models

    The Model specification allows parameters to be passed to the loaded model to allow customization. For example:

    This capability is only available for MLServer custom model runtimes. The named keys and values will be added to the model-settings.json file for the provided model in the parameters.extra Dict. MLServer models are able to read these values in their load method.

    Example Parameterized Models

    Operational Monitoring

    Seldon Core 2 provides robust tools for tracking the performance and health of machine learning models in production.

    Monitoring

    • Real-Time metrics: collects and displays real-time metrics from deployed models, such as response times, error rates, and resource usage.

    Performance Tuning

    In MLOps, system performance can be defined as how efficiently and effectively an ML model or application operates in a production environment. It is typically measured across several key dimensions, such as latency, throughput, scalability, and resource-efficiency. These factors are deeply connected: changes in configuration often result in tradeoffs, and should be carefully considered. Specifically, latency, throughput and resource usage can all impact each other, and the approach to optimizing system performance depends on the desired balance of these outcomes in order to ensure a positive end-user experience, while also minimising infrastructure costs.

    High Level Approach

    There are many different levers that can be considered to tune performance for ML systems deployed with Core 2, across infrastructure, models, inference execution, and the related configurations exposed by Core 2. When reasoning about the performance of an ML-based system deployed using Seldon, we recommend breaking down the problem by first understanding and tuning the performance of deployed Models, and then subsequently considering more complex Pipelines composed of those models (if applicable). For both models and pipelines, it is important to run tests to understand baseline performance characteristics, before making efforts to tune these through changes to models, infrastructure, or inference configurations. The recommended approach can be broken down as follows:

    Using HPA for Autoscaling

    Overview of Horizontal Pod Autoscaler (HPA) scaling options in Seldon Core 2

    Given that Seldon Core 2 is predominantly for serving ML in Kubernetes, it is possible to leverage the HorizontalPodAutoscaler (HPA) to define scaling logic that automatically scales Kubernetes resources up and down. HPA targets Kubernetes or custom metrics to trigger scale-up or scale-down events for specified resources. Using HPA is recommended if custom scaling metrics are required. These would be exposed using Prometheus, the Prometheus Adapter, or similar tools for exposing metrics to HPA. If these tools cause conflicts, the autoscaling functionality native to Core 2, which does not require exposing custom metrics, is recommended.

    Seldon Core 2 provides two main approaches to leveraging Kubernetes Horizontal Pod Autoscaler (HPA) for autoscaling. It is important to remember that since in Core 2 Models and Servers are separate, autoscaling of both Models and Servers, in a coordinated way, needs to be accounted for when implementing autoscaling. In order to implement either approach, metrics first need to be exposed. This is explained in the HPA Setup guide, which covers the fundamental requirements and configuration needed to enable HPA-based scaling in Seldon Core 2.

  • Model performance tracking: monitors key performance indicators (KPIs) like accuracy, drift detection, and model degradation over time.

  • Custom metrics: allows you to define and track custom metrics specific to your models and use cases.

  • Visualization: provides dashboards and visualizations to easily observe the status and performance of models.

There are two kinds of metrics present in Seldon Core 2 that you can monitor:

  • operational metrics

  • usage metrics

    Operational metrics describe the performance of components in the system. Some examples of common operational considerations are memory consumption and CPU usage, request latency and throughput, and cache utilisation rates. Generally speaking, these are the metrics system administrators, operations teams, and engineers will be interested in.

    Usage metrics describe the system at a higher and less dynamic level. Some examples include the number of deployed servers and models, and component versions. These are not typically metrics that engineers need insight into, but may be relevant to platform providers and operations teams.

    Multi-model serving is enabled by design in Seldon Core 2. Based on requirements that are specified by the user on a given Model, the Scheduler will find an appropriate model inference server instance to load the model onto.

    In the below example, given that the model is tensorflow, the system will deploy the model onto a triton server instance (matching with the Server labels). Additionally, as the model memory requirement is 100Ki, the system will pick the server instance that has enough (memory) capacity to host this model in parallel to other potentially existing models.

    All models are loaded and active on this model server. Inference requests for these models are served concurrently and the hardware resources are shared to fulfil these inference requests.

    Overcommit

    Overcommit allows shared servers to handle more models than can fit in memory. This is done by keeping highly utilized models in memory and evicting other ones to disk using a least-recently-used (LRU) cache mechanism. From a user perspective these models are all registered and "ready" to serve inference requests. If an inference request comes for a model that is unloaded/evicted to disk, the system will reload the model first before forwarding the request to the inference server.

    Overcommit is enabled by setting SELDON_OVERCOMMIT_PERCENTAGE on shared servers; it is set by default at 10%. In other words, a given model inference server instance can register models with a total memory requirement of up to MEMORY_REQUEST * (1 + SELDON_OVERCOMMIT_PERCENTAGE / 100). For example, with a memory request of 1Gi and the default 10% overcommit, up to 1.1Gi worth of model memory requirements can be registered on that server instance.

    The Seldon Agent (a sidecar next to each model inference server deployment) keeps track of inference request times for the different models. Models are sorted in ascending order of last-use time, and this data structure is used to evict the least recently used model in order to make room for another incoming model. This happens in two scenarios:

    • A new model load request beyond the active memory capacity of the inference server.

    • An incoming inference request to a registered model that is not loaded in-memory (previously evicted).

    This is done seamlessly for users. Specifically, when reloading a model onto the inference server to respond to an inference request, the model artifact is cached on disk, which allows a faster reload (no remote artifact fetch). Therefore we expect that the extra latency to reload a model during an inference request is acceptable in many cases (with a lower bound of ~100ms).

    Overcommit can be disabled by setting SELDON_OVERCOMMIT_PERCENTAGE to 0 for a given shared server.

    Note: Currently we are using memory requirement values that are specified by the user on the Server and Model side. In the future we are looking at how to make the system automatically handle memory management.

    Check this notebook for a local example.

    Kubernetes Example

    A Kubernetes yaml example is shown below for a SKLearn model for iris classification:

    Its Kubernetes spec has two core requirements:

    • A storageUri specifying the location of the artifact. This can be any rclone URI specification.

    • A requirements list which provides tags that need to be matched by the Server that can run this artifact type. By default when you install Seldon we provide a set of Servers that cover a range of artifact types.

    GRPC Example

    You can also load models directly over the scheduler gRPC service. An example using the grpcurl tool is shown below:

    The proto buffer definitions for the scheduler are outlined here.

    Multi-model Serving with Overcommit

    Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. It is a feature provided out of the box by Nvidia Triton and Seldon MLServer. Multi-model serving reduces infrastructure hardware requirements (e.g. expensive GPUs) which enables the deployment of a large number of models while making it efficient to operate the system at scale.

    Seldon Core 2 leverages multi-model serving by design and it is the default option for deploying models. The system will find an appropriate server to load the model onto based on requirements that the user defines in the Model deployment definition.

    Moreover, in many cases demand patterns allow for further Overcommit of resources. Seldon Core 2 is able to register more models than can be served by the provisioned (memory) infrastructure and will swap models dynamically, evicting the least recently used, without adding significant latency overheads to inference workloads.

    See Multi-model serving for more information.

    Autoscaling of Models

    See here for discussion of autoscaling of models.

    Scheduling of Models onto Servers

    See here for details on how Core 2 schedules Models onto Servers.

    # samples/models/tfsimple1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    # samples/models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    !grpcurl -d '{"model":{ \
                  "meta":{"name":"iris"},\
                  "modelSpec":{"uri":"gs://seldon-models/mlserver/iris",\
                               "requirements":["sklearn"],\
                               "memoryBytes":500},\
                  "deploymentSpec":{"replicas":1}}}' \
             -plaintext \
             -import-path ../../apis \
             -proto apis/mlops/scheduler/scheduler.proto  0.0.0.0:9004 seldon.mlops.scheduler.Scheduler/LoadModel
    Install Seldon Core 2 operator in the seldon-mesh namespace.

    This configuration installs the Seldon Core 2 operator across an entire Kubernetes cluster. To perform cluster-wide operations, create ClusterRoles and ensure your user has the necessary permissions during deployment. With cluster-wide operations, you can create SeldonRuntimes in any namespace.

    You can configure the installation to deploy the Seldon Core 2 operator in a specific namespace so that it controls resources in the provided namespace. To do this, set controller.clusterwide to false.

  • Install Seldon Core 2 runtimes in the seldon-mesh namespace.

    helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
    --namespace seldon-mesh \
    --install
  • Install Seldon Core 2 servers in the seldon-mesh namespace. Two example servers, named mlserver-0 and triton-0, are installed so that you can load models onto these servers after installation.

     helm upgrade seldon-core-v2-servers seldon-charts/seldon-core-v2-servers \
     --namespace seldon-mesh \
     --install
  • Check that the Seldon Core 2 operator, runtimes, servers, and CRDs are installed in the seldon-mesh namespace. It might take a couple of minutes for all the Pods to be ready. To check the status of the Pods in real time use this command: kubectl get pods -w -n seldon-mesh.

     kubectl get pods -n seldon-mesh

    The output should be similar to this:

    NAME                                            READY   STATUS             RESTARTS      AGE
    hodometer-749d7c6875-4d4vw                      1/1     Running            0             4m33s
    mlserver-0                                      3/3     Running            0             4m10s
    seldon-dataflow-engine-7b98c76d67-v2ztq         0/1     CrashLoopBackOff   5 (49s ago)   4m33s
    seldon-envoy-bb99f6c6b-4mpjd                    1/1     Running            0             4m33s
    seldon-modelgateway-5c76c7695b-bhfj5            1/1     Running            0             4m34s
    seldon-pipelinegateway-584c7d95c-bs8c9          1/1     Running            0             4m34s
    seldon-scheduler-0                              1/1     Running            0             4m34s
    seldon-v2-controller-manager-5dd676c7b7-xq5sm   1/1     Running            0             4m52s
    triton-0                                        2/3     Running            0             4m10s
  • Note: Pods with names starting with seldon-dataflow-engine, seldon-pipelinegateway, and seldon-modelgateway may generate log errors until they successfully connect to Kafka. This occurs because Kafka has not yet been integrated with Seldon Core 2.

    Note: For more information about configurations, see the supported versions of Python libraries, and customization options

    You can install Seldon Core 2 and its components using Ansible in one of the following methods:

    • Single command

    • Multiple commands

    Single command

    To install Seldon Core 2 into a new local kind Kubernetes cluster, you can use the seldon-all playbook with a single command:

    This creates a kind cluster and installs ecosystem dependencies such as Kafka, Prometheus, OpenTelemetry, and Jaeger, as well as all the Seldon-specific components. The Seldon components are installed using Helm charts from the current git checkout (../k8s/helm-charts/).

    Internally this runs, in order, the following playbooks:

    • kind-cluster.yaml

    • setup-ecosystem.yaml

    • setup-seldon.yaml

    You may pass any of the additional variables which are configurable for those playbooks to seldon-all.

    For example:

    Running the playbooks individually gives you more control over what and when it runs. For example, if you want to install into an existing k8s cluster.

    Multiple commands

    1. Create a kind cluster.

    2. Setup ecosystem.

      Seldon runs by default in the seldon-mesh namespace and a Jaeger pod and OpenTelemetry collector are installed in the namespace. To install in a different <mynamespace> namespace:

    3. Install Seldon Core 2 in the ansible/ folder.

    kubectl create ns seldon-mesh || echo "Namespace seldon-mesh already exists"
    helm repo add seldon-charts https://seldonio.github.io/helm-charts/
    helm repo update seldon-charts
    helm upgrade seldon-core-v2-crds seldon-charts/seldon-core-v2-crds \
    --namespace default \
    --install 
     helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
     --namespace seldon-mesh --set controller.clusterwide=true \
     --install

    Models

    1. Load testing to understand latency and throughput behaviour for one model replica

    2. Tuning performance for models. This can be done via changes to:

      1. Infrastructure - choosing the right hardware, and configurations in Core related to CPUs, GPUs and memory.

      2. Models - optimizing model artefacts in how they are structured, configured, stored. This can include model pruning, quantization, consideration of different model frameworks, and making sure that the model can achieve a high utilisation of the allocated resources.

      3. Inference execution - the way in which inference is executed. This can include the choice of communication protocols (REST, gRPC), payload configuration, batching, and efficient execution of concurrent requests.

    Pipelines

    1. Testing Pipelines to identify the critical path based on performance of underlying models

    2. Core 2 Configuration to optimize data-processing through pipelines

    3. Scalability of Pipelines to understand how Core 2 components scale with the number of deployed pipelines and models

    1. Model Autoscaling with HPA

    The Model Autoscaling with HPA approach enables users to scale Models based on custom metrics. This approach, along with Server Autoscaling, enables users to customize the scaling logic for models, and automate the scaling of Servers based on the needs of the Models hosted on them.

    Model Autoscaling with HPA, Servers autoscaled by Core 2
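    The sketch below illustrates this approach under the assumption that a custom metric (here named infer_rps, a hypothetical name) has already been exposed to the Kubernetes custom metrics API, for example via Prometheus and prometheus-adapter, and that the Model resource exposes a scale subresource:

    # Sketch: HPA targeting a Model via a custom metric exposed through prometheus-adapter
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: iris-model-hpa           # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: iris
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Object
        object:
          metric:
            name: infer_rps          # hypothetical custom metric
          describedObject:
            apiVersion: mlops.seldon.io/v1alpha1
            kind: Model
            name: iris
          target:
            type: AverageValue
            averageValue: "5"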

    2. Model and Server Autoscaling with HPA

    The Model and Server Autoscaling with HPA approach leverages HPA to autoscale Models and Servers in a coordinated way. This requires a 1-1 mapping of Models and Servers (no Multi-Model Serving). In this case, HPA can be set up for a Model and its associated Server, targeting the same custom metric (this is possible for Kubernetes-native metrics).

    Model and Server autoscaling with HPA, for single-model serving


    Pandas Query

    Learn how to use PandasQuery for data transformation in Seldon Core, including query configuration and parameter handling.

    This model allows a Pandas query to be run in the input to select rows. An example is shown below:

    # samples/models/choice1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: choice-is-one
    spec:
      storageUri: "gs://seldon-models/scv2/examples/pandasquery"
      requirements:
      - mlserver
      - python
      parameters:
      - name: query
        value: "choice == 1"

    This invocation filters for rows where tensor A has the value 1.

    • The model also returns a tensor called status which indicates the operation run and whether it was a success. If no rows satisfy the query then just a status tensor output will be returned.

    • For further details, see Pandas query

    This model can be useful for conditional Pipelines. For example, you could have two invocations of this model, such as a choice-is-one model and a choice-is-two model, each with its own query.

    By including these in a Pipeline as follows we can define conditional routes:
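    A sketch of such a Pipeline is shown below, assuming the Pipeline resource supports per-step triggers and an "any" join on the output (treat the exact field names and the pipeline name as assumptions):

    # Sketch: conditional routing driven by the two query models
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: choice-routing           # hypothetical pipeline name
    spec:
      steps:
      - name: choice-is-one
      - name: mul10
        inputs:
        - choice-routing.inputs
        triggers:
        - choice-is-one
      - name: choice-is-two
      - name: add10
        inputs:
        - choice-routing.inputs
        triggers:
        - choice-is-two
      output:
        steps:
        - mul10
        - add10
        stepsJoin: any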

    Here the mul10 model will be called if the choice-is-one model succeeds and the add10 model will be called if the choice-is-two model succeeds.

    For more details, see the Pandas query example.

    Scheduling

    Learn how to configure model scheduling in Seldon Core, including resource allocation, scaling policies, and deployment strategies.

    Core 2 architecture is built around decoupling Model and Server CRs to allow for multi-model deployment, enabling multiple models to be loaded and served on one server replica or a single Pod. Multi-model serving allows for more efficient use of resources, see Multi Model Serving for more information.

    This architecture requires that Core 2 handles scheduling of models onto server pods natively. In particular, Core 2 implements different sorters and filters which are used to find the best Server that is able to host a given Model. We describe this process in the following section.

    Scheduling Process

    Overview

    The scheduling process in Core 2 identifies a suitable candidate server for a given model through a series of steps. These steps involve sorting and filtering servers primarily based on the following criteria:

    • Server has matching Capabilities with Model spec.requirements.

    • Server has enough replicas to load the desired spec.replicas of the Model.

    • Each replica of Server has enough available memory to load one replica of the model defined in spec.memory.

    After a suitable candidate server is identified for a given model, Core 2 attempts to load the model onto it. If no matching server is found, the model is marked as ScheduleFailed.

    This process is designed to be extensible, allowing for the addition of new filters in future versions to enhance scheduling decisions.

    Note: A specific Model can only be assigned to at most one Server and therefore this Server requires enough replicas to host all replicas of the Model.

    Partial Scheduling

    Core 2 (from 2.9) is able to do partial scheduling of Models. Partial scheduling is defined as the loading of enough replicas of the model above spec.minReplicas and up to the number of available Server replicas. This gives the user more flexibility in serving traffic while optimising infrastructure provisioning.

    To enable partial scheduling, spec.minReplicas needs to be defined, as it provides Core 2 with the minimum number of model replicas required for serving.

    Partial scheduling does not have an explicit state; instead, the model is marked as ModelAvailable and is ready to serve traffic. The status of the Model CR can be inspected, where DESIRED REPLICAS and AVAILABLE REPLICAS provide insight into the number of replicas currently loaded in Core 2. Based on this information, the following logic is applied:

• Fully Scheduled: READY is True and DESIRED REPLICAS is equal to AVAILABLE REPLICAS (STATUS is ModelAvailable)

• Partially Scheduled: READY is True and DESIRED REPLICAS is greater than AVAILABLE REPLICAS (STATUS is ModelAvailable)

• Not Scheduled: READY is False (STATUS is ScheduleFailed)
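As an illustration, the following sketch of a Model manifest (placeholder name and artifact) requests three replicas but allows partial scheduling down to one; kubectl get model then shows the DESIRED REPLICAS and AVAILABLE REPLICAS columns described above:

# sketch: partial scheduling bounds (illustrative)
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  replicas: 3        # desired replicas
  minReplicas: 1     # minimum acceptable replicas for partial scheduling

kubectl get model iris -n seldon-mesh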

    Managing Kafka Topics

    Learn how to manage Kafka topics in Seldon Core 2, including topic creation, configuration, and monitoring for model inference and event streaming.

    Model Kafka topics

A Model in Seldon Core 2 represents the fundamental unit for serving a machine learning artifact within a running server instance.

    If Kafka is installed in your cluster, Seldon Core automatically creates dedicated input and output topics for each model as it is loaded. These topics facilitate asynchronous messaging, enabling clients to send input messages and retrieve output responses independently and at a later time.

    By default, when a model is unloaded, the associated Kafka topics are preserved. This supports use cases like auditing, but can also lead to increased Kafka resource usage and unnecessary costs for workloads that don't require persistent topics.

    You can control this behavior by configuring the

    Production Environment

    Install Core 2 in a production Kubernetes environment.

    Prerequisites

    • Set up and connect to a Kubernetes cluster running version 1.27 or later. For instructions on connecting to your Kubernetes cluster, refer to the documentation provided by your cloud provider.

    • Install

    Examples

    Workflows

Seldon inference is built from atomic Model components. Models cover a wide range of artifacts, including:

    • Core machine learning models, e.g. a PyTorch model.

    Load Testing

Before looking to make changes to improve latency or throughput of your models, it is important to carry out load testing to understand the existing performance characteristics of your model(s) when deployed onto the chosen inference server (MLServer or Triton). The goal of load testing should be to understand the performance behavior of deployed models in saturated (i.e. at the maximum throughput a model replica can handle) and non-saturated regimes, then compare it with expected latency objectives.

    The results here will also inform the setup of autoscaling parameters, with the target being to run each replica with some margin below the saturation throughput (say, by 10-20%) in order to ensure that latency does not degrade, and that there is sufficient capacity to absorb load for the time it takes for new inference server replicas (and model replicas) to become available.

    When testing latency, it is recommended to track different percentiles of latency (e.g. p50, p90, p95, p99). Choose percentiles based on the needs of your application - higher percentiles will help understand the variability of performance across requests.

    Experiments

    An Experiment defines a traffic split between Models or Pipelines. This allows new versions of models and pipelines to be tested.

    An experiment spec has three sections:

    • candidates (required) : a set of candidate models to split traffic.

• default (optional) : an existing candidate whose endpoint should be modified to split traffic as defined by the candidates.

    Autoscaling

    Autoscaling in Seldon Core 2

    Seldon Core 2 provides multiple approaches to scaling your machine learning deployments, allowing you to optimize resource utilization and handle varying workloads efficiently. In Core 2, we separate out Models and Servers, and Servers can have multiple Models loaded on them (Multi-Model Serving). Given this, setting up autoscaling requires defining the logic by which you want to scale your Models and then configuring the autoscaling of Servers such that they autoscale in a coordinated way. The following steps can be followed to set up autoscaling based on specific requirements:

    1. Identify metrics that you want to scale Models on. There are a couple of different options here:

    Explainability

    Learn how to implement model explainability in Seldon Core using Alibi-Explain integration for black box model explanations and pipeline insights.

    Explainers are Model resources with some extra settings. They allow a range of explainers from the Alibi-Explain library to be run on MLServer.

An example Anchors explainer definition is shown below.

    The key additions are:

• type: This must be one of the explainer types supported by the Alibi Explain runtime in MLServer.

    • modelRef

A Server that already hosts the Model is preferred, to reduce flip-flops between different candidate servers.

    Inference
    # samples/models/choice1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: choice-is-one
    spec:
      storageUri: "gs://seldon-models/scv2/examples/pandasquery"
      requirements:
      - mlserver
      - python
      parameters:
      - name: query
        value: "choice == 1"
    notebook
To install in a different namespace, pass -e seldon_mesh_namespace=<mynamespace>.
dataflow section of the model specification. Alongside the required storageUri and requirements fields, you can optionally include the cleanTopicsOnDelete flag. This boolean setting determines whether the associated Kafka topics should be deleted when the model is unloaded:
    • When set to false (the default), the topics remain after the model is deleted.

    • When set to true, both the input and output topics are removed when the model is unloaded.

    Here is an example of a manifest file that enables topic cleanup on deletion:

    To inspect existing Kafka topics in your cluster, you can deploy a temporary Pod:

    After the Pod is running, you can access it and list topics with the following command:

    Deploying and verifying topic cleanup for a model

    Apply the model manifest with topic cleanup enabled:

    After deployment, you can list Kafka topics from within the kafka-busybox pod and confirm that input/output topics have been created:

    To delete the model:

    After deletion, list the topics again. You should see that the input and output topics have been successfully removed from Kafka:

    Pipeline Kafka topics

    Similar to models, when a Pipeline is deployed in Seldon Core 2, Kafka input and output topics are automatically created for it. These topics enable asynchronous processing across pipeline steps.

    As with models, the cleanTopicsOnDelete flag controls whether these topics are retained or removed when the pipeline is deleted:

    • By default, topics are retained after the pipeline is unloaded.

    • When cleanTopicsOnDelete is set to true, the input and output topics associated with the pipeline are deleted.

    Here is an example of a pipeline manifest that wraps the previously defined model and enables topic cleanup:

    Deploying and verifying topic cleanup for a pipeline

    Apply the pipeline manifest with topic cleanup enabled:

    After the pipeline is deployed, you can list the Kafka topics from inside the kafka-busybox pod to confirm that they have been created:

    To delete the pipeline, run:

    After deletion, list the Kafka topics again. You should observe that the pipeline's input and output topics have been removed:

    Note: Topics linked to models within a pipeline are not deleted unless those models are explicitly unloaded and their specifications have cleanTopicsOnDelete set to true.

• modelRef: The model name for black box explainers.
  • pipelineRef: The pipeline name for black box explainers.

  • Only one of modelRef and pipelineRef is allowed.

    Pipeline Explanations

Black box explainers can explain a Pipeline as well as a model. An example from the Huggingface sentiment demo is shown below.

    Examples

    • Tabular income classification model with Anchor Tabular black box model explainer

    • Huggingface Sentiment model with Anchor Text black box pipeline explainer

    • Anchor Text movies sentiment explainer

    supported Alibi Explainer types
    # samples/models/choice2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: choice-is-two
    spec:
      storageUri: "gs://seldon-models/scv2/examples/pandasquery"
      requirements:
      - mlserver
      - python
      parameters:
      - name: query
        value: "choice == 2"
    # samples/pipelines/choice.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: choice
    spec:
      steps:
      - name: choice-is-one
      - name: mul10
        inputs:
        - choice.inputs.INPUT
        triggers:
        - choice-is-one.outputs.choice
      - name: choice-is-two
      - name: add10
        inputs:
        - choice.inputs.INPUT
        triggers:
        - choice-is-two.outputs.choice
      output:
        steps:
        - mul10
        - add10
        stepsJoin: any
    ansible-playbook playbooks/seldon-all.yaml
    ansible-playbook playbooks/seldon-all.yaml -e seldon_mesh_namespace=my-seldon-mesh -e install_prometheus=no -e @playbooks/vars/set-custom-images.yaml
    ansible-playbook playbooks/kind-cluster.yaml
    ansible-playbook playbooks/setup-ecosystem.yaml
    ansible-playbook playbooks/setup-ecosystem.yaml -e seldon_mesh_namespace=<mynamespace>
    ansible-playbook playbooks/setup-seldon.yaml
    ansible-playbook playbooks/setup-seldon.yaml -e seldon_mesh_namespace=<mynamespace>
    # samples/models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      dataflow:
        cleanTopicsOnDelete: true
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
      requirements:
        - sklearn
      memory: 100Ki
    apiVersion: v1
    kind: Pod
    metadata:
      name: kafka-busybox
    spec:
      containers:
        - name: kafka-busybox
          image: apache/kafka:latest
          command: ["sleep", "3600"]
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
    kafka-busybox:/opt/kafka/bin$ ./kafka-topics.sh --list --bootstrap-server $SELDON_KAFKA_BOOTSTRAP_PORT_9092_TCP
    kubectl apply -f model.yaml -n seldon-mesh
    __consumer_offsets
    seldon.seldon-mesh.model.iris.inputs
    seldon.seldon-mesh.model.iris.outputs
    kubectl delete -f model.yaml -n seldon-mesh
    __consumer_offsets
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: iris-pipeline
    spec:
      dataflow:
        cleanTopicsOnDelete: true
      steps:
        - name: iris
      output:
        steps:
        - iris
    kubectl apply -f pipeline.yaml -n seldon-mesh
    __consumer_offsets
    seldon.seldon-mesh.errors.errors
    seldon.seldon-mesh.model.iris.inputs
    seldon.seldon-mesh.model.iris.outputs
    seldon.seldon-mesh.pipeline.iris-pipeline.inputs
    seldon.seldon-mesh.pipeline.iris-pipeline.outputs
    kubectl delete -f pipeline.yaml -n seldon-mesh
    __consumer_offsets
    seldon.seldon-mesh.errors.errors
    seldon.seldon-mesh.model.iris.inputs
    seldon.seldon-mesh.model.iris.outputs
    # samples/models/income-explainer.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-explainer
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/income-sklearn/anchor-explainer"
      explainer:
        type: anchor_tabular
        modelRef: income
    # samples/models/hf-sentiment-explainer.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: sentiment-explainer
    spec:
      storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
      explainer:
        type: anchor_text
        pipelineRef: sentiment-explain
    , the Kubernetes command-line tool.
  • Install Helm, the package manager for Kubernetes.

  • To use Seldon Core 2 in a production environment:

    1. Create namespaces

    2. Install Seldon Core 2

    Seldon publishes the Helm charts that are required to install Seldon Core 2. For more information about the Helm charts and the related dependencies, see Helm charts and Dependencies.

    Creating Namespaces

    • Create a namespace to contain the main components of Seldon Core 2. For example, create the namespace seldon-mesh:

    • Create a namespace to contain the components related to monitoring. For example, create the namespace seldon-monitoring:

    Installing Seldon Core 2

1. Add the seldon-charts Helm repository and update it.

    2. Install custom resource definitions for Seldon Core 2.

    3. Install Seldon Core 2 operator in the seldon-mesh namespace.

      This configuration installs the Seldon Core 2 operator across an entire Kubernetes cluster. To perform cluster-wide operations, create ClusterRoles and ensure your user has the necessary permissions during deployment. With cluster-wide operations, you can create SeldonRuntimes in any namespace.

      With cluster-wide installation, you can specify the namespaces to watch by setting controller.watchNamespaces to a comma-separated list of namespaces (e.g., {ns1, ns2}). This allows the Seldon Core 2 operator to monitor and manage resources in those namespaces.

      You can also install multiple operators in different namespaces, and configure them to watch a disjoint set of namespaces. For example, you can install two operators in op-ns1 and op-ns2, and configure them to watch ns1, ns2 and ns3, ns4, respectively, using the following commands:

      We now install the first operator in op-ns1 and configure it to watch ns1 and ns2:

      Next, we install the second operator in op-ns2 and configure it to watch ns3 and ns4:

      Note that the second operator is installed with skipClusterRoleCreation=true to avoid re-creating the ClusterRole and ClusterRoleBinding that were created by the first operator.

Finally, you can configure the installation to deploy the Seldon Core 2 operator in a specific namespace so that it controls resources only in that namespace. To do this, set controller.clusterwide to false.

    4. Install Seldon Core 2 runtimes in the namespace seldon-mesh.

5. Install Seldon Core 2 servers in the namespace seldon-mesh. Two example servers named mlserver-0 and triton-0 are installed so that you can load models onto these servers after installation.

6. Check that the Seldon Core 2 operator, runtimes, servers, and CRDs are installed in the namespace seldon-mesh:

      The output should be similar to this:

Note: Pods with names starting with seldon-dataflow-engine, seldon-pipelinegateway, and seldon-modelgateway may generate log errors until they successfully connect to Kafka. This occurs if Kafka has not yet been installed and integrated with Seldon Core 2.

    Next steps

    You can integrate Seldon Core 2 with Kafka that is self-hosted or a managed Kafka.

    Additional Resources

    • Seldon Enterprise Documentation

    • GKE Documentation

    • AWS Documentation

    • Azure Documentation

    kubectl
    Feature transformations that might be built with custom python code.
  • Drift detectors.

  • Outlier detectors.

  • Explainers

  • Adversarial detectors.

A typical workflow for a production machine learning setup might be as follows:

    1. You create a Tensorflow model for your core application use case and test this model in isolation to validate.

2. You create an SKLearn feature transformation component that runs before your model to convert the input into the correct form for your model. You also create Drift and Outlier detectors using Seldon's open source Alibi-detect library and test these in isolation.

    3. You join these components together into a Pipeline for the final production setup.

    These steps are shown in the diagram below:

    Examples & Tutorials

    This section will provide some examples to allow operations with Seldon to be tested so you can run your own models, experiments, pipelines and explainers.

    Getting Started Examples

    • Local examples

    • Kubernetes examples

    Models

    • Huggingface models

    • Model zoo

    • Artifact versions

    Pipelines

    • Pipeline examples

    • Pipeline to pipeline examples

    • Cyclic Pipeline

    Explainers

    • Explainer examples

    Servers

    • Custom Servers

    Experiments

    • Local experiments

    • Experiment version examples

    Making Inference Requests

    • Inference examples

    • Tritonclient examples

    • Batch Inference examples (kubernetes)

    • Batch Inference examples (local)

    Misc

    • Checking Pipeline readiness

    • Local Overcommit

    Further Kubernetes Examples

• Kubernetes clusterwide example

    Advanced Examples

    • Huggingface speech to sentiment with explanations pipeline

    • Production image classifier with drift and outlier monitoring

    • Production income classifier with drift, outlier and explanations

    • Conditional pipeline with pandas query model

    shown here
    Determining the Load Saturation point

    A first target of load testing should be determining the maximum throughput that one model replica is able to sustain. We’ll refer to this as the (single replica) load saturation point. Increasing the number of inference requests per second (RPS) beyond this point degrades latency due to queuing that occurs at the bottleneck (a model processing step, contention on a resource such as CPU, memory, network, etc).

    We recommend determining the load saturation point by running an open-loop load test that goes through a series of stages, each having a target RPS. A stage would first linearly ramp-up to its target RPS and then hold that RPS level constant for a set amount of time. The target RPS should monotonically increase between the stages.

    Load Saturation Point

    In order to get reproducible load test results, it is recommended to run the inference servers corresponding to the models being load-tested on k8s nodes where other components are not concurrently using a large proportion of the shared resources (compute, memory, IO).

    Closed-loop vs. Open-loop mode

Typically, load testing tools generate load by creating a number of "virtual users" that send requests. Knowing the behaviour of those virtual users is critical when interpreting the results of the load test. Load tests can be set up in closed-loop mode, meaning that each virtual user sends a request and waits for the response before sending the next one. Alternatively, there is open-loop mode, where a variable number of users are instantiated in order to maintain a constant overall RPS.

    • When running in closed-loop mode, an undesired side-effect called coordinated omission appears: when the system gets overloaded and latency spikes up, the load test users effectively reduce the actual load on the system by sending requests less frequently

    • When using closed-loop mode in load testing, be aware that reported latencies at a given throughput may be significantly smaller than what will be experienced in reality. In contrast, an open-loop load tester would maintain a constant RPS, resulting in a more accurate representation of the latencies that will be experienced when running the model in production.

• You can refer to the documentation of your load testing tool (e.g. Locust, k6) for guidance on choosing the right workload model (open vs. closed-loop) based on your testing goals.

    Each candidate has a traffic weight. The percentage of traffic will be this weight divided by the sum of traffic weights.

  • mirror (optional) : a single model to mirror traffic to the candidates. Responses from this model will not be returned to the caller.

An example experiment with a defaultModel is shown below:
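A minimal sketch of such a manifest, assuming candidates are given as name/weight pairs (the experiment name is a placeholder):

# sketch: 50/50 split between iris and iris2 on the existing iris endpoint
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample
spec:
  default: iris          # reuse the endpoint of the existing iris model
  candidates:
  - name: iris
    weight: 50
  - name: iris2
    weight: 50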

This defines a split of 50% traffic between two models, iris and iris2. In this case we want to expose this traffic split on the existing endpoint created for the iris model. This allows us to test new versions of models (in this case iris2) on an existing endpoint (in this case iris). The default key defines the model whose endpoint we want to change. The experiment becomes active when both underlying models are in the Ready status.

    An experiment over two separate models which exposes a new API endpoint is shown below:

To call the endpoint, add the header seldon-model: <experiment-name>.experiment; in this case: seldon-model: experiment-iris.experiment. For example with curl:

    For examples see the local experiments notebook.

    Pipeline Experiments

Running an experiment between pipelines is very similar. The difference is that resourceType: pipeline needs to be defined, and in this case the candidates or mirrors refer to pipelines. An example is shown below:
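A minimal sketch, assuming candidates simply name the pipelines (pipeline names are placeholders):

# sketch: traffic split between two pipelines
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-pipelines
spec:
  resourceType: pipeline
  candidates:
  - name: pipeline-a
    weight: 50
  - name: pipeline-b
    weight: 50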

    For an example see the local experiments notebook.

    Mirror Experiments

    A mirror can be added easily for model or pipeline experiments. An example model mirror experiment is shown below:

    For an example see the local experiments notebook.

    An example pipeline mirror experiment is shown below:

    For an example see the local experiments notebook.

    Sticky Sessions

    To allow cohorts to get consistent views in an experiment each inference request passes back a response header x-seldon-route which can be passed in future requests to an experiment to bypass the random traffic splits and get a prediction from the sequence of models and pipelines used in the initial request.

    Note: you must pass the normal seldon-model header along with the x-seldon-route header.
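For illustration, a follow-up request pinning the route returned by an earlier response could look like this (the x-seldon-route value is a placeholder copied from that earlier response):

curl http://${MESH_IP}/v2/models/experiment-iris/infer \
   -H "Content-Type: application/json" \
   -H "seldon-model: experiment-iris.experiment" \
   -H "x-seldon-route: <value returned by a previous response>" \
   -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'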

    This is illustrated in the local experiments notebook.

    Caveats: the models used will be the same but not necessarily the same replica instances. This means at present this will not work for stateful models that need to go to the same model replica instance.

    Service Meshes

    As an alternative you can choose to run experiments at the service mesh level if you use one of the popular service meshes that allow header based routing in traffic splits. For further discussion see here.

1. Core 2 natively supports scaling based on Inference Lag, meaning the difference between incoming and outgoing requests for a model in a given period of time. This is done by configuring minReplicas or maxReplicas in the Model CRs and making sure you configure the Core 2 install with the autoscaling.autoscalingModelEnabled helm value set to true (default is false). A minimal sketch follows this list.

    2. Users can expose custom or Kubernetes-native metrics, and then target the scaling of models based on those metrics by using HorizontalPodAutoscaler. This requires exposing the right metrics, using the monitoring tool of your choice (e.g. Prometheus).
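A minimal sketch for the first option (native model autoscaling), with a placeholder model name and artifact: the Model declares its scaling bounds, and native model autoscaling is switched on when installing or upgrading the seldon-core-v2-setup chart with --set autoscaling.autoscalingModelEnabled=true.

# sketch: scaling bounds for native (inference-lag based) model autoscaling
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  minReplicas: 1
  maxReplicas: 3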

    Once the approach for Model scaling is implemented, Server scaling needs to be configured.

    1. Implement Server Scaling by either:

1. Enabling Autoscaling of Servers based on Model needs. This is managed by Seldon's scheduler, and is enabled by setting minReplicas and maxReplicas in the Server Custom Resource and making sure you configure the Core 2 install with the autoscaling.autoscalingServerEnabled helm value set to true (the default). A sketch of such a Server is shown after this list.

2. If Models and Servers are to have a one-to-one mapping (no Multi-Model Serving), then users can also define scaling of Servers using an HPA manifest that matches the HPA applied to the associated Models. This approach is outlined in Model and Server Autoscaling with HPA. It will only work with custom metrics, as Kubernetes does not allow multiple HPAs to target the same metrics from Kubernetes directly.
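For the first option (scheduler-managed server autoscaling), a minimal sketch of a Server CR with scaling bounds (the server name is illustrative; autoscaling.autoscalingServerEnabled defaults to true):

# sketch: scaling bounds for scheduler-managed server autoscaling
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1
  maxReplicas: 3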

    Based on the requirements above, one of the following three options for coordinated autoscaling of Models and Servers can be chosen:

| Scaling Approach | Scaling Metric | Multi-Model Serving | Pros | Cons |
| --- | --- | --- | --- | --- |
| Seldon Core Autoscaling | Inference lag | ✅ | Simplest implementation; one metric across models | Model scale down only when no inference traffic exists to that model |
| Model Autoscaling with HPA | User-defined (HPA) | ✅ | Custom scaling metric | Requires metrics store integration (e.g. Prometheus); potentially suboptimal Server packing on scale down |
| Model and Server Autoscaling with HPA | User-defined (HPA) | ❌ | Coordinated Model and Server scaling | Requires metrics store integration (e.g. Prometheus); no Multi-Model Serving |

    Alternatively, the following decision-tree showcases the approaches we recommend based on users' requirements:

    Autoscaling Approach Decision-tree

    Scaling Seldon Services

    When running Core 2 at scale, it is important to understand the scaling behaviour of Seldon's services as well as the scaling of the Models and Servers themselves. This is outlined in the Scaling Core Services page.

    Upgrading

    Upgrading from 2.9 - 2.10

    All CRD changes maintain backward compatibility with existing CRs. We introduce new Core 2 scaling configuration options in SeldonConfig (config.ScalingConfig.*), with a wider goal of centralising Core 2 configuration and allowing for configuration changes after the Core 2 cluster is deployed. To ensure a smooth transition, some of the configuration options will only take effect starting from the next releases, but end-users are encouraged to set them to the desired values before upgrading to the next release (2.11).

Upgrading when using helm is seamless, with existing helm values being used to fill in new configuration options. If not using helm, previous SeldonConfig CRs remain valid, but restrictive defaults will be used for the scaling configuration. One parameter in particular, maxShardCountMultiplier (see the docs), will need to be set in order to take advantage of the new pipeline scalability features. This parameter can be changed, and the effects of its value will be propagated to all components that use the config.

For the full release notes, see the 2.10 release notes.

    Upgrading from 2.8 - 2.9

Though there are no breaking changes between 2.8 and 2.9, there are some new functionalities offered that require changes to fields in our CRDs:

• In Core 2.9 you can now set minReplicas to enable partial scheduling of Models. This means that users will no longer have to wait for the full set of desired replicas before loading models onto servers (e.g. when scaling up).

• We've also added a spec.llm field to the Model CRD. The field is used by the PromptRuntime in Seldon's LLM Module to reference an LLM model. Only one of spec.llm and spec.explainer should be set at a given time. This allows the deployment of multiple "models" acting as prompt generators for the same LLM.

• Due to the introduction of Server autoscaling, it is important to understand what type of autoscaling you want to leverage, and how it can be configured. Below are configuration options that help set autoscaling behaviour. All options here have corresponding command-line arguments that can be passed to seldon-scheduler when not using helm as the install method. The following helm values can be set:

    Upgrading from 2.7 - 2.8

    Core 2.8 introduces several new fields in our CRDs:

    • statefulSetPersistentVolumeClaimRetentionPolicy enables users to configure the cleaning of PVC on their servers. This field is set to retain as default.

    • Status.selector was introduced as a mandatory field for models in 2.8.4 and made optional in 2.8.5. This field enables autoscaling with HPA.

    • PodSpec in the OverrideSpec

    These added fields do not result in breaking changes, apart from 2.8.4 which required the setting of the Status.selector upon upgrading. This field was however changed to optional in the subsequent 2.8.5 release. Updating the CRDs (e.g. via helm) will enable users to benefit from the associated functionality.

    Upgrading from 2.6 - 2.7

All pods provisioned through the operator (i.e. SeldonRuntime and Server resources) now have the label app.kubernetes.io/name for identifying the pods.

Previously, the labelling has been inconsistent across different versions of Seldon Core 2, with a mixture of app and app.kubernetes.io/name used.

If using the Prometheus operator ("Kube Prometheus"), please apply the v2.7.0 manifests for Seldon Core 2 according to the metrics documentation.

    Note that these manifests need to be adjusted to discover metrics endpoints based on the existing setup.

    If previous pod monitors had namespaceSelector fields set, these should be copied over and applied to the new manifests.

If namespaces do not matter, cluster-wide metrics endpoint discovery can be set up by modifying the namespaceSelector field in the pod monitors:

    Upgrading from 2.5 - 2.6

Release 2.6 brings with it new custom resources SeldonConfig and SeldonRuntime, which provide a new way to install Seldon Core 2 in Kubernetes. Upgrading in the same namespace will cause downtime while the pods are being recreated. Alternatively, users can rely on an external service mesh or other means spanning multiple namespaces to bring up the system in a new namespace and redeploy models before switching traffic between them.

If the new 2.6 charts are used to upgrade in an existing namespace, models will eventually be redeployed, but there will be service downtime while the core components are redeployed.

    Test the Installation

    Learn how to verify your Seldon Core installation by running tests and checking component functionality.

    To confirm the successful installation of Seldon Core 2, Kafka, and the service mesh, deploy a sample model and perform an inference test. Follow these steps:

    Deploy the Iris Model

    1. Apply the following configuration to deploy the Iris model in the namespace seldon-mesh:

    The output is:

2. Verify that the model is deployed in the namespace seldon-mesh.

    When the model is deployed, the output is similar to:

    Deploy a pipeline for the Iris Model

    Note: The pipeline name must not be reused as the name of any individual step within the pipeline. This results in a Kubernetes validation error: pipeline iris must not have a step name with the same name as pipeline name

1. Apply the following configuration to deploy a pipeline for the Iris model in the namespace seldon-mesh:

    The output is:

2. Verify that the pipeline is deployed in the namespace seldon-mesh.

    When the pipeline is deployed, the output is similar to:

    Perform an Inference test

    1. Use curl to send a test inference request to the deployed model. Replace <INGRESS_IP> with your service mesh's ingress IP address. Ensure that:

    • The Host header matches the expected virtual host configured in your service mesh.

    • The Seldon-Model header specifies the correct model name.

    The output is similar to:

2. Use curl to send a test inference request through the pipeline to the deployed model. Replace <INGRESS_IP> with your service mesh's ingress IP address. Ensure that:

    • The Host header matches the expected virtual host configured in your service mesh.

    • The Seldon-Model header specifies the correct pipeline name.

    • To route inference requests to a pipeline endpoint, include the .pipeline suffix in the model name within the request header. This distinguishes the pipeline from a model that shares the same base name.

    The output is similar to:

    Artifact versions

Note: The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended. For more details, see Seldon CLI.

    Seldon V2 Kubernetes Multi Version Artifact Examples

    Servers

    Learn how to configure and manage inference servers in Seldon Core 2, including MLServer and Triton server farms, model scheduling, and capability management.

    By default Seldon installs two server farms using MLServer and Triton with 1 replica each. Models are scheduled onto servers based on the server's resources and whether the capabilities of the server matches the requirements specified in the Model request. For example:

This model specifies the requirement sklearn.

There are default capabilities for each server, as follows:

    • MLServer

    • Triton

    kubectl create ns seldon-mesh || echo "Namespace seldon-mesh already exists"
    kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
    helm repo add seldon-charts https://seldonio.github.io/helm-charts/
    helm repo update seldon-charts
    helm upgrade seldon-core-v2-crds seldon-charts/seldon-core-v2-crds \
    --namespace default \
    --install 
     helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
     --namespace seldon-mesh --set controller.clusterwide=true \
     --install
    curl http://${MESH_IP}/v2/models/experiment-iris/infer \
       -H "Content-Type: application/json" \
       -H "seldon-model: experiment-iris.experiment" \
       -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    kubectl apply -f - --namespace=seldon-mesh <<EOF
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
      requirements:
        - sklearn
    EOF
    
    Kubernetes Server with PVC

    Model and Server Autoscaling with HPA


    here
    Seldon Core Autoscaling
    Model Autoscaling with HPA
    • autoscaling.autoscalingModelEnabled, with corresponding cmd line arg: --enable-model-autoscaling (defaults to false): enable or disable native model autoscaling based on lag thresholds. Enabling this assumes that lag (number of inference requests "in-flight") is a representative metric based on which to scale your models in a way that makes efficient use of resources.

    • autoscaling.autoscalingServerEnabled with corresponding cmd line arg: --enable-server-autoscaling (defaults to "true"): enable to use native server autoscaling, where the number of server replicas is set according to the number of replicas required by the models loaded onto that server.

    • autoscaling.serverPackingEnabled with corresponding cmd line arg: --server-packing-enabled (experimental, defaults to "false"): enable server packing to try and reduce the number of server replicas on model scale-down.

    • autoscaling.serverPackingPercentage with corresponding cmd line arg: --server-packing-percentage (experimental, defaults to "0.0"): controls the percentage of model replica removals (due to model scale-down or deletion) that should trigger packing

for SeldonRuntimes enables users to customize how Seldon Core 2 pods are created. In particular, this also allows for setting custom taints/tolerations, adding additional containers to our pods, and configuring custom security settings.
    here
    partial scheduling
    LLM Module
    metrics documentation
We have a Triton model that has two version folders.

Model 1 adds 10 to the input; Model 2 multiplies the input by 10. The structure of the artifact repo is shown below:

    Before you begin

    1. Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.

    2. Ensure that you are performing these steps in the directory where you have downloaded the samples.

    3. Get the IP address of the Seldon Core 2 instance running with Istio:

    Make a note of the IP address that is displayed in the output. Replace <INGRESS_IP> with your service mesh's ingress IP address in the following commands.

    Model

    Seldon CLI
    Custom Capabilities

    Servers can be defined with a capabilities field to indicate custom configurations (e.g. Python dependencies). For instance:

    These capabilities override the ones from the serverConfig: mlserver. A model that takes advantage of this is shown below:

The above model will be matched with the previous custom server mlserver-134.

Servers can also be set up with extraCapabilities, which add to the existing capabilities from the referenced ServerConfig. For instance:

    This server, mlserver-extra, inherits a default set of capabilities via serverConfig: mlserver. These defaults are discussed above. The extraCapabilities are appended to these to create a single list of capabilities for this server.

    Models can then specify requirements to select a server that satisfies those requirements as follows.

    The capabilities field takes precedence over the extraCapabilities field.

    For some examples see here.

    Autoscaling of Servers

Within Docker we don't support this, but for Kubernetes see here.

    for ns in op-ns1 op-ns2 ns1 ns2 ns3 ns4; do kubectl create ns "$ns"; done
    helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
    --namespace op-ns1 \
    --set controller.clusterwide=true \
    --set "controller.watchNamespaces={ns1,ns2}" \
    --install
    helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
    --namespace op-ns2 \
    --set controller.clusterwide=true \
    --set "controller.watchNamespaces={ns3,ns4}" \
    --set controller.skipClusterRoleCreation=true \
    --install
    helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
    --namespace seldon-mesh \
    --install
     helm upgrade seldon-core-v2-servers seldon-charts/seldon-core-v2-servers \
     --namespace seldon-mesh \
     --install
     kubectl get pods -n seldon-mesh
    NAME                                            READY   STATUS             RESTARTS      AGE
    hodometer-749d7c6875-4d4vw                      1/1     Running            0             4m33s
    mlserver-0                                      3/3     Running            0             4m10s
    seldon-dataflow-engine-7b98c76d67-v2ztq         0/1     CrashLoopBackOff   5 (49s ago)   4m33s
    seldon-envoy-bb99f6c6b-4mpjd                    1/1     Running            0             4m33s
    seldon-modelgateway-5c76c7695b-bhfj5            1/1     Running            0             4m34s
    seldon-pipelinegateway-584c7d95c-bs8c9          1/1     Running            0             4m34s
    seldon-scheduler-0                              1/1     Running            0             4m34s
    seldon-v2-controller-manager-5dd676c7b7-xq5sm   1/1     Running            0             4m52s
    triton-0                                        2/3     Running            0             4m10s
    spec:
      namespaceSelector:
        any: true
    model.mlops.seldon.io/iris created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/iris condition met
    kubectl apply -f - --namespace=seldon-mesh <<EOF
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: irispipeline
    spec:
      steps:
        - name: iris
      output:
        steps:
        - iris
    EOF
    pipeline.mlops.seldon.io/irispipeline created
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/irispipeline condition met
    curl -k http://<INGRESS_IP>:80/v2/models/iris/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: iris" \
      -d '{
        "inputs": [
          {
            "name": "predict",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[1, 2, 3, 4]]
          }
        ]
      }'
    {"model_name":"iris_1","model_version":"1","id":"f4d8b82f-2af3-44fb-b115-60a269cbfa5e","parameters":{},"outputs":[{"name":"predict","shape":[1,1],"datatype":"INT64","parameters":{"content_type":"np"},"data":[2]}]}
    curl -k http://<INGRESS_IP>:80/v2/models/irispipeline/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: irispipeline.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "predict",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[1, 2, 3, 4]]
          }
        ]
      }'
    {"model_name":"","outputs":[{"data":[2],"name":"predict","shape":[1,1],"datatype":"INT64","parameters":{"content_type":"np"}}]}
    curl -k http://<INGRESS_IP>:80/v2/models/math/infer \                           
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: math" \
      -d '{
        "model_name": "math",
        "inputs": [
          {
            "name": "INPUT",
            "datatype": "FP32",
            "shape": [4],
            "data": [1, 2, 3, 4]
          }
        ]
      }' | jq
    {
      "model_name": "math_1",
      "model_version": "1",
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            4
          ],
          "data": [
            11.0,
            12.0,
            13.0,
            14.0
          ]
        }
      ]
    }
    
    seldon model infer math --inference-mode grpc --inference-host <INGRESS_IP>:80 \
      '{"model_name":"math","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "modelName": "math_1",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              11,
              12,
              13,
              14
            ]
          }
        }
      ]
    }
    
    curl -k http://<INGRESS_IP>:80/v2/models/math/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: math" \
      -d '{
        "model_name": "math",
        "inputs": [
          {
            "name": "INPUT",
            "datatype": "FP32",
            "shape": [4],
            "data": [1, 2, 3, 4]
          }
        ]
      }' | jq
    {
      "model_name": "math_2",
      "model_version": "1",
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            4
          ],
          "data": [
            10.0,
            20.0,
            30.0,
            40.0
          ]
        }
      ]
    }
    
    seldon model infer math --inference-mode grpc --inference-host <INGRESS_IP>:80 \
      '{"model_name":"math","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "modelName": "math_2",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              10,
              20,
              30,
              40
            ]
          }
        }
      ]
    }
    
    config.pbtxt
    1/model.py <add 10>
    2/model.py <mul 10>
    
    ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
    echo "Seldon Core 2: http://$ISTIO_INGRESS"
    cat ./models/multi-version-1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: math
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/multi-version"
      artifactVersion: 1
      requirements:
      - triton
      - python
    
    kubectl apply -f ./models/multi-version-1.yaml -n seldon-mesh
    model.mlops.seldon.io/math created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/math condition met
    
    cat ./models/multi-version-2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: math
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/multi-version"
      artifactVersion: 2
      requirements:
      - triton
      - python
    
    kubectl apply -f ./models/multi-version-2.yaml -n seldon-mesh
    model.mlops.seldon.io/math configured
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/math condition met
    
    kubectl delete -f ./models/multi-version-1.yaml -n seldon-mesh
    model.mlops.seldon.io "math" deleted
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    - name: SELDON_SERVER_CAPABILITIES
      value: "mlserver,alibi-detect,alibi-explain,huggingface,lightgbm,mlflow,python,sklearn,spark-mlib,xgboost"
    - name: SELDON_SERVER_CAPABILITIES
      value: "triton,dali,fil,onnx,openvino,python,pytorch,tensorflow,tensorrt"
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-134
    spec:
      serverConfig: mlserver
      capabilities:
      - mlserver-1.3.4
      podSpec:
        containers:
        - image: seldonio/mlserver:1.3.4
          name: mlserver
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - mlserver-1.3.4
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-extra
    spec:
      serverConfig: mlserver
      extraCapabilities:
      - extra
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: extra-model-requirements
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - extra

    Model and Server Autoscaling with HPA

    Learn how to implement HPA-based autoscaling for both Models and Servers in single-model serving deployments

    This page describes how to implement autoscaling for both Models and Servers using Kubernetes HPA (Horizontal Pod Autoscaler) in a single-model serving setup. This approach is specifically designed for deployments where each Server hosts exactly one Model replica (1:1 mapping between Models and Servers).

    Overview

    In single-model serving deployments, you can use HPA to scale both Models and their associated Servers independently, while ensuring they scale in a coordinated manner. This is achieved by:

    1. Setting up HPA for Models based on custom metrics (e.g., RPS)

    2. Setting up matching HPA configurations for Servers

    3. Ensuring both HPAs target the same metrics and scaling policies

    Key Considerations

    • Only custom metrics from Prometheus are supported. Native Kubernetes resource metrics such as CPU or memory are not. This limitation exists because of HPA's design: In order to prevent multiple HPA CRs from issuing conflicting scaling instructions, each HPA CR must exclusively control a set of pods which is disjoint from the pods controlled by other HPA CRs. In Seldon Core 2, CPU/memory metrics can be used to scale the number of Server replicas via HPA. However, this also means that the CPU/memory metrics from the same set of pods can no longer be used to scale the number of model replicas.

    • No Multi-Model Serving - this approach requires a 1-1 mapping of Models and Servers, meaning that consolidation of multiple Models onto shared Servers (Multi-Model Serving) is not possible.

    Implementation

    In order to set up HPA Autoscaling for Models and Servers together, metrics need to be exposed in the same way as is explained in the tutorial. If metrics have not yet been exposed, follow that workflow until you are at the point where you will configure and apply the HPA manifests.

In this implementation, Models and Servers are each configured to autoscale via separate HPA manifests targeting the same metric. In order for the scaling to stay in sync, it is important to apply the two HPA manifests together and to keep both the scaling metric and any scaling policies the same across them. This ensures that Models and Servers are scaled up or down at approximately the same time. Small variations in the scale-up time are expected because each HPA samples the metrics independently, at regular intervals.

Here's an example configuration, utilizing the same infer_rps metric as was set up in the previous example.

    Scaling Behavior

In order to ensure similar scaling behaviour between Models and Servers, minReplicas and maxReplicas, as well as any other configured scaling policies, should be kept in sync across the HPA manifests for the model and the server.

Each HPA CR has its own timer on which it samples the specified custom metrics. This timer starts when the CR is created, with sampling of the metric being done at regular intervals (by default, 15 seconds). As a side effect of this, creating the Model HPA and the Server HPA (for a given model) at different times will mean that the scaling decisions on the two are taken at different times. Even when creating the two CRs together as part of the same manifest, there will usually be a small delay between the point where the Model and Server spec.replicas values are changed. Despite this delay, the two will converge to the same number when the decisions are taken based on the same metric (as in the previous examples).

    Note: If a Model gets scaled up slightly before its corresponding Server, the model is currently marked with the condition ModelReady "Status: False" with a "ScheduleFailed" message until new Server replicas become available. However, the existing replicas of that model remain available and will continue to serve inference load.

    Monitoring Scaling

When showing the HPA CR information via kubectl get, a column of the output displays the current metric value per replica and the target average value in the format [per replica metric value]/[target]. This information is updated in accordance with the sampling rate of each HPA resource. It is therefore expected to sometimes see different metric values for the Model and its corresponding Server.

    Some versions of k8s will display [per pod metric value] instead of [per replica metric value], with the number of pods being computed based on a label selector present in the target resource CR (the status.selector value for the Model or Server in the Core 2 case).

    HPA is designed so that multiple HPA CRs cannot target the same underlying pod with this selector (with HPA stopping when such a condition is detected). This means that in Core 2, the Model and Server selector cannot be the same. A design choice was made to assign the Model a unique selector that does not match any pods.

As a result, for the k8s versions displaying [per pod metric value], the information shown for the Model HPA CR will be an overflow caused by division by zero. This is only a display artefact, with the Model HPA continuing to work normally. The actual value of the metric can be seen by inspecting the corresponding Server HPA CR, or by fetching the metric directly via kubectl get --raw.
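For illustration, with a Prometheus-adapter style custom metrics API and the irisa0 model used in the manifests above, the raw metric might be fetched with something like the following; the exact path depends on how your metrics adapter exposes infer_rps:

kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta2/namespaces/seldon-mesh/models.mlops.seldon.io/irisa0/infer_rps" | jq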

    Limitations

    In this implementation, the scheduler itself does not create new Server replicas when the existing replicas are not sufficient for loading a Model's replicas (one Model replica per Server replica). Whenever a Model requests more replicas than available on any of the available Servers, its ModelReady condition transitions to Status: False with a ScheduleFailed message. However, any replicas of that Model that are already loaded at that point remain available for servicing inference load.

    Best Practices

    The following elements are important to take into account when setting the HPA policies.

• The speed with which new Server replicas can become available versus how many new replicas HPA may request in a given time:

• The HPA scale-up policy should not be configured to request more replicas than can become available in the specified time. The following example reflects a confidence that 5 Server pods will become available within 90 seconds, with some safety margin. The default scale-up config, which also adds a percentage-based policy (doubling the existing replicas within the set periodSeconds), is not recommended for this reason.

      • Perhaps more importantly, there is no reason to scale faster than the time it takes for replicas to become available - this is the true maximum rate with which scaling up can happen anyway.

    • The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.

• The previous example, at line 13, configures a scale-up stabilization window of one minute. This means that, of all the replica counts recommended by HPA within the last 60-second window (4 samples of the custom metric at the default sampling rate), only the smallest is applied.

      • Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.

    Huggingface models

    Text Generation Model

    Load the model

    kubectl apply -f ./models/hf-text-gen.yaml
    model.mlops.seldon.io/text-gen created
    seldon model load -f ./models/hf-text-gen.yaml
    {}

    Wait for the model to be ready

    kubectl get model text-gen -n ${NAMESPACE} -o json | jq -r '.status.conditions[] | select(.message == "ModelAvailable") | .status'
    True
    seldon model status text-gen -w ModelAvailable | jq -M .
    {}

    Do a REST inference call

    Unload the model
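A sketch of the corresponding commands, mirroring the load step above (use whichever interface you used to load the model):

kubectl delete -f ./models/hf-text-gen.yaml
seldon model unload text-gen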

    Custom Text Generation Model

    Load the model

    Unload the model

    Custom Servers

Note: The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended. For more details, see Seldon CLI.

    Custom Server with Capabilities

    Outlier Detection

    Learn how to implement outlier detection in Seldon Core using Alibi-Detect integration for model monitoring and anomaly detection.

    Outlier detection models are treated as any other Model. You can run any saved outlier detection model by adding the requirement alibi-detect.

    An example outlier detection model from the CIFAR10 image classification example is shown below:

    Examples

    cat ./models/hf-text-gen.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: text-gen
    spec:
      storageUri: "gs://seldon-models/mlserver/huggingface/text-generation"
      requirements:
      - huggingface
    

    REST

    kubectl get --raw

• The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.

• It is useful to consider whether the replica scale-up rate configured via the policy (line 15 in the example) is able to keep up with this RPS increase rate.

    • Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment as you are draining the "blue" deployment and transitioning to the "green" one

    Exposing Metrics
    previous example
    HPA Autoscaling for Single-Model Serving
    curl --location 'http://${MESH_IP}:9000/v2/models/text-gen/infer' \
    	--header 'Content-Type: application/json'  \
        --data '{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'
    {
    	"model_name": "text-gen_1",
    	"model_version": "1",
    	"id": "121ff5f4-1d4a-46d0-9a5e-4cd3b11040df",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "output",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "BYTES",
    			"parameters": {
    				"content_type": "hg_jsonlist"
    			},
    			"data": [
    				"{\"generated_text\": \"Once upon a time in a galaxy far away, the planet is full of strange little creatures. A very strange combination of creatures in that universe, that is. A strange combination of creatures in that universe, that is. A kind of creature that is\"}"
    			]
    		}
    	]
    }
    
• CIFAR10 image classification with outlier detector

• Tabular income classification model with outlier detector

# samples/models/cifar10-outlier-detect.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: cifar10-outlier
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/outlier-detector"
      requirements:
        - mlserver
        - alibi-detect
    The capabilities field replaces the capabilities from the ServerConfig.

    Custom Server with Extra Capabilities

    The extraCapabilities field extends the existing list from the ServerConfig.

    import os
    os.environ["NAMESPACE"] = "seldon-mesh"
    MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    MESH_IP=MESH_IP[0]
    import os
    os.environ['MESH_IP'] = MESH_IP
    MESH_IP
    hpa-custom-policy.yaml
    
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: irisa0-model-hpa
      namespace: seldon-mesh
    spec:
      scaleTargetRef:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Object
        object:
          metric:
            name: infer_rps
          describedObject:
            apiVersion: mlops.seldon.io/v1alpha1
            kind: Model
            name: irisa0
          target:
            type: AverageValue
            averageValue: 3
    ---
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: mlserver-server-hpa
      namespace: seldon-mesh
    spec:
      scaleTargetRef:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Server
        name: mlserver
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Object
        object:
          metric:
            name: infer_rps
          describedObject:
            apiVersion: mlops.seldon.io/v1alpha1
            kind: Model
            name: irisa0
          target:
            type: AverageValue
            averageValue: 3
    hpa-custom-policy.yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: irisa0-model-hpa
      namespace: seldon-mesh
    spec:
      scaleTargetRef:
        ...
      minReplicas: 1
      maxReplicas: 3
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
          - type: Pods
            value: 5
            periodSeconds: 90
      metrics:
        ...
    seldon model infer text-gen \
      '{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'
    {
    	"model_name": "text-gen_1",
    	"model_version": "1",
    	"id": "121ff5f4-1d4a-46d0-9a5e-4cd3b11040df",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "output",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "BYTES",
    			"parameters": {
    				"content_type": "hg_jsonlist"
    			},
    			"data": [
    				"{\"generated_text\": \"Once upon a time in a galaxy far away, the planet is full of strange little creatures. A very strange combination of creatures in that universe, that is. A strange combination of creatures in that universe, that is. A kind of creature that is\"}"
    			]
    		}
    	]
    }
    
    res = !seldon model infer text-gen --inference-mode grpc \
       '{"inputs":[{"name":"args","contents":{"bytes_contents":["T25jZSB1cG9uIGEgdGltZSBpbiBhIGdhbGF4eSBmYXIgYXdheQo="]},"datatype":"BYTES","shape":[1]}]}'
    import json
    import base64
    r = json.loads(res[0])
    base64.b64decode(r["outputs"][0]["contents"]["bytesContents"][0])
    b'{"generated_text": "Once upon a time in a galaxy far away\\n\\nThe Universe is a big and massive place. How can you feel any of this? Your body doesn\'t make sense if the Universe is in full swing \\u2014 you don\'t have to remember whether the"}'
    
    kubectl delete model text-gen
    seldon model unload text-gen
    cat ./models/hf-text-gen-custom-tiny-stories.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: custom-tiny-stories-text-gen
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/huggingface-text-gen-custom-tiny-stories"
      requirements:
        - huggingface
    
    kubectl apply -f ./models/hf-text-gen-custom-tiny-stories.yaml
    model.mlops.seldon.io/custom-tiny-stories-text-gen created
    kubectl get model custom-tiny-stories-text-gen -n ${NAMESPACE} -o json | jq -r '.status.conditions[] | select(.message == "ModelAvailable") | .status'
    True
    curl --location 'http://${MESH_IP}:9000/v2/models/custom-tiny-stories-text-gen/infer' \
    	--header 'Content-Type: application/json'  \
        --data '{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'
    {
    	"model_name": "custom-tiny-stories-text-gen_1",
    	"model_version": "1",
    	"id": "d0fce59c-76e2-4f81-9711-1c93d08bcbf9",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "output",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "BYTES",
    			"parameters": {
    				"content_type": "hg_jsonlist"
    			},
    			"data": [
    				"{\"generated_text\": \"Once upon a time in a galaxy far away. It was a very special place to live.\\n\"}"
    			]
    		}
    	]
    }
    
    seldon model load -f ./models/hf-text-gen-custom-tiny-stories.yaml
    {}
    
    seldon model status custom-tiny-stories-text-gen -w ModelAvailable | jq -M .
    {}
    
    seldon model infer custom-tiny-stories-text-gen \
      '{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'
    {
    	"model_name": "custom-tiny-stories-text-gen_1",
    	"model_version": "1",
    	"id": "d0fce59c-76e2-4f81-9711-1c93d08bcbf9",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "output",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "BYTES",
    			"parameters": {
    				"content_type": "hg_jsonlist"
    			},
    			"data": [
    				"{\"generated_text\": \"Once upon a time in a galaxy far away. It was a very special place to live.\\n\"}"
    			]
    		}
    	]
    }
    
    res = !seldon model infer custom-tiny-stories-text-gen --inference-mode grpc \
       '{"inputs":[{"name":"args","contents":{"bytes_contents":["T25jZSB1cG9uIGEgdGltZSBpbiBhIGdhbGF4eSBmYXIgYXdheQo="]},"datatype":"BYTES","shape":[1]}]}'
    import json
    import base64
    r = json.loads(res[0])
    base64.b64decode(r["outputs"][0]["contents"]["bytesContents"][0])
    b'{"generated_text": "Once upon a time in a galaxy far away\\nOne night, a little girl named Lily went to"}'
    
    kubectl delete model custom-tiny-stories-text-gen
    seldon model unload custom-tiny-stories-text-gen
    As a next step, why not try running a larger-scale model? You can find a definition for one in ./models/hf-text-gen-custom-gpt2.yaml. However, you may need to request and allocate more memory!
    '172.18.255.2'
    
    cat ./servers/custom-mlserver-capabilities.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-134
    spec:
      serverConfig: mlserver
      capabilities:
      - mlserver-1.3.4
      podSpec:
        containers:
        - image: seldonio/mlserver:1.3.4
          name: mlserver
    
    kubectl create -f ./servers/custom-mlserver-capabilities.yaml -n ${NAMESPACE}
    server.mlops.seldon.io/mlserver-134 created
    
    kubectl wait --for condition=ready --timeout=300s server --all -n ${NAMESPACE}
    server.mlops.seldon.io/mlserver condition met
    server.mlops.seldon.io/mlserver-134 condition met
    server.mlops.seldon.io/triton condition met
    
    cat ./models/iris-custom-requirements.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - mlserver-1.3.4
    
    kubectl create -f ./models/iris-custom-requirements.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/iris created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
    model.mlops.seldon.io/iris condition met
    
    seldon model infer iris --inference-host ${MESH_IP}:80 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "057ae95c-e6bc-4f57-babf-0817ff171729",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    kubectl delete -f ./models/iris-custom-requirements.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "iris" deleted
    
    kubectl delete -f ./servers/custom-mlserver-capabilities.yaml -n ${NAMESPACE}
    server.mlops.seldon.io "mlserver-134" deleted
    
    cat ./servers/custom-mlserver.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-134
    spec:
      serverConfig: mlserver
      extraCapabilities:
      - mlserver-1.3.4
      podSpec:
        containers:
        - image: seldonio/mlserver:1.3.4
          name: mlserver
    
    kubectl create -f ./servers/custom-mlserver.yaml -n ${NAMESPACE}
    server.mlops.seldon.io/mlserver-134 created
    
    kubectl wait --for condition=ready --timeout=300s server --all -n ${NAMESPACE}
    server.mlops.seldon.io/mlserver condition met
    server.mlops.seldon.io/mlserver-134 condition met
    server.mlops.seldon.io/triton condition met
    
    cat ./models/iris-custom-server.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      server: mlserver-134
    
    kubectl create -f ./models/iris-custom-server.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/iris created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
    model.mlops.seldon.io/iris condition met
    
    seldon model infer iris --inference-host ${MESH_IP}:80 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "a3e17c6c-ee3f-4a51-b890-6fb16385a757",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    kubectl delete -f ./models/iris-custom-server.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "iris" deleted
    
    kubectl delete -f ./servers/custom-mlserver.yaml -n ${NAMESPACE}
    server.mlops.seldon.io "mlserver-134" deleted
    

    Storage Secrets

    Learn how to configure storage secrets in Seldon Core 2 for secure model artifact access using Rclone, including AWS S3, GCS, and MinIO integration.

    Inference artifacts referenced by Models can be stored in any of the storage backends supported by Rclone. This includes local filesystems, AWS S3, and Google Cloud Storage (GCS), among others. Configuration is provided out-of-the-box for public GCS buckets, which enables the use of Seldon-provided models, as in the example below:

    This configuration is provided by the Kubernetes Secret seldon-rclone-gs-public. It is made available to Servers as a preloaded secret. You can define and use your own storage configurations in exactly the same way.

    Configuration Format

    To define a new storage configuration, you need the following details:

    • Remote name

    • Remote type

    • Provider parameters

    A remote is what Rclone calls a storage location. The type defines what protocol Rclone should use to talk to this remote. A provider is a particular implementation for that storage type. Some storage types have multiple providers, such as s3 having AWS S3 itself, MinIO, Ceph, and so on.

    The remote name is your choice. The prefix you use for models in spec.storageUri must be the same as this remote name.

    The remote type is one of the values supported by Rclone. For example, for AWS S3 it is s3 and for Dropbox it is dropbox.

    The provider parameters depend entirely on the remote type and the specific provider you are using. Please check the Rclone documentation for the appropriate provider. Note that the Rclone docs for storage types call the parameters properties and provide both config and env var formats; you need to use the config format. For example, the GCS parameter --gcs-client-id described here should be used as client_id.

    For reference, this format is described in the Rclone documentation. Note that we do not support the use of opts discussed in that section.
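    As an illustrative sketch only (the region and credential values below are placeholders, not values from this guide), an Rclone configuration for an AWS S3 remote named s3 follows the same type/name/parameters structure used in the examples later on this page:

    type: s3
    name: s3
    parameters:
      provider: AWS
      env_auth: false
      access_key_id: <your-access-key-id>
      secret_access_key: <your-secret-access-key>
      region: eu-west-1

    A Model using this remote would then reference its artifact with a storageUri starting with s3://, matching the remote name.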

    Kubernetes Secrets

    Kubernetes Secrets are used to store Rclone configurations, or storage secrets, for use by Servers. Each Secret should contain exactly one Rclone configuration.

    A Server can use storage secrets in one of two ways:

    • It can dynamically load a secret specified by a Model in its .spec.secretName

    • It can use global configurations made available via preloaded secrets

    The name of a Secret is entirely your choice, as is the name of the data key in that Secret. All that matters is that there is a single data key and that its value is in the format described above.

    Note: It is possible to use preloaded secrets for some Models and dynamically loaded secrets for others.

    Preloaded Secrets

    Rather than Models always having to specify which secret to use, a Server can load storage secrets ahead of time. These can then be reused across many Models.

    When using a preloaded secret, the Model definition should leave .spec.secretName empty. The protocol prefix in .spec.storageUri still needs to match the remote name specified by a storage secret.
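    For illustration, a Model relying on a preloaded secret that defines an s3 remote might look as follows (the bucket and path are hypothetical); note that .spec.secretName is omitted and only the s3:// prefix ties the Model to the remote:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: my-model
    spec:
      storageUri: "s3://my-bucket/my-model"   # hypothetical bucket and path
      requirements:
      - sklearn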

    The secrets to preload are named in a centralised ConfigMap called seldon-agent. This ConfigMap applies to all Servers managed by the same SeldonRuntime. By default this ConfigMap only includes seldon-rclone-gs-public, but can be extended with your own secrets as shown below:

    The easiest way to change this is to update your SeldonRuntime.

    • If your SeldonRuntime is configured using the seldon-core-v2-runtime Helm chart, the corresponding value is config.agentConfig.rclone.configSecrets. This can be used as shown below:

    • Otherwise, if your SeldonRuntime is configured directly, you can add secrets by setting .spec.config.agentConfig.rclone.config_secrets. This can be used as follows:

    Examples

    Assuming you have installed MinIO in the minio-system namespace, a corresponding secret could be:

    You can then reference this in a Model with .spec.secretName:

    GCS can use service accounts for access. You can generate the credentials for a service account using the gcloud CLI:

    The contents of gcloud-application-credentials.json can be put into a secret:

    You can then reference this in a Model with .spec.secretName:

    Run Inference

    We will show:

    • Model inference to a Tensorflow model

      • REST and gRPC using seldon CLI, curl and grpcurl

    • Pipeline inference

      • REST and gRPC using seldon CLI, curl and grpcurl

    Tensorflow Model

    Load the model.

    Wait for the model to be ready.

    Kubernetes Server with PVC

    Note: The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended. For more details, see Seldon CLI.

    Kind cluster setup

    To run this example in Kind, we need to start Kind with access to a local folder where our models are located. In this example we use a folder in /tmp and associate it with a path in the container.

    To start a Kind cluster, see Learning environment.

    Create the local folder for models and copy an example iris sklearn model to it.

    Create Server with PVC

    Create a storage class and associated persistent volume referencing the /models folder where the models are stored.

    Now create a new Server based on the provided MLServer configuration, extended with our PVC. The volume is added to the rclone container, which allows rclone to copy models from this PVC onto the server.

    We also add a new capability pvc so that models requiring the PVC can be scheduled to this server.

    SKLearn Model

    Use a simple sklearn iris classification model with the added pvc requirement so that MLServer with the PVC is targeted during scheduling.

    Do a gRPC inference call

    Kafka Integration

    Learn how to set up and configure Kafka for Seldon Core in production environments, including cluster setup and security configuration.

    Kafka is a component in the Seldon Core 2 ecosystem that provides scalable, reliable, and flexible communication for machine learning deployments. It serves as a strong backbone for building complex inference pipelines, managing high-throughput asynchronous predictions, and seamlessly integrating with event-driven systems, which are key features needed for contemporary enterprise-grade ML platforms.

    An inference request is a request sent to a machine learning model to make a prediction or inference based on input data. It is a core concept in deploying machine learning models in production, where models serve predictions to users or systems in real-time or batch mode.

    To explore this feature of Seldon Core 2, you need to integrate with Kafka. You can integrate Kafka through managed cloud services or by deploying it directly within a Kubernetes cluster.

    Note: Kafka is an external component outside of the main Seldon stack. Therefore, it is the cluster administrator's responsibility to administrate and manage the Kafka instance used by Seldon. For production installations, it is highly recommended to use a managed Kafka instance.

    • The Securing Kafka section provides more information about encryption and authentication.

    • The Configuration examples section provides the steps to configure some of the managed Kafka services.

    Self-hosted Kafka

    Seldon Core 2 requires Kafka to implement data-centric inference Pipelines. To install Kafka for testing purposes in your Kubernetes cluster, use the Strimzi Operator. For more information, see Self Hosted Kafka.
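    For a rough idea of what a minimal test installation can look like, the sketch below shows a small single-node Kafka cluster defined with the Strimzi Kafka custom resource. The names, sizes, and settings are illustrative only and are not tuned for production use:

    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: seldon
      namespace: kafka
    spec:
      kafka:
        replicas: 1
        listeners:
        - name: plain
          port: 9092
          type: internal
          tls: false
        config:
          # single-node settings; a production cluster needs higher replication
          offsets.topic.replication.factor: 1
          transaction.state.log.replication.factor: 1
          transaction.state.log.min.isr: 1
        storage:
          type: ephemeral
      zookeeper:
        replicas: 1
        storage:
          type: ephemeral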

    Resource allocation

    Learn more about using taints and tolerations with node affinity or node selector to allocate resources in a Kubernetes cluster.

    When deploying machine learning models in Kubernetes, you may need to control which infrastructure resources these models use. This is especially important in environments where certain workloads, such as resource-intensive models, should be isolated from others or where specific hardware such as GPUs, needs to be dedicated to particular tasks. Without fine-grained control over workload placement, models might end up running on suboptimal nodes, leading to inefficiencies or resource contention.

    For example, you may want to:

    • Isolate inference workloads from control plane components or other services to prevent resource contention.

    • Ensure that GPU nodes are reserved exclusively for models that require hardware acceleration.

    • Keep business-critical models on dedicated nodes to ensure performance and reliability.

    • Run external dependencies like Kafka on separate nodes to avoid interference with inference workloads.

    To solve these problems, Kubernetes provides mechanisms such as taints, tolerations, and nodeAffinity or nodeSelector to control resource allocation and workload scheduling.

    Taints are applied to nodes and tolerations to Pods to control which Pods can be scheduled on specific nodes within the Kubernetes cluster. Pods without a matching toleration for a node's taint are not scheduled on that node. For instance, if a node has GPUs or other specialized hardware, you can prevent Pods that don't need these resources from running on that node to avoid unnecessary resource usage.

    Note: Taints and tolerations alone do not ensure that a Pod runs on a tainted node. Even if a Pod has the correct toleration, Kubernetes may still schedule it on other nodes without taints. To ensure a Pod runs on a specific node, you need to also use node affinity and node selector rules.

    When used together, taints and tolerations with nodeAffinity or nodeSelector can effectively allocate certain Pods to specific nodes, while preventing other Pods from being scheduled on those nodes.

    In a Kubernetes cluster running Seldon Core 2, this involves two key configurations:

    1. Configuring servers to run on specific nodes using mechanisms like taints, tolerations, and nodeAffinity or nodeSelector.

    2. Configuring models so that they are scheduled and loaded on the appropriate servers.

    This ensures that models are deployed on the optimal infrastructure and servers that meet their requirements.
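    The sketch below illustrates both steps for a hypothetical GPU node pool. The taint key, node label, and capability name (gpu) are placeholders to adapt to your cluster: the Server adds a toleration and node affinity in its podSpec and advertises an extra capability, and the Model then requests that capability so it is scheduled onto this Server.

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-gpu
    spec:
      serverConfig: mlserver
      extraCapabilities:
      - gpu
      podSpec:
        tolerations:
        - key: "gpu-only"            # hypothetical taint applied to the GPU nodes
          operator: "Exists"
          effect: "NoSchedule"
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: "pool"        # hypothetical node label
                  operator: In
                  values:
                  - "gpu-nodes"
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: my-gpu-model
    spec:
      storageUri: "gs://my-bucket/my-gpu-model"   # hypothetical artifact location
      requirements:
      - gpu

    Alternatively, a Model can pin a specific Server by name with spec.server, as shown in the custom server examples elsewhere in this guide.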

    Server Config

    Note: This section is for advanced usage where you want to define new types of inference servers.

    Server configurations define how to create an inference server. By default, one is provided for Seldon MLServer and one for NVIDIA Triton Inference Server. Both of these servers support the Open Inference Protocol, which is a requirement for all inference servers. The configuration defines the Kubernetes ReplicaSet to be created, which includes the Seldon Agent reverse proxy as well as an Rclone server for downloading artifacts for the server. The Kustomize ServerConfig for MLServer is shown below:

    Checking Pipeline readiness

    Local example settings.

    Remote k8s cluster example settings - change as needed for your setup.

    Model Chain - Ready Check

    We will check the readiness of the Pipeline after every change to model and pipeline.

    Experiment version examples

    This notebook will show how we can update running experiments.

    Test change candidate for a model

    We will use three SKLearn Iris classification models to illustrate experiment updates.

    Load all models.

    Let's call all three models individually first.

    We will start an experiment to change the iris endpoint to split traffic with the iris2 model.

    Seldon Config

    Learn how to configure Seldon Core installation components using SeldonConfig resource, including component specifications, Kafka settings, and tracing configuration.

    Note: This section is for advanced usage where you want to define how Seldon is installed in each namespace.

    The SeldonConfig resource defines the core installation components installed by Seldon. If you wish to install Seldon, you can use the SeldonRuntime resource, which allows easy overriding of some parts defined in this specification. In general, we advise core DevOps to use the default SeldonConfig or customize it for their usage. Individual installations of Seldon can then use the SeldonRuntime with a few overrides for any special customisation needed in that namespace.

    The specification contains core PodSpecs for each core component and a section for general configuration including the ConfigMaps that are created for the Agent (rclone defaults), Kafka and Tracing (open telemetry).

    Some of these values can be overridden on a per namespace basis via the SeldonRuntime resource. Labels and annotations can also be set at the component level - these will be merged with the labels and annotations from the SeldonConfig resource in which they are defined and added to the component's corresponding Deployment, or StatefulSet.
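    As a hedged sketch of what such a customization could look like (assuming the component-level fields are expressed as labels and annotations alongside the component name, as described above; the component name and values are illustrative), a SeldonConfig entry might be:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: SeldonConfig
    metadata:
      name: default
    spec:
      components:
      - name: seldon-scheduler        # one of the core components
        labels:
          team: platform              # hypothetical label merged onto the component's workload
        annotations:
          owner: mlops-team           # hypothetical annotation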

    Drift Detection

    Learn how to implement drift detection in Seldon Core 2 using Alibi-Detect integration for model monitoring and batch processing.

    Drift detection models are treated as any other Model. You can run any saved Alibi-Detect drift detection model by adding the requirement alibi-detect.

    An example drift detection model from the CIFAR10 image classification example is shown below:

    Usually you would run these models in an asynchronous part of a Pipeline, i.e. they are not connected to the output of the Pipeline, which defines the synchronous path. For example, the CIFAR-10 image classification example uses a pipeline as shown below:

    Note how the cifar10-drift model is not part of the path to the outputs. Drift alerts can be read from the Kafka topic of the model.

    Core 2 Configuration

    When tuning performance for pipelines, reducing the overhead of Core 2 components responsible for data-processing within a pipeline is another aspect to consider. In Core 2, four components influence this overhead:

    • pipelinegateway which handles pipeline requests

    • modelgateway which sends requests to model inference servers

    # samples/models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    import os
    os.environ["NAMESPACE"] = "seldon-mesh"
    MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    MESH_IP=MESH_IP[0]
    import os
    os.environ['MESH_IP'] = MESH_IP
    MESH_IP
    '172.19.255.1'
  • dataflow-engine which runs Kafka KStream topologies to manage data streamed between models in a pipeline
  • the Kafka cluster

  • Given that Core 2 uses Kafka as the messaging system to communicate data between models, lowering the network latency between Core 2 and Kafka (especially when the Kafka installation is in a separate cluster to Core) will improve pipeline performance.

    Additionally, the number of Kafka partitions per topic (which must be fixed across all models in a pipeline) significantly influences

    1. Kafka’s maximum throughput, and

    2. the effective number of replicas for pipelinegateway, dataflow-engine and modelgateway

    As a baseline for serving a high inferencing RPS across multiple pipelines, we recommend using as many replicas of pipelinegateway and dataflow-engine as you have Kafka topic partitions in order to leverage the balanced distribution of inference traffic. In this case, each dataflow-engine will process the data from one partition, across all pipelines. Increasing the number of dataflow-engine replicas further starts sharding the pipelines across the available replicas, with each pipeline being processed by maxShardCountMultiplier replicas (see the detailed pipeline scalability docs for configuration details).
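    As a sketch of how this baseline might be expressed, assuming your SeldonRuntime supports per-component overrides with a replicas field and that the component names below match your installation, matching the replica counts to four Kafka partitions per topic could look like:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: SeldonRuntime
    metadata:
      name: seldon
    spec:
      seldonConfig: default
      overrides:
      - name: seldon-pipelinegateway
        replicas: 4                   # one replica per topic partition
      - name: seldon-dataflow-engine
        replicas: 4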

    Similarly, modelgateway can handle more throughput if its number of workers and number of replicas are increased. modelgateway has two scalability parameters that can be set via environment variables:

    • MODELGATEWAY_NUM_WORKERS

    • MODELGATEWAY_MAX_NUM_CONSUMERS

    Each model within a Kubernetes namespace is consistently assigned to one modelgateway consumer (based on its index in a hash table of size MODELGATEWAY_MAX_NUM_CONSUMERS). The size of the hash table influences how many models will share the same consumer.

    For each consumer, MODELGATEWAY_NUM_WORKERS lightweight inference workers (goroutines) are created to forward requests to the inference servers and wait for responses.

    Increasing these parameters (starting with an increase in the number of workers) will improve throughput if the modelgateway pod has enough resources to support more workers and consumers.
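    For illustration only, one place these environment variables could be set is the modelgateway component's podSpec in your SeldonConfig. The component and container names below are assumptions; check the default SeldonConfig in your installation for the exact names:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: SeldonConfig
    metadata:
      name: default
    spec:
      components:
      - name: seldon-modelgateway     # assumed component name
        podSpec:
          containers:
          - name: modelgateway        # assumed container name
            env:
            - name: MODELGATEWAY_NUM_WORKERS
              value: "16"
            - name: MODELGATEWAY_MAX_NUM_CONSUMERS
              value: "50"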

    %env INFER_ENDPOINT=0.0.0.0:9000
    env: INFER_ENDPOINT=0.0.0.0:9000
    
    cat ./models/tfsimple1.yaml
    Examples
    • CIFAR10 image classification with drift detector

    • Tabular income classification model with drift detector

    # samples/models/cifar10-drift-detect.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: cifar10-drift
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/drift-detector"
      requirements:
        - mlserver
        - alibi-detect

    Models will still be ready even though the Pipeline has been terminated

    Kubernetes Resource Example

    Note: The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended. For more details, see Seldon CLI.

    Now when we call the iris model we should see a roughly 50/50 split between the two models.

    Now we update the experiment to change to a split with the iris3 model.

    Now we should see a split with the iris3 model.

    Now that the experiment has been stopped, we check everything is as before.

    Test change default model in an experiment

    Here we test changing the model we want to split traffic on. We will use three SKLearn Iris classification models to illustrate.

    Let's call all three models to verify initial conditions.

    Now we start an experiment to change calls to the iris model to split with the iris2 model.

    Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.

    Now let's change the model the experiment modifies to the iris3 model, splitting traffic between that and iris2.

    Let's check that the iris model is now as before, but the iris3 model has its traffic split.

    Finally, let's check that, now the experiment has stopped, everything is as it was at the start.

    seldon model load -f ./models/sklearn1.yaml
    seldon model load -f ./models/sklearn2.yaml
    seldon model load -f ./models/sklearn3.yaml
    {}
    {}
    {}
    
    # samples/auth/agent.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: seldon-agent
    data:
      agent.json: |-
       {
          "rclone" : {
              "config_secrets": ["seldon-rclone-gs-public","minio-secret"]
          }
       }
    config:
      agentConfig:
        rclone:
          configSecrets:
            - my-s3
            - custom-gcs
            - minio-in-cluster
    apiVersion: mlops.seldon.io/v1alpha1
    kind: SeldonRuntime
    metadata:
      name: seldon
    spec:
      seldonConfig: default
      config:
        agentConfig:
          rclone:
            config_secrets:
              - my-s3
              - custom-gcs
              - minio-in-cluster
      ...
    # samples/auth/minio-secret.yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: minio-secret
      namespace: seldon-mesh
    type: Opaque
    stringData:
      s3: |
        type: s3
        name: s3
        parameters:
          provider: minio
          env_auth: false
          access_key_id: minioadmin
          secret_access_key: minioadmin
          endpoint: http://minio.minio-system:9000
    # samples/models/sklearn-iris-minio.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "s3://models/iris"
      secretName: "minio-secret"
      requirements:
      - sklearn
    gcloud iam service-accounts keys create \
      gcloud-application-credentials.json \
      --iam-account [SERVICE-ACCOUNT--NAME]@[PROJECT-ID].iam.gserviceaccount.com
    apiVersion: v1
    kind: Secret
    metadata:
      name: gcs-bucket
    type: Opaque
    stringData:
      gcs: |
        type: gcs
        name: gcs
        parameters:
          service_account_credentials: '<gcloud-application-credentials.json>'
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mymodel
    spec:
      storageUri: "gcs://my-bucket/my-path/my-pytorch-model"
      secretName: "gcs-bucket"
      requirements:
      - pytorch
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    seldon model load -f ./models/tfsimple1.yaml
    {}
    
    seldon model status tfsimple1 -w ModelAvailable | jq -M .
    {}
    
    seldon model infer tfsimple1 --inference-host ${INFER_ENDPOINT} \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    {
    	"model_name": "tfsimple1_1",
    	"model_version": "1",
    	"outputs": [
    		{
    			"name": "OUTPUT0",
    			"datatype": "INT32",
    			"shape": [
    				1,
    				16
    			],
    			"data": [
    				2,
    				4,
    				6,
    				8,
    				10,
    				12,
    				14,
    				16,
    				18,
    				20,
    				22,
    				24,
    				26,
    				28,
    				30,
    				32
    			]
    		},
    		{
    			"name": "OUTPUT1",
    			"datatype": "INT32",
    			"shape": [
    				1,
    				16
    			],
    			"data": [
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0
    			]
    		}
    	]
    }
    
    seldon model infer tfsimple1 --inference-mode grpc  --inference-host ${INFER_ENDPOINT} \
        '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'
    {"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
    
    curl http://${INFER_ENDPOINT}/v2/models/tfsimple1/infer -H "Content-Type: application/json" -H "seldon-model: tfsimple1" \
            -d '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    {"model_name":"tfsimple1_1","model_version":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":[1,16],"data":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]},{"name":"OUTPUT1","datatype":"INT32","shape":[1,16],"data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}]}
    
    grpcurl -d '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' \
        -plaintext \
        -import-path ../apis \
        -proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
        -rpc-header seldon-model:tfsimple1 \
        ${INFER_ENDPOINT} inference.GRPCInferenceService/ModelInfer
    {
      "modelName": "tfsimple1_1",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ]
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ]
        }
      ],
      "rawOutputContents": [
        "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==",
        "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="
      ]
    }
    
    cat ./pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    seldon pipeline load -f ./pipelines/tfsimple.yaml
    {}
    
    seldon pipeline status tfsimple -w PipelineReady
    {"pipelineName":"tfsimple","versions":[{"pipeline":{"name":"tfsimple","uid":"cg5fm6c6dpcs73c4qhe0","version":1,"steps":[{"name":"tfsimple1"}],"output":{"steps":["tfsimple1.outputs"]},"kubernetesMeta":{}},"state":{"pipelineVersion":1,"status":"PipelineReady","reason":"created pipeline","lastChangeTimestamp":"2023-03-10T09:40:41.317797761Z","modelsReady":true}}]}
    
    seldon pipeline infer tfsimple  --inference-host ${INFER_ENDPOINT} \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    {
    	"model_name": "",
    	"outputs": [
    		{
    			"data": [
    				2,
    				4,
    				6,
    				8,
    				10,
    				12,
    				14,
    				16,
    				18,
    				20,
    				22,
    				24,
    				26,
    				28,
    				30,
    				32
    			],
    			"name": "OUTPUT0",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		},
    		{
    			"data": [
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0
    			],
    			"name": "OUTPUT1",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		}
    	]
    }
    
    seldon pipeline infer tfsimple --inference-mode grpc  --inference-host ${INFER_ENDPOINT} \
        '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'
    {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
    
    curl http://${INFER_ENDPOINT}/v2/models/tfsimple1/infer -H "Content-Type: application/json" -H "seldon-model: tfsimple.pipeline" \
            -d '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    {"model_name":"","outputs":[{"data":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32],"name":"OUTPUT0","shape":[1,16],"datatype":"INT32"},{"data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"name":"OUTPUT1","shape":[1,16],"datatype":"INT32"}]}
    
    grpcurl -d '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' \
        -plaintext \
        -import-path ../apis \
        -proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
        -rpc-header seldon-model:tfsimple.pipeline \
        ${INFER_ENDPOINT} inference.GRPCInferenceService/ModelInfer
    {
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ]
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ]
        }
      ],
      "rawOutputContents": [
        "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==",
        "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="
      ]
    }
    
    seldon pipeline unload tfsimple
    seldon model unload tfsimple1
    {}
    {}
    
    cat kind-config.yaml
    apiVersion: kind.x-k8s.io/v1alpha4
    kind: Cluster
    nodes:
    - role: control-plane
      extraMounts:
        - hostPath: /tmp/models
          containerPath: /models
    mkdir -p /tmp/models
    gsutil cp -r gs://seldon-models/mlserver/iris /tmp/models
    cat pvc.yaml
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-path-immediate
    provisioner: rancher.io/local-path
    reclaimPolicy: Delete
    mountOptions:
      - debug
    volumeBindingMode: Immediate
    ---
    kind: PersistentVolume
    apiVersion: v1
    metadata:
      name: ml-models-pv
      namespace: seldon-mesh
      labels:
        type: local
    spec:
      storageClassName: local-path-immediate
      capacity:
        storage: 1Gi
      accessModes:
        - ReadWriteOnce
      hostPath:
        path: "/models"
    ---
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: ml-models-pvc
      namespace: seldon-mesh
    spec:
      storageClassName: local-path-immediate
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      selector:
        matchLabels:
          type: local
    cat server.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-pvc
    spec:
      serverConfig: mlserver
      extraCapabilities:
      - "pvc"
      podSpec:
        volumes:
        - name: models-pvc
          persistentVolumeClaim:
            claimName: ml-models-pvc
        containers:
        - name: rclone
          volumeMounts:
          - name: models-pvc
            mountPath: /var/models
    cat ./iris.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "/var/models/iris"
      requirements:
      - sklearn
      - pvc
    kubectl create -f iris.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/iris created
    kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
    model.mlops.seldon.io/iris condition met
    kubectl get model iris -n ${NAMESPACE} -o jsonpath='{.status}' | jq -M .
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-12-24T11:04:37Z",
          "status": "True",
          "type": "ModelReady"
        },
        {
          "lastTransitionTime": "2022-12-24T11:04:37Z",
          "status": "True",
          "type": "Ready"
        }
      ],
      "replicas": 1
    }
    curl -k http://${MESH_IP}:80/v2/models/iris/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: iris" \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "predict",
            "datatype": "FP32",
            "shape": [1,4],
            "data": [[1,2,3,4]]
          }
        ]
      }' | jq -M .
    seldon model infer iris --inference-host ${MESH_IP}:80 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "dc032bcc-3f4e-4395-a2e4-7c1e3ef56e9e",
    	"parameters": {
    		"content_type": null,
    		"headers": null
    	},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": null,
    			"data": [
    				2
    			]
    		}
    	]
    }
    curl -k http://${MESH_IP}:80/v2/models/iris/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: iris" \
      -H "Content-Type: application/json" \
      -d '{
        "model_name": "iris",
        "inputs": [
          {
            "name": "input",
            "datatype": "FP32",
            "shape": [1,4],
            "data": [1,2,3,4]
          }
        ]
      }' | jq -M .
    seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP}:80 \
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
    {
      "modelName": "iris_1",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "predict",
          "datatype": "INT64",
          "shape": [
            "1",
            "1"
          ],
          "contents": {
            "int64Contents": [
              "2"
            ]
          }
        }
      ]
    }
    kubectl delete -f ./iris.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "iris" deleted
    # samples/pipelines/cifar10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: cifar10-production
    spec:
      steps:
        - name: cifar10
        - name: cifar10-outlier
        - name: cifar10-drift
          batch:
            size: 20
      output:
        steps:
        - cifar10
        - cifar10-outlier.outputs.is_outlier
    seldon model load -f ./models/tfsimple1.yaml
    seldon model status tfsimple1 -w ModelAvailable
    {}
    {}
    
    kubectl apply -f ./models/tfsimple2.yaml -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model tfsimple2 -n ${NAMESPACE}
    model.mlops.seldon.io/tfsimple2 created
    model.mlops.seldon.io/tfsimple2 condition met
    seldon model load -f ./models/tfsimple2.yaml
    seldon model status tfsimple2 -w ModelAvailable | jq -M .
    {}
    {}
    
    %env INFER_REST_ENDPOINT=http://0.0.0.0:9000
    %env INFER_GRPC_ENDPOINT=0.0.0.0:9000
    %env SELDON_SCHEDULE_HOST=0.0.0.0:9004
    env: INFER_REST_ENDPOINT=http://0.0.0.0:9000
    env: INFER_GRPC_ENDPOINT=0.0.0.0:9000
    env: SELDON_SCHEDULE_HOST=0.0.0.0:9004
    
    #%env INFER_REST_ENDPOINT=http://172.19.255.1:80
    #%env INFER_GRPC_ENDPOINT=172.19.255.1:80
    #%env SELDON_SCHEDULE_HOST=172.19.255.2:9004
    cat ./pipelines/tfsimples.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimples
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
          inputs:
          - tfsimple1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple1.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple2
    
    curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
    grpcurl -d '{"name":"tfsimples"}' \
        -plaintext \
        -import-path ../apis \
        -proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
        -rpc-header seldon-model:tfsimples.pipeline \
        ${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady
    ERROR:
      Code: Unimplemented
      Message:
    
    kubectl apply -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}
    pipeline.mlops.seldon.io/tfsimples created
    kubectl wait --for condition=ready --timeout=300s pipeline tfsimples -n ${NAMESPACE}
    seldon pipeline load -f ./pipelines/tfsimples.yaml
    seldon pipeline status tfsimples -w PipelineReady
    {"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"ciepit2i8ufs73flaitg", "version":1, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:47:16.365934922Z"}}]}
    
    seldon pipeline status tfsimples | jq .versions[0].state.modelsReady
    null
    
    curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
    grpcurl -d '{"name":"tfsimples"}' \
        -plaintext \
        -import-path ../apis \
        -proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
        -rpc-header seldon-model:tfsimples.pipeline \
        ${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady
    {
    
    }
    
    kubectl apply -f ./models/tfsimple1.yaml -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple1 condition met
    curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
    grpcurl -d '{"name":"tfsimples"}' \
        -plaintext \
        -import-path ../apis \
        -proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
        -rpc-header seldon-model:tfsimples.pipeline \
        ${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady
    {
    
    }
    
    curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
    grpcurl -d '{"name":"tfsimples"}' \
        -plaintext \
        -import-path ../apis \
        -proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
        -rpc-header seldon-model:tfsimples.pipeline \
        ${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady
    {
      "ready": true
    }
    
    seldon pipeline status tfsimples | jq .versions[0].state.modelsReady
    true
    
    seldon pipeline unload tfsimples
    curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
    grpcurl -d '{"name":"tfsimples"}' \
        -plaintext \
        -import-path ../apis \
        -proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
        -rpc-header seldon-model:tfsimples.pipeline \
        ${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady
    ERROR:
      Code: Unimplemented
      Message:
    
    seldon pipeline status tfsimples | jq .versions[0].state.modelsReady
    true
    
    seldon pipeline load -f ./pipelines/tfsimples.yaml
    seldon pipeline status tfsimples -w PipelineReady
    {"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"ciepj5qi8ufs73flaiu0", "version":1, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:47:51.626155116Z", "modelsReady":true}}]}
    
    curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
    grpcurl -d '{"name":"tfsimples"}' \
        -plaintext \
        -import-path ../apis \
        -proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
        -rpc-header seldon-model:tfsimples.pipeline \
        ${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady
    {
      "ready": true
    }
    
    seldon pipeline status tfsimples | jq .versions[0].state.modelsReady
    true
    
    seldon model unload tfsimple1
    seldon model unload tfsimple2
    seldon pipeline status tfsimples | jq .versions[0].state.modelsReady
    null
    
    seldon pipeline unload tfsimples
    import os
    os.environ["NAMESPACE"] = "seldon-mesh"
    MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    MESH_IP=MESH_IP[0]
    import os
    os.environ['MESH_IP'] = MESH_IP
    MESH_IP
    '172.19.255.1'
    
    kubectl create -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}
    pipeline.mlops.seldon.io/tfsimples created
    
    kubectl wait --for condition=ready --timeout=1s pipeline --all -n ${NAMESPACE}
    error: timed out waiting for the condition on pipelines/tfsimples
    
    kubectl get pipeline tfsimples -o jsonpath='{.status.conditions[0]}' -n ${NAMESPACE}
    {"lastTransitionTime":"2022-11-14T10:25:31Z","status":"False","type":"ModelsReady"}
    
    kubectl create -f ./models/tfsimple1.yaml -n ${NAMESPACE}
    kubectl create -f ./models/tfsimple2.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n ${NAMESPACE}
    pipeline.mlops.seldon.io/tfsimples condition met
    
    kubectl get pipeline tfsimples -o jsonpath='{.status.conditions[0]}' -n ${NAMESPACE}
    {"lastTransitionTime":"2022-11-14T10:25:49Z","status":"True","type":"ModelsReady"}
    
    kubectl delete -f ./models/tfsimple1.yaml -n ${NAMESPACE}
    kubectl delete -f ./models/tfsimple2.yaml -n ${NAMESPACE}
    kubectl delete -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "tfsimple1" deleted
    model.mlops.seldon.io "tfsimple2" deleted
    pipeline.mlops.seldon.io "tfsimples" deleted
    
    seldon model status iris -w ModelAvailable
    seldon model status iris2 -w ModelAvailable
    seldon model status iris3 -w ModelAvailable
    {}
    {}
    {}
    
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris2 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::50]
    
    seldon model infer iris3 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris3_1::50]
    
    cat ./experiments/ab-default-model.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-sample
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 50
      - name: iris2
        weight: 50
    
    seldon experiment start -f ./experiments/ab-default-model.yaml
    {}
    
    seldon experiment status experiment-sample -w | jq -M .
    {
      "experimentName": "experiment-sample",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon model infer iris -i 100 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::48 :iris_1::52]
    
    cat ./experiments/ab-default-model2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-sample
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 50
      - name: iris3
        weight: 50
    
    seldon experiment start -f ./experiments/ab-default-model2.yaml
    {}
    
    seldon experiment status experiment-sample -w | jq -M .
    {
      "experimentName": "experiment-sample",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon model infer iris -i 100 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris3_1::42 :iris_1::58]
    
    seldon experiment stop experiment-sample
    {}
    
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris2 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::50]
    
    seldon model infer iris3 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris3_1::50]
    
    seldon model unload iris
    seldon model unload iris2
    seldon model unload iris3
    {}
    {}
    {}
    
    seldon model load -f ./models/sklearn1.yaml
    seldon model load -f ./models/sklearn2.yaml
    seldon model load -f ./models/sklearn3.yaml
    {}
    {}
    {}
    
    seldon model status iris -w ModelAvailable
    seldon model status iris2 -w ModelAvailable
    seldon model status iris3 -w ModelAvailable
    {}
    {}
    {}
    
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris2 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::50]
    
    seldon model infer iris3 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris3_1::50]
    
    cat ./experiments/ab-default-model.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-sample
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 50
      - name: iris2
        weight: 50
    
    seldon experiment start -f ./experiments/ab-default-model.yaml
    {}
    
    seldon experiment status experiment-sample -w | jq -M .
    {
      "experimentName": "experiment-sample",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon model infer iris -i 100 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::51 :iris_1::49]
    
    cat ./experiments/ab-default-model3.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-sample
    spec:
      default: iris3
      candidates:
      - name: iris3
        weight: 50
      - name: iris2
        weight: 50
    
    seldon experiment start -f ./experiments/ab-default-model3.yaml
    {}
    
    seldon experiment status experiment-sample -w | jq -M .
    {
      "experimentName": "experiment-sample",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris3 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::25 :iris3_1::25]
    
    seldon model infer iris2 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::50]
    
    seldon experiment stop experiment-sample
    {}
    
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris2 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::50]
    
    seldon model infer iris3 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris3_1::50]
    
    seldon model unload iris
    seldon model unload iris2
    seldon model unload iris3
    {}
    {}
    {}
    

    The default configuration is shown below.

    SeldonRuntime

    Testing Pipelines

    In order to understand how pipelines will behave in a production setting, it is first helpful to isolate testing to the individual models within the pipeline. By obtaining the maximum throughput for each model given the infrastructure it is running on (e.g. GPU specs) and its configuration (e.g. how many workers are used), users will gain a better understanding of which models might be limiting the performance of the pipeline around them. Given the performance profiles of models within a pipeline, it is recommended to optimize for the desired performance of those individual models first (see here), ensuring they have the right number of replicas and are running on the right infrastructure in order to achieve the right level of performance. Similarly, once a model within a pipeline is identified as a bottleneck, refer back to the models section to optimize the performance of that model.

    The performance behavior of Seldon Core 2 pipelines is more complex compared to individual models. Inference request latency can be broken down into:

    1. The sum of the latencies for each model in the critical path of the pipeline (the one containing the bottleneck)

    2. The per-stage data processing overhead caused by pipeline-specific operations (data movement, copying, writing to Kafka topics, etc)

    Reducing Bottlenecks

    To simplify, first consider a linear pipeline that consists of a chain of sequential components:

    Caption: "Linear pipeline (chain)"

    The maximum throughput achievable by the pipeline is the minimum of the maximum throughputs achievable by each individual model in the pipeline, given a number of workers. Exceeding this maximum will create a bottleneck in processing, degrading performance for the pipeline. To prevent bottlenecking from a given step in your pipeline, refer back to the models section to optimize the performance of that model. For example, you can:

    • Increase the resources available to a model’s server, as well as the number of workers

    • Increase the number of model replicas among which the load is balanced (autoscaling can help set the correct number)

    In the case of a more complex, non-linear pipeline, the first step is to identify the critical path within the pipeline (the path containing the model whose throughput gets saturated first), and then, based on that critical path, follow the same steps above for any model in that path that creates a bottleneck. There will always be a bottleneck model in a pipeline; the goal is to balance the number of replicas of each model and/or the MLServer resources together with the number of workers, so that each stage in the pipeline is able to handle the request throughput with as little queueing as possible between the pipeline stages, therefore reducing overall latency.

    Caption: "Critical Path in a complex pipeline"

    Health

    Models

    This section covers various aspects of optimizing model performance in Seldon Core 2, from initial load testing to infrastructure setup and inference optimization. Each subsection provides detailed guidance on different aspects of model performance tuning:

    Load Testing

    Learn how to conduct effective load testing to understand your model's performance characteristics:

    • Determining load saturation points

    • Understanding closed-loop vs. open-loop testing

    • Determining the right number of replicas based on your configuration (model, infrastructure, etc.)

    • Setting up reproducible test environments

    • Interpreting test results for autoscaling configuration

    Explore different approaches to optimize inference performance:

    • Choosing between gRPC and REST protocols

    • Implementing adaptive batching

    • Optimizing input dimensions

    • Configuring parallel processing with workers

    Understand how to configure the underlying infrastructure for optimal model performance:

    • Choosing between CPU and GPU deployments

    • Setting appropriate CPU specifications

    • Configuring thread affinity

    • Managing memory allocation

    Each of these aspects plays a crucial role in achieving optimal model performance. We recommend starting with load testing to establish a baseline, then using the insights gained to inform your infrastructure setup and inference optimization strategies.

    Metadata

    Pipelines

    Learn how to create and manage ML inference pipelines in Seldon Core, including model chaining, tensor mapping, and conditional logic.

    Pipelines allow models to be connected into flows of data transformations. This allows more complex machine learning pipelines to be created with multiple models, feature transformations and monitoring components such as drift and outlier detectors.

    Creating Pipelines

    The simplest way to create Pipelines is by defining them with the Pipeline resource we provide for Kubernetes. This format is accepted by our Kubernetes implementation and can also be used locally via our seldon CLI.

    Internally, in both cases, Pipelines are created via our scheduler API. Advanced users could submit Pipelines directly using this gRPC service.

    Inference Artifacts

    Learn how to manage model artifacts in Seldon Core, including storage, versioning, and deployment workflows.

    To run your model inside Seldon you must supply an inference artifact that can be downloaded and run on either an MLServer or Triton inference server. We list the artifact types below in alphabetical order.

    Type
    Server
    Tag
    Example

    Server Runtime

    Learn about SeldonRuntime, a Kubernetes resource for creating and managing Seldon Core instances in specific namespaces with configurable settings.

    The SeldonRuntime resource is used to create an instance of Seldon installed in a particular namespace.

    For the definition of SeldonConfiguration above, see the SeldonConfig resource.

    The specification above contains overrides for the chosen SeldonConfig. To override the PodSpec for a given component, the overrides field needs to specify the component name and the PodSpec needs to specify the container name, along with fields to override.

    For instance, the following overrides the resource limits for cpu and memory in the hodometer component in the seldon-mesh namespace, while using values specified in the seldonConfig elsewhere (e.g. default).

    Observability

    Installing kube-prometheus-stack in the same Kubernetes cluster that hosts Seldon Core 2.

    kube-prometheus, also known as Prometheus Operator, is a popular open-source project that provides complete monitoring and alerting solutions for Kubernetes clusters. It combines tools and components to create a monitoring stack for Kubernetes environments.

    Note: In this example Prometheus is installed within the same Kubernetes cluster as Seldon Core 2. However, Seldon Core 2 can also expose metrics to managed Prometheus endpoints.

    Seldon Core 2, along with any deployed models, automatically exposes metrics to Prometheus. By default, certain alerting rules are pre-configured, and an alertmanager instance is included.

    You can install kube-prometheus to monitor Seldon components, and ensure that the appropriate ServiceMonitors are in place for Seldon deployments.

    Inference

    type SeldonConfigSpec struct {
    	Components []*ComponentDefn    `json:"components,omitempty"`
    	Config     SeldonConfiguration `json:"config,omitempty"`
    }
    
    type SeldonConfiguration struct {
    	TracingConfig TracingConfig      `json:"tracingConfig,omitempty"`
    	KafkaConfig   KafkaConfig        `json:"kafkaConfig,omitempty"`
    	AgentConfig   AgentConfiguration `json:"agentConfig,omitempty"`
    	ServiceConfig ServiceConfig      `json:"serviceConfig,omitempty"`
    }
    
    type ServiceConfig struct {
    	GrpcServicePrefix string         `json:"grpcServicePrefix,omitempty"`
    	ServiceType       v1.ServiceType `json:"serviceType,omitempty"`
    }
    
    type KafkaConfig struct {
    	BootstrapServers      string                        `json:"bootstrap.servers,omitempty"`
    	ConsumerGroupIdPrefix string                        `json:"consumerGroupIdPrefix,omitempty"`
    	Debug                 string                        `json:"debug,omitempty"`
    	Consumer              map[string]intstr.IntOrString `json:"consumer,omitempty"`
    	Producer              map[string]intstr.IntOrString `json:"producer,omitempty"`
    	Streams               map[string]intstr.IntOrString `json:"streams,omitempty"`
    	TopicPrefix           string                        `json:"topicPrefix,omitempty"`
    }
    
    type AgentConfiguration struct {
    	Rclone RcloneConfiguration `json:"rclone,omitempty" yaml:"rclone,omitempty"`
    }
    
    type RcloneConfiguration struct {
    	ConfigSecrets []string `json:"config_secrets,omitempty" yaml:"config_secrets,omitempty"`
    	Config        []string `json:"config,omitempty" yaml:"config,omitempty"`
    }
    
    type TracingConfig struct {
    	Disable              bool   `json:"disable,omitempty"`
    	OtelExporterEndpoint string `json:"otelExporterEndpoint,omitempty"`
    	OtelExporterProtocol string `json:"otelExporterProtocol,omitempty"`
    	Ratio                string `json:"ratio,omitempty"`
    }
    
    type ComponentDefn struct {
    	// +kubebuilder:validation:Required
    
    	Name                 string                  `json:"name"`
    	Labels               map[string]string       `json:"labels,omitempty"`
    	Annotations          map[string]string       `json:"annotations,omitempty"`
    	Replicas             *int32                  `json:"replicas,omitempty"`
    	PodSpec              *v1.PodSpec             `json:"podSpec,omitempty"`
    	VolumeClaimTemplates []PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"`
    }

    As a minimal usage, you should just define the SeldonConfig to use as a base for this install; for example, to install in the seldon-mesh namespace with the SeldonConfig named default:

    The helm chart seldon-core-v2-runtime allows easy creation of this resource and associated default Servers for an installation of Seldon in a particular namespace.

    SeldonConfig Update Propagation

    When a SeldonConfig resource changes, any SeldonRuntime resources that reference the changed SeldonConfig will also be updated immediately. If this behaviour is not desired, you can set spec.disableAutoUpdate in the SeldonRuntime resource so that it is not updated immediately, but only when the SeldonRuntime itself or any owned resource changes.
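    For example, a minimal sketch of a SeldonRuntime that opts out of immediate updates (reusing the default SeldonConfig and the seldon-mesh namespace from the examples in this section) might look like:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: SeldonRuntime
    metadata:
      name: seldon
      namespace: seldon-mesh
    spec:
      seldonConfig: default
      disableAutoUpdate: true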

    type SeldonRuntimeSpec struct {
    	SeldonConfig string              `json:"seldonConfig"`
    	Overrides    []*OverrideSpec     `json:"overrides,omitempty"`
    	Config       SeldonConfiguration `json:"config,omitempty"`
    	// +Optional
    	// If set then when the referenced SeldonConfig changes we will NOT update the SeldonRuntime immediately.
    	// Explicit changes to the SeldonRuntime itself will force a reconcile though
    	DisableAutoUpdate bool `json:"disableAutoUpdate,omitempty"`
    }
    
    type OverrideSpec struct {
    	Name        string         `json:"name"`
    	Disable     bool           `json:"disable,omitempty"`
    	Replicas    *int32         `json:"replicas,omitempty"`
    	ServiceType v1.ServiceType `json:"serviceType,omitempty"`
    	PodSpec     *PodSpec       `json:"podSpec,omitempty"`
    }
    apiVersion: mlops.seldon.io/v1alpha1
    kind: SeldonRuntime
    metadata:
      name: seldon
      namespace: seldon-mesh
    spec:
      overrides:
      - name: hodometer
        podSpec:
          containers:
          - name: hodometer
            resources:
              limits:
                memory: 64Mi
                cpu: 20m
      seldonConfig: default
    apiVersion: mlops.seldon.io/v1alpha1
    kind: SeldonRuntime
    metadata:
      name: seldon
      namespace: seldon-mesh
    spec:
      seldonConfig: default
    An example that chains two models together is shown below:
    • steps allow you to specify the models you want to combine into a pipeline. Each step name will correspond to a model of the same name. These models will need to have been deployed and available for the Pipeline to function, however Pipelines can be deployed before or at the same time you deploy the underlying models.

    • steps.inputs allow you to specify the inputs to this step.

    • output.steps allow you to specify the output of the Pipeline. A pipeline can have multiple paths, including flows of data that do not reach the output, e.g. drift detection steps. However, if you wish to call your Pipeline in a synchronous manner via REST/gRPC then an output must be present so the Pipeline can be treated as a function.

    Expressing input data sources

    Model step inputs are defined with a dot notation of the form:

    Inputs with just a step name will be assumed to be step.outputs.

    The default payload format for Pipelines is the V2 protocol, which requires named tensors as inputs and outputs from a model. If you require just certain tensors from a model you can reference those in the inputs, e.g. mymodel.outputs.t1 will reference the tensor t1 from the model mymodel.

    For the specification, see the V2 protocol documentation.

    Chain

    The simplest Pipeline chains models together: the output of one model goes into the input of the next. This will work out of the box if the output tensor names from a model match the input tensor names for the one being chained to. If they do not then the tensorMap construct presently needs to be used to define the mapping explicitly, e.g. see below for a simple chained pipeline of two tfsimple example models:

    In the above we rename tensor OUTPUT0 to INPUT0 and OUTPUT1 to INPUT1. This allows these models to be chained together. The shape and data-type of the tensors need to match as well.

    This example can be found in the pipeline examples.
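    For orientation, a minimal sketch of such a chained pipeline is shown here; the model and tensor names assume the Triton tfsimple example models and are illustrative only:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-chained
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
          inputs:
          - tfsimple1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple1.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple2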

    Join

    Joining allows us to combine outputs from multiple steps as input to a new step.

    Caption: "Joining the outputs of two models into a third model. The dashed lines signify model outputs that are not captured in the output of the pipeline."

    Here we pass the pipeline inputs to two models and then take one output tensor from each and pass them to the final model. We use the same tensorMap technique to rename tensors as discussed in the previous section.

    Joins can have a join type, which is specified with inputsJoinType and can take the following values (a sketch follows below):

    • inner: require all inputs to be available to join.

    • outer: wait up to joinWindowMs to join the inputs, ignoring any inputs that have not sent data by that point. This means this step of the pipeline is guaranteed to have a latency of at least joinWindowMs.

    • any: wait for any of the specified data sources.

    This example can be found in the pipeline examples.
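    As a rough sketch (model names are illustrative, and field placement follows the step spec used in the examples in this section), an outer join that waits up to 500ms for its inputs could look like:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: join-outer
    spec:
      steps:
        - name: modela
        - name: modelb
        - name: combiner
          inputs:
          - modela.outputs.OUTPUT0
          - modelb.outputs.OUTPUT0
          tensorMap:
            modela.outputs.OUTPUT0: INPUT0
            modelb.outputs.OUTPUT0: INPUT1
          inputsJoinType: outer
          joinWindowMs: 500
      output:
        steps:
        - combiner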

    Conditional Logic

    Pipelines can create conditional flows via various methods. We will discuss each in turn.

    Model routing via tensors

    The simplest way is to create a model that outputs different named tensors based on its decision. This way, downstream steps can be dependent on different expected tensors. An example is shown below:

    Caption: "Pipeline with a conditional output model. The model conditional only outputs one of the two tensors, so only one path through the graph (red or blue) is taken by a single request"

    In the above we have a step conditional that either outputs a tensor named OUTPUT0 or a tensor named OUTPUT1. The mul10 step depends on an output in OUTPUT0 while the add10 step depends on an output from OUTPUT1.

    Note, we also have a final Pipeline output step that does an any join on these two models, essentially outputting from the pipeline whichever data arrives from either model. This type of Pipeline can be used for Multi-Armed Bandit solutions where you want to route traffic dynamically.

    This example can be found in the pipeline examples.

    Errors

    It's also possible to abort pipelines when an error is produced, in effect creating a condition. This is illustrated below:

    This Pipeline runs normally or throws an error based on whether the input tensors have certain values.

    Triggers

    Sometimes you want to run a step if an output is received from a previous step but not to send the data from that step to the model. This is illustrated below:

    Caption: "A pipeline with a single trigger. The model tfsimple3 only runs if the model check returns a tensor named OUTPUT. The green edge signifies that this is a trigger and not an additional input to tfsimple3. The dashed lines signify model outputs that are not captured in the output of the pipeline."

    In this example the last step tfsimple3 runs only if there are outputs from tfsimple1 and tfsimple2 but also data from the check step. However, if the step tfsimple3 is run it only receives the join of data from tfsimple1 and tfsimple2.

    This example can be found in the pipeline examples.

    Trigger Joins

    You can also define multiple triggers which need to happen based on a particular join type. For example:

    Caption: "A pipeline with multiple triggers and a trigger join of type any. The pipeline has four inputs, but three of these are optional (signified by the dashed borders)."

    Here the mul10 step is run if data is seen on the pipeline inputs in the ok1 or ok2 tensors based on the any join type. If data is seen on ok3 then the add10 step is run.

    If we changed the triggersJoinType for mul10 to inner then both ok1 and ok2 would need to appear before mul10 is run.

    Pipeline Inputs

    Pipelines by default can be accessed synchronously via http/grpc or asynchronously via the Kafka topic created for them. However, it's also possible to create a pipeline to take input from one or more other pipelines by specifying an input section. If for example we already have the tfsimple pipeline shown below:

    We can create another pipeline which takes its input from this pipeline, as shown below:

    Caption: "A pipeline taking as input the output of another pipeline."

    In this way pipelines can be built to extend existing running pipelines to allow extensibility and sharing of data flows.

    The spec follows the same spec for a step except that references to other pipelines are contained in the externalInputs section which takes the form of pipeline or pipeline.step references:

    • <pipelineName>.(inputs|outputs).<tensorName>

    • <pipelineName>.(step).<stepName>.<tensorName>

    Tensor names are optional and only needed if you want to take just one tensor from an input or output.

    There is also an externalTriggers section which allows triggers from other pipelines.

    Further examples can be found in the pipeline-to-pipeline examples.

    Present caveats:

    • Circular dependencies are not presently detected.

    • Pipeline status is local to each pipeline.

    Data Centric Implementation

    Internally, Pipelines are implemented using Kafka. Each input and output to a pipeline step has an associated Kafka topic. This has many advantages and makes auditing, replay, and debugging easier, as data is preserved from every step in your pipeline.

    Tracing allows you to monitor the processing latency of your pipelines.

    As each request to a pipeline moves through the steps, its data will appear in input and output topics. This allows a full audit of every transformation to be carried out.


    | Type | Server | Tag | Example |
    |---|---|---|---|
    | Alibi-Detect | MLServer | alibi-detect | example |
    | Alibi-Explain | MLServer | alibi-explain | example |
    | DALI | Triton | dali | TBC |
    | Huggingface | MLServer | huggingface | example |
    | LightGBM | MLServer | lightgbm | example |
    | MLFlow | MLServer | mlflow | example |
    | ONNX | Triton | onnx | example |
    | OpenVino | Triton | openvino | TBC |
    | Custom Python | MLServer | python, mlserver | example |
    | Custom Python | Triton | python, triton | example |
    | PyTorch | Triton | pytorch | example |
    | SKLearn | MLServer | sklearn | example |
    | Spark Mlib | MLServer | spark-mlib | TBC |
    | Tensorflow | Triton | tensorflow | example |
    | TensorRT | Triton | tensorrt | TBC |
    | Triton FIL | Triton | fil | TBC |
    | XGBoost | MLServer | xgboost | example |

    Saving Model artifacts

    For many machine learning artifacts you can simply save them to a folder and load them into Seldon Core 2. Details are given below as well as a link to creating a custom model settings file if needed.

    | Type | Notes | Custom Model Settings |
    |---|---|---|
    | Alibi-Detect | Save model using Alibi-Detect. | docs |
    | Alibi-Explain | Save model using Alibi-Explain. | docs |
    | DALI | Follow the Triton docs to create a config.pbtxt and model folder with the artifact. | docs |
    | Huggingface | Create an MLServer model-settings.json with the Huggingface model required. | docs |

    Custom MLServer Model Settings

    For MLServer targeted models you can create a model-settings.json file to help MLServer load your model, and place this alongside your artifact. See the MLServer project for details.

    Custom Triton Configuration

    For Triton inference server models you can create a config.pbtxt configuration file alongside your artifact.

    Notes

    The tag field represents the tag you need to add to the requirements part of the Model spec for your artifact to be loaded on a compatible server. e.g. for an sklearn model:


    The analytics component is configured with the Prometheus integration. The monitoring for Seldon Core 2 is based on the Prometheus Operator and the related PodMonitor and PrometheusRule resources.

    Monitoring the model deployments in Seldon Core 2 involves:

    1. Installing kube-prometheus

    2. Configuring monitoring

    Prerequisites

    1. Install Seldon Core 2.

    2. Install Ingress Controller.

    3. Install Grafana in the namespace seldon-monitoring.

    Installing kube-prometheus

    1. Create a namespace for the monitoring components of Seldon Core 2.

    2. Create a YAML file to specify the initial configuration. For example, create the prometheus-values.yaml file. Use your preferred text editor to create and save the file with the following content:

      Note: Make sure to include metric-labels-allowlist: pods=[*] in the Helm values file. If you are using your own Prometheus Operator installation, ensure that the pods labels, particularly app.kubernetes.io/managed-by=seldon-core, are part of the collected metrics. These labels are essential for calculating deployment usage rules.

    3. Change to the directory that contains the prometheus-values file and run the following command to install version 9.5.12 of kube-prometheus.

      When the installation is complete, you should see this:

    4. Check the status of the installation.

      When the installation is complete, you should see this:

    Configuring monitoring for Seldon Core 2

    1. You can access Prometheus from outside the cluster by running the following commands:

    2. You can access Alertmanager from outside the cluster by running the following commands:

    3. Apply the Custom RBAC Configuration settings for kube-prometheus.

    4. Configure metrics collection by creating the following PodMonitor resources.

      When the resources are created, you should see this:

    Next

    Prometheus User Interface

    You may now be able to check the status of Seldon components in Prometheus:

    1. Open your browser and navigate to http://127.0.0.1:9090/ to access Prometheus UI from outside the cluster.

    2. Go to Status and select Targets.

    The status of all the endpoints and the scrape details are displayed.

    Grafana

    You can view the metrics in a Grafana Dashboard after you set Prometheus as the Data Source and import the seldon.json dashboard located at seldon-core/v2.8.2/prometheus/dashboards in the GitHub repository.

    Quickstart

    In this notebook, we will demonstrate how to deploy a production-ready AI application with Seldon Core 2. This application will have two components - an sklearn model and a preprocessor written in Python - leveraging Core 2 Pipelines to connect the two. Once deployed, users will have an endpoint available to call the deployed application. The inference logic can be visualized with the following diagram:

    To do this we will:

    1. Set up a Server resource to deploy our models

    2. Deploy an sklearn Model

    3. Deploy a multi-step Pipeline, including a preprocessing step that will be run before calling our model.

    4. Call our inference endpoint, and observe data within our pipeline

    Setup: In order to run this demo, you need to connect to a cluster set up with an installation of Core 2 (see the installation docs). We will be using the kubectl command line tool to interact with the Kubernetes cluster's control plane. Lastly, we will be using the gcloud CLI to pull models from Seldon's cloud storage, where we provide sample models and files. Once you are set up, you can run this demo as a pre-built Jupyter notebook by accessing it in our GitHub repo (the v2 branch), under docs-gb/getting-started/quickstart/quickstart.ipynb

    Step 1: Deploy a Custom Server

    As part of the Core 2 installation, you will have MLServer and Triton Servers installed:

    The Server resource outlines attributes (dependency requirements, underlying infrastructure) for the runtimes that the models you deploy will run on. By default, MLServer supports the following frameworks out of the box: alibi-detect, alibi-explain, huggingface, lightgbm, mlflow, python, sklearn, spark-mlib, xgboost

    In this example, we will create a new custom MLServer that we will tag with income-classifier-deps under capabilities (see the Server docs) in order to define which Models will be matched to this Server. In this example, we will deploy both our model (sklearn) and our preprocessor (python) on the same Server. This is done using the manifest below:
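    A minimal sketch of such a custom Server resource follows; the resource name is illustrative and it assumes the built-in mlserver ServerConfig (depending on your version, extraCapabilities can be used instead to append to the server defaults):

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-custom
      namespace: seldon-mesh
    spec:
      serverConfig: mlserver
      capabilities:
      - income-classifier-deps
      replicas: 1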

    Step 2: Deploy Models

    Now we will deploy a model - in this case, we are deploying a classification model that has been trained to take 12 features as input, and output [0] or [1], representing a [Yes] or [No] prediction of whether or not an adult with certain values for the 12 features is making more than $50K/yr. This model was trained using the Census Income (or "Adult") Dataset. Extraction was done by Barry Becker from the 1994 Census database. See the dataset documentation for more details.

    The model artefact is currently stored in a Google bucket owned by Seldon - the contents of the relevant folder are below. Alongside our model artefact, we have a model-settings.json file to help locate and load the model. For more information on the inference artefacts we support and how to configure them, see the Inference Artifacts section.

    In our Model manifest below, we point to the location of the model artefact using the storageUri field. You will also notice that we have defined income-classifier-deps under requirements. This will match the Model to the Server we deployed above, as Models will only be deployed onto Servers that have capabilities that match the appropriate requirements defined in the Model manifest.
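    A sketch of such a Model manifest is shown below; the model name and storageUri are placeholders for the artifact location described above:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-classifier
      namespace: seldon-mesh
    spec:
      storageUri: "gs://<bucket-path-to-income-classifier-artifact>"  # placeholder for the bucket folder described above
      requirements:
      - income-classifier-deps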

    In order to deploy the model, we will apply the manifest to our cluster:

    We now have a deployed model, with an associated endpoint.

    Make Requests

    The endpoint that has been exposed by the above deployment will use an IP from our service mesh that we can obtain as follows:

    Requests are made using the Open Inference Protocol. More details on this specification can be found in our inference docs, or, for gRPC usage, in the API documentation generated by our protocol buffers. This protocol is also shared by Triton Inference Server for serving Deep Learning models.

    We are now ready to send a request!

    We can see above that the model returned a 'data': [0] in the output. This is the prediction of the model, indicating that an individual with the attributes provided is most likely making more than $50K/yr.

    Step 3: Create and Deploy a 2-step Pipeline

    Often we'd like to deploy AI applications that are more complex than just an individual model. For example, around our model we could consider deploying pre or post-processing steps, custom logic, other ML models, or drift and outlier detectors.

    Deploy a Preprocessing step written in Python

    In this example, we will create a preprocessing step that extracts numerical values from a text file for the model to use as input. This will be implemented with custom logic using Python, and deployed as a custom model with MLServer:

    Before deploying the preprocessing step with Core 2, we will test it locally:

    Now that we've tested the python script locally, we will deploy the preprocessing step as a Model. This will allow us to connect it to our sklearn model using a Seldon Pipeline. To do so, we store in our cloud storage an inference artefact (in this case, our Python script) alongside a model-settings.json file, similar to the model deployed above.

    As with the ML model deployed above, we have defined income-classifier-deps under requirements. This means that both the preprocessor and the model will be deployed using the same Server, enabling consolidation of the resources and overheads used (for more about Multi-Model Serving, see the relevant docs).

    We've now deployed the preprocessing step! Let's test it out by calling the endpoint for it:

    Create and Deploy a Pipeline connecting our deployed Models

    Now that we have our preprocessor and model deployed, we will chain them together with a pipeline.

    The yaml defines two steps in a pipeline (the preprocessor and model), mapping the outputs of the preprocessor model (OUTPUT0) to the input of the income classification model (INPUT0). Seldon Core will leverage Kafka to communicate between models, meaning that all data is streamed and observable in real time.
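    A sketch of such a Pipeline manifest is shown below; the step names are illustrative, while the pipeline name matches the income-classifier-app used in the request header described next:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: income-classifier-app
      namespace: seldon-mesh
    spec:
      steps:
        - name: preprocessor            # illustrative name for the Python preprocessing Model
        - name: income-classifier       # illustrative name for the sklearn Model
          inputs:
          - preprocessor
          tensorMap:
            preprocessor.outputs.OUTPUT0: INPUT0
      output:
        steps:
        - income-classifier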

    You will notice that sending a request to the pipeline is achieved by defining income-classifier-app.pipeline as the value for Seldon-Model in the headers of the request.

    Congratulations! You have now deployed a Seldon Pipeline that exposes an endpoint for your ML application 🥳. For more tutorials on how to use Core 2 for various use-cases and requirements, see the examples section.

    Clean Up

    Explainer examples

    Learn how to implement model explainability in Seldon Core using Anchor explainers for both tabular and text data

    Explainer Examples

    Anchor Tabular Explainer for SKLearn Income Model

    kubectl apply -f ./models/income.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/income created
    kubectl wait --for condition=ready --timeout=300s model income -n ${NAMESPACE}
    model.mlops.seldon.io/income condition met
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/income/infer' \
    	--header 'Content-Type: application/json'  \
        --data '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
    {
    	"model_name": "income_1",
    	"model_version": "1",
    	"id": "c65b8302-85af-4bac-aac5-91e3bedebee8",
    	"parameters": {},
    	"outputs": [
    
    

    Anchor Text Explainer for SKLearn Movies Sentiment Model

    Inference

    When looking at optimizing the latency or throughput of deployed models or pipelines, it is important to consider different approaches to the execution of inference workloads. Below are some tips on different approaches that may be relevant depending on the requirements of a given use-case:

    • gRPC may be more efficient than REST when your inference request payload benefits from a binary serialization format.

    • Grouping multiple real-time requests into small batches can improve throughput while maintaining acceptable latency. For more information on adaptive batching in MLServer, see here.

    • Reducing input size by reducing the dimensions of your inputs can speed up processing. This also reduces (de)serialization overhead that might be needed around model deployments.

    • For models deployed with MLServer, adjust parallel_workers in line with the number of CPU cores assigned to the Server pod. This is most effective for synchronous models, CPU-bound asynchronous models, and I/O-bound asynchronous models with high throughput. Proper tuning here can improve throughput, stabilize the latency distribution, and potentially reduce overall latency due to reduced queuing. This is outlined in more detail in the section below.

    Configuring Parallel Processing

    When deploying models using MLServer, it is possible to execute inference workloads via a pool of workers running in separate processes (see Parallel Inference in the MLServer docs).

    To assess the throughput behavior of individual model(s) it is first helpful to identify the maximum throughput possible with one worker (one-worker max throughput) and then the maximum throughput possible with N workers (n_workers max throughput). It is important to note that the n_workers max throughput is not simply n_workers × one-worker max throughput because workers run in separate processes, and the OS can only run as many processes in parallel as there are available CPUs. If all workers are CPU-bound, then setting n_workers higher than the number of CPU cores will be ineffective, as the OS will be limited by the number of CPU cores in terms of processes available to parallelize.

    However, if some workers are waiting for either I/O or for a GPU, then setting n_workers to a value higher than the number of CPUs can help increase throughput, as some workers can then continue processing while others wait. Generally, if a model is receiving inference requests at a throughput that is lower than the one-worker max throughput, then adding additional workers will not help increase throughput or decrease latency. Similarly, if MLServer is configured with n_workers (on a pod with more than n_workers CPUs) and the request rate is below the n_workers max throughput, latency remains constant - the system is below saturation.

    Given the above, it is worth considering increasing the number of workers available to process data for a given deployed model when the system becomes saturated. Increasing the number of workers up to or slightly above the number of CPU cores available may reduce latency when the system is saturated, provided the MLServer pod has sufficient spare CPU. The effect of increasing workers also depends on whether the model is CPU-bound or uses async versus blocking operations, where CPU-bound and blocking models would benefit most. When the system is saturated with requests, those requests will queue. Increasing workers aims to run enough tasks in parallel to cope with higher throughput while minimizing queuing.
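    As one illustrative way to apply this, MLServer's parallel_workers setting can be raised via its MLSERVER_PARALLEL_WORKERS environment variable on the server pod; the podSpec override below is a sketch and assumes your Server resource version supports overriding the mlserver container:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-parallel
    spec:
      serverConfig: mlserver
      podSpec:
        containers:
        - name: mlserver
          env:
          - name: MLSERVER_PARALLEL_WORKERS   # maps to MLServer's parallel_workers setting
            value: "4"
          resources:
            requests:
              cpu: "4"                        # keep workers roughly in line with allocated CPU cores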

    Optimizing the model artefact

    If optimizing for speed, the model artefact itself can have a big impact on performance. The speed at which an ML model can return results given an input depends on the model's architecture, model size, the precision of the model's weights, and the input size. To reduce the inherent complexity of the data processing required to execute an inference, it is worth considering:

    • Model pruning to reduce parameters that may be unimportant. This can help reduce model size without having a big impact on the quality of the model’s outputs.

    • Quantization to reduce the computational and memory overheads of running inference by using model weights and activations with lower precision data types.

    • Dimensionality reduction of inputs to reduce the complexity of computation.

    • Efficient model architectures such as MobileNet, EfficientNet, or DistilBERT, which are designed for faster inference with minimal accuracy loss.

    • Optimized model formats and runtimes like ONNX Runtime, TensorRT, or OpenVINO, which leverage hardware-specific acceleration for improved performance.

    Infrastructure Setup

    CPUs vs. GPUs

    Overall performance of your models will be constrained by the specifications of the underlying hardware which it is run on, and how it is leveraged. Choosing between CPUs and GPUs depends on the latency and throughput requirements for your use-case, as well as the type of model you are putting into production:

    • CPUs are generally sufficient for lightweight models, such as tree-based models, regression models, or small neural networks.

    • GPUs are recommended for deep learning models, large-scale neural networks, and large language models (LLMs) where lower latency is critical. Models with high matrix computation demands (like transformers or CNNs) benefit significantly from GPU acceleration.

    If cost is a concern, it is recommended to start with CPUs and use profiling or performance monitoring tools (e.g. py-spy or scalene) to identify CPU bottlenecks. Based on these results, you can transition to GPUs as needed.

    Setting the right specifications for CPUs

    If working with models that will receive many concurrent requests in production, individual CPUs can often act as bottlenecks when processing data. In these cases, increasing the number of parallel workers can help. This can be configured through your serving solution as described in the Configuring Parallel Processing section. It is important to note that when increasing the number of workers available to process concurrent requests, it is best practice to ensure the number of workers is not significantly higher than the number of available CPU cores, in order to reduce contention. Each worker executes in its own process. This is most relevant for synchronous models where subsequent processing is blocked on completion of each request.

    For more advanced configuration of CPU utilization, users can configure thread affinity through environment variables which determine how threads are bound to physical processors. For example, KMP_AFFINITY and OMP_NUM_THREADS are variables that can be set for technologies that use OpenMP. For more information on thread affinity, see the documentation for your OpenMP runtime. In general, the ML framework that you're using might have its own recommendations for improving resource usage.
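    For example, here is a sketch of a container env section for an OpenMP-based runtime (the values are illustrative and should match the CPU allocation of the container):

    # Illustrative container snippet; values depend on your framework and CPU allocation
    env:
    - name: OMP_NUM_THREADS
      value: "4"                                # number of OpenMP threads, typically the allocated cores
    - name: KMP_AFFINITY
      value: "granularity=fine,compact,1,0"     # example Intel OpenMP thread-affinity setting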

    Finally, increasing the RAM available for your models can improve performance for memory intensive models, such as models with large parameter sizes, ones that require high-dimensional data processing, or involve complex intermediate computations.

    Models

    Tritonclient examples

    Note: The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended. For more details, see the Seldon CLI documentation.

    To install tritonclient

    Local Metrics

    Learn how to test and validate metrics collection in Seldon Core locally, including Prometheus setup and Grafana dashboards.

    Run these examples from the samples folder at the root of the repo.

    This notebook tests the exposed Prometheus metrics of model and pipeline servers.

    Requires: the prometheus_client and requests libraries. See the docs for the full set of metrics available.

    MLServer Model

    Autoscaling Models

    Learn how to leverage Core 2's native autoscaling functionality for Models

    In order to set up autoscaling, users should first identify which metric they want to scale their models on. Seldon Core provides an approach to autoscale models based on Inference Lag, or supports more custom scaling logic by leveraging the Horizontal Pod Autoscaler (HPA), whereby you can use custom metrics to automatically scale Kubernetes resources. This page will go through the first approach. Inference Lag refers to the difference between incoming and outgoing requests in a given period of time. If choosing this approach, it is recommended to also configure autoscaling for Servers, so that Models scale on Inference Lag and Servers in turn scale based on model needs.

    This implementation of autoscaling is enabled if Core 2 is installed with the autoscaling.autoscalingModelEnabled helm value set to true (default is false) and at least one of MinReplicas or MaxReplicas is set in the Model Custom Resource. Then, according to lag (how much the model "falls behind" in terms of serving inference requests), the system will scale the number of replicas within this range.

    Operational Metrics

    Learn how to monitor operational metrics in Seldon Core, including model performance, pipeline health, and system resource usage.

    While the system runs, Prometheus collects metrics that enable you to observe various aspects of Seldon Core 2, including throughput, latency, memory, and CPU usage. In addition to the standard Kubernetes metrics scraped by Prometheus, a Grafana dashboard provides a comprehensive system overview.

    List of Seldon Core 2 metrics

    The list of Seldon Core 2 metrics being compiled is as follows.

    For the agent that sits next to the inference servers:

    For the pipeline gateway that handles requests to pipelines:

    Many of these metrics are model and pipeline level counters and gauges. Some of these metrics are aggregated to speed up the display of graphs. Currently, per-model histogram metrics are not stored for performance reasons. However, per-pipeline histogram metrics are stored.

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: chain
      namespace: seldon-mesh
    spec:
      steps:
        - name: model1
        - name: model2
          inputs:
          - model1
      output:
        steps:
        - model2
    <stepName>|<pipelineName>.<inputs|outputs>.<tensorName>
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: join
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
        - name: tfsimple3
          inputs:
          - tfsimple1.outputs.OUTPUT0
          - tfsimple2.outputs.OUTPUT1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple2.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple3
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-conditional
    spec:
      steps:
      - name: conditional
      - name: mul10
        inputs:
        - conditional.outputs.OUTPUT0
        tensorMap:
          conditional.outputs.OUTPUT0: INPUT
      - name: add10
        inputs:
        - conditional.outputs.OUTPUT1
        tensorMap:
          conditional.outputs.OUTPUT1: INPUT
      output:
        steps:
        - mul10
        - add10
        stepsJoin: any
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: error
    spec:
      steps:
        - name: outlier-error
      output:
        steps:
        - outlier-error
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: joincheck
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
        - name: check
          inputs:
          - tfsimple1.outputs.OUTPUT0
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT
        - name: tfsimple3
          inputs:
          - tfsimple1.outputs.OUTPUT0
          - tfsimple2.outputs.OUTPUT1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple2.outputs.OUTPUT1: INPUT1
          triggers:
          - check.outputs.OUTPUT
      output:
        steps:
        - tfsimple3
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: trigger-joins
    spec:
      steps:
      - name: mul10
        inputs:
        - trigger-joins.inputs.INPUT
        triggers:
        - trigger-joins.inputs.ok1
        - trigger-joins.inputs.ok2
        triggersJoinType: any
      - name: add10
        inputs:
        - trigger-joins.inputs.INPUT
        triggers:
        - trigger-joins.inputs.ok3
      output:
        steps:
        - mul10
        - add10
        stepsJoin: any
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended
    spec:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    # samples/models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
    fullnameOverride: seldon-monitoring
    kube-state-metrics:
      extraArgs:
        metric-labels-allowlist: pods=[*]
    echo "Prometheus URL: http://127.0.0.1:9090/"
    kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
    echo "Alertmanager URL: http://127.0.0.1:9093/"
    kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
    CUSTOM_RBAC=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2/prometheus/rbac
    
    kubectl apply -f ${CUSTOM_RBAC}/cr.yaml
    PODMONITOR_RESOURCE_LOCATION=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2.8.2/prometheus/monitors
    
    kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/agent-podmonitor.yaml
    kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/envoy-servicemonitor.yaml
    kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/pipelinegateway-podmonitor.yaml
    kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/server-podmonitor.yaml
    cat ./models/income.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
      requirements:
      - sklearn
    

    | Type | Notes | Custom Model Settings |
    |---|---|---|
    | LightGBM | Save model to file with extension .bst. | docs |
    | MLFlow | Use the created artifacts/model folder from your training run. | docs |
    | ONNX | Save your model with the name model.onnx. | docs |
    | OpenVino | Follow the Triton docs to create your model artifacts. | docs |
    | Custom MLServer Python | Create a python file with a class that extends MLModel. | docs |
    | Custom Triton Python | Follow the Triton docs to create your config.pbtxt and associated python files. | docs |
    | PyTorch | Create a Triton config.pbtxt describing inputs and outputs and place traced torchscript in the folder as model.pt. | docs |
    | SKLearn | Save model via joblib to a file with extension .joblib, or with pickle to a file with extension .pkl or .pickle. | docs |
    | Spark Mlib | Follow the MLServer docs. | docs |
    | Tensorflow | Save model in "Saved Model" format as model.savedmodel. If using graphdef format you will need to create a Triton config.pbtxt and place your model in a numbered sub folder. HDF5 is not supported. | docs |
    | TensorRT | Follow the Triton docs to create your model artifacts. | docs |
    | Triton FIL | Follow the Triton docs to create your model artifacts. | docs |
    | XGBoost | Save model to file with extension .bst or .json. | docs |


    		{
    			"name": "predict",
    			"shape": [1, 1],
    			"datatype": "INT64",
    			"data": [0]
    		}
    	]
    }

    This is experimental, and these metrics are expected to evolve to better capture relevant trends as more information becomes available about system usage.

    Local Metrics Examples

    An example to show raw metrics that Prometheus will scrape.

    // scheduler/pkg/metrics/agent.go
    const (
    	// Histograms do not include pipeline label for efficiency
    	modelHistogramName = "seldon_model_infer_api_seconds"
    	// We use base infer counters to store core metrics per pipeline
    	modelInferCounterName                 = "seldon_model_infer_total"
    	modelInferLatencyCounterName          = "seldon_model_infer_seconds_total"
    	modelAggregateInferCounterName        = "seldon_model_aggregate_infer_total"
    	modelAggregateInferLatencyCounterName = "seldon_model_aggregate_infer_seconds_total"
    )
    
    // Agent metrics
    const (
    	cacheEvictCounterName                              = "seldon_cache_evict_count"
    	cacheMissCounterName                               = "seldon_cache_miss_count"
    	loadModelCounterName                               = "seldon_load_model_counter"
    	unloadModelCounterName                             = "seldon_unload_model_counter"
    	loadedModelGaugeName                               = "seldon_loaded_model_gauge"
    	loadedModelMemoryGaugeName                         = "seldon_loaded_model_memory_bytes_gauge"
    	evictedModelMemoryGaugeName                        = "seldon_evicted_model_memory_bytes_gauge"
    	serverReplicaMemoryCapacityGaugeName               = "seldon_server_replica_memory_capacity_bytes_gauge"
    	serverReplicaMemoryCapacityWithOverCommitGaugeName = "seldon_server_replica_memory_capacity_overcommit_bytes_gauge"
    )
    Grafana dashboard
    Tritonclient Examples with Seldon Core 2
    • Note: for compatibility of Tritonclient check this issue.

    With MLServer

    • Note: binary data support in HTTP is blocked by this issue

    Deploy Model and Pipeline

    HTTP Transport Protocol

    GRPC Transport Protocol

    Against Model

    Against Pipeline

    With Tritonserver

    • Note: binary data support in HTTP is blocked by https://github.com/SeldonIO/seldon-core-v2/issues/475

    Deploy Model and Pipeline

    HTTP Transport Protocol

    GRPC Transport Protocol

    Cleanup

    Seldon CLI
    Triton Model

    Load the model.

    Pipeline

    mlserver_metrics_host="0.0.0.0:9006"
    triton_metrics_host="0.0.0.0:9007"
    pipeline_metrics_host="0.0.0.0:9009"
    seldon model load -f ./models/sklearn-iris-gs.yaml
    seldon model status iris -w ModelAvailable | jq -M .
    {}
    {}
    As an example, the following model will be deployed at first with 1 replica and will autoscale according to lag.
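    A sketch of such a model, based on the iris sample used elsewhere in these docs (the minReplicas/maxReplicas values are illustrative):

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
      requirements:
      - sklearn
      replicas: 1        # initial number of replicas
      minReplicas: 1     # lower bound for lag-based autoscaling
      maxReplicas: 3     # upper bound for lag-based autoscaling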

    When the system autoscales, the initial model spec is not changed (e.g. the number of replicas) and therefore the user cannot reset the number of replicas back to the initial specified value without an explicit change to a different value first. If only replicas is specified by the user, autoscaling of models is disabled and the system will have exactly the number of replicas of this model deployed regardless of inference lag.

    The scale-up and scale-down logic, and its configurability, is described below:

    • Scale Up: To trigger scale up with the approach described above, we use Inference Lag as the metric. Inference Lag is the difference between incoming and outgoing requests in a given time period. If the lag crosses a threshold, then we trigger a model scale up event. This threshold can be defined via the SELDON_MODEL_INFERENCE_LAG_THRESHOLD inference server environment variable (a sketch of where these variables are set follows this list). The threshold applies to all the models hosted on the Server where it is configured.

    • Scale Down: When using Model autoscaling that is managed by Seldon Core, model scale down events are triggered if a model has not been used for a number of seconds. This is defined in SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD inference server environment variable.

    • Rate of metrics calculation: Each agent checks the above stats periodically and if any model hits the corresponding threshold, then the agent sends an event to the scheduler to request model scaling. How often this process executes can be defined via SELDON_SCALING_STATS_PERIOD_SECONDS inference server environment variable.
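    As a sketch of where these environment variables live, they can be set on the inference server pods, for example through a podSpec override on a Server resource (assuming your version supports it; the threshold values are illustrative and the agent container name is an assumption):

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-autoscale
    spec:
      serverConfig: mlserver
      minReplicas: 1
      maxReplicas: 4
      podSpec:
        containers:
        - name: agent                                      # assumed container that reads these settings
          env:
          - name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD     # lag that triggers a model scale-up
            value: "30"
          - name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD  # idle period that triggers a model scale-down
            value: "600"
          - name: SELDON_SCALING_STATS_PERIOD_SECONDS      # how often the agent checks these stats
            value: "20"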

    Based on the logic above, the scheduler will trigger model autoscaling if:

    • The model is stable (no state change in the last 5 minutes) and available.

    • The desired number of replicas is within range. Note we always have at least 1 replica of any deployed model and we rely on over-commit to reduce the resources used further.

    • For scaling up the model when autoscaling of the Servers is not set up, the scale-up is triggered only if there are sufficient server replicas that can load the new model replicas.

    If autoscaling models with the approach above, it is recommended to autoscale Servers using Seldon's Server autoscaling (configured by setting MinReplicas and MaxReplicas in the Server CR - see below). Without Server autoscaling configured, the required number of servers will not necessarily spin up, even if the desired number of model replicas cannot currently be fulfilled by the provisioned number of servers. Setting up Server autoscaling is described in more detail below.

    helm upgrade --install prometheus kube-prometheus \
     --version 9.5.12 \
     --namespace seldon-monitoring \
     --values prometheus-values.yaml \
     --repo https://charts.bitnami.com/bitnami
    WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
      - alertmanager.resources
      - blackboxExporter.resources
      - operator.resources
      - prometheus.resources
      - prometheus.thanos.resources
    +info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
    
    kubectl rollout status -n seldon-monitoring deployment/seldon-monitoring-operator
    Waiting for deployment "seldon-monitoring-operator" rollout to finish: 0 of 1 updated replicas are available...
    deployment "seldon-monitoring-operator" successfully rolled out
    podmonitor.monitoring.coreos.com/agent created
    servicemonitor.monitoring.coreos.com/envoy created
    podmonitor.monitoring.coreos.com/pipelinegateway created
    podmonitor.monitoring.coreos.com/server created
    seldon model unload mnist-pytorch
    seldon model load -f ./models/income.yaml
    {}
    seldon model status income -w ModelAvailable
    {}
    seldon model infer income \
      '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
    {
    	"model_name": "income_1",
    	"model_version": "1",
    	"id": "c65b8302-85af-4bac-aac5-91e3bedebee8",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"data": [
    				0
    			]
    		}
    	]
    }
    
    cat ./models/income-explainer.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-explainer
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
      explainer:
        type: anchor_tabular
        modelRef: income
    
    kubectl apply -f ./models/income-explainer.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/income-explainer created
    kubectl wait --for condition=ready --timeout=300s model income-explainer -n ${NAMESPACE}
    model.mlops.seldon.io/income-explainer condition met
curl --location "http://${SELDON_INFER_HOST}/v2/models/income-explainer/infer" \
	--header 'Content-Type: application/json'  \
    --data '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
    {
    	"model_name": "income-explainer_1",
    	"model_version": "1",
    	"id": "a22c3785-ff3b-4504-9b3c-199aa48a62d6",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "explanation",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "BYTES",
    			"parameters": {
    				"content_type": "str"
    			},
    			"data": [
    				"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.0\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9518716577540107, \"coverage\": 0.07165109034267912, \"raw\": {\"feature\": [3, 5], \"mean\": [0.7959381044487428, 0.9518716577540107], \"precision\": [0.7959381044487428, 0.9518716577540107], \"coverage\": [0.3037383177570093, 0.07165109034267912], \"examples\": [{\"covered_true\": [[52, 5, 5, 1, 8, 1, 2, 0, 0, 0, 50, 9], [49, 4, 1, 1, 4, 4, 1, 0, 0, 0, 40, 1], [23, 4, 1, 1, 6, 1, 4, 1, 0, 0, 40, 9], [55, 2, 1, 1, 5, 1, 4, 0, 0, 0, 48, 9], [22, 4, 1, 1, 2, 3, 4, 0, 0, 0, 15, 9], [51, 4, 2, 1, 5, 0, 1, 1, 0, 0, 99, 4], [40, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [40, 6, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [50, 5, 5, 1, 6, 0, 4, 1, 0, 0, 55, 9], [41, 4, 1, 1, 6, 0, 4, 1, 0, 0, 40, 9]], \"covered_false\": [[42, 4, 1, 1, 8, 0, 4, 1, 0, 2415, 60, 9], [48, 6, 2, 1, 5, 4, 4, 0, 0, 0, 60, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [57, 4, 5, 1, 8, 0, 4, 1, 0, 0, 50, 9], [63, 7, 2, 1, 8, 0, 4, 1, 0, 1902, 50, 9], [51, 4, 5, 1, 8, 0, 4, 1, 0, 1887, 47, 9], [51, 2, 2, 1, 8, 1, 4, 0, 0, 0, 45, 9], [68, 7, 5, 1, 5, 0, 4, 1, 0, 2377, 42, 0], [45, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 40, 9], [45, 4, 1, 1, 8, 0, 4, 1, 0, 1977, 60, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[44, 6, 5, 1, 8, 3, 4, 0, 0, 1902, 60, 9], [58, 7, 2, 1, 5, 3, 1, 1, 4064, 0, 40, 1], [50, 7, 1, 1, 1, 3, 2, 0, 0, 0, 37, 9], [34, 4, 2, 1, 5, 3, 4, 1, 0, 0, 45, 9], [45, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [33, 7, 5, 1, 5, 3, 1, 1, 0, 0, 30, 6], [61, 7, 2, 1, 5, 3, 4, 1, 0, 0, 40, 0], [35, 4, 5, 1, 1, 3, 4, 1, 0, 0, 40, 9], [71, 2, 1, 1, 5, 3, 4, 0, 0, 0, 6, 9], [44, 4, 1, 1, 8, 3, 2, 1, 0, 0, 35, 9]], \"covered_false\": [[30, 4, 5, 1, 5, 3, 4, 1, 10520, 0, 40, 9], [54, 7, 2, 1, 8, 3, 4, 1, 0, 1902, 50, 9], [66, 6, 2, 1, 6, 3, 4, 1, 0, 2377, 25, 9], [35, 4, 2, 1, 5, 3, 4, 1, 7298, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7298, 0, 48, 9], [31, 4, 1, 1, 8, 3, 4, 0, 13550, 0, 50, 9], [35, 4, 1, 1, 8, 3, 4, 1, 8614, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
    			]
    		}
    	]
    }
    
    kubectl delete -f ./models/income-explainer.yaml -n ${NAMESPACE}
    kubectl delete -f ./models/income.yaml -n ${NAMESPACE}
    
    seldon model load -f ./models/income-explainer.yaml
    {}
    
    seldon model status income-explainer -w ModelAvailable
    {}
    
    seldon model infer income-explainer \
      '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
    {
    	"model_name": "income-explainer_1",
    	"model_version": "1",
    	"id": "a22c3785-ff3b-4504-9b3c-199aa48a62d6",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "explanation",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "BYTES",
    			"parameters": {
    				"content_type": "str"
    			},
    			"data": [
    				"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.0\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9518716577540107, \"coverage\": 0.07165109034267912, \"raw\": {\"feature\": [3, 5], \"mean\": [0.7959381044487428, 0.9518716577540107], \"precision\": [0.7959381044487428, 0.9518716577540107], \"coverage\": [0.3037383177570093, 0.07165109034267912], \"examples\": [{\"covered_true\": [[52, 5, 5, 1, 8, 1, 2, 0, 0, 0, 50, 9], [49, 4, 1, 1, 4, 4, 1, 0, 0, 0, 40, 1], [23, 4, 1, 1, 6, 1, 4, 1, 0, 0, 40, 9], [55, 2, 1, 1, 5, 1, 4, 0, 0, 0, 48, 9], [22, 4, 1, 1, 2, 3, 4, 0, 0, 0, 15, 9], [51, 4, 2, 1, 5, 0, 1, 1, 0, 0, 99, 4], [40, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [40, 6, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [50, 5, 5, 1, 6, 0, 4, 1, 0, 0, 55, 9], [41, 4, 1, 1, 6, 0, 4, 1, 0, 0, 40, 9]], \"covered_false\": [[42, 4, 1, 1, 8, 0, 4, 1, 0, 2415, 60, 9], [48, 6, 2, 1, 5, 4, 4, 0, 0, 0, 60, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [57, 4, 5, 1, 8, 0, 4, 1, 0, 0, 50, 9], [63, 7, 2, 1, 8, 0, 4, 1, 0, 1902, 50, 9], [51, 4, 5, 1, 8, 0, 4, 1, 0, 1887, 47, 9], [51, 2, 2, 1, 8, 1, 4, 0, 0, 0, 45, 9], [68, 7, 5, 1, 5, 0, 4, 1, 0, 2377, 42, 0], [45, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 40, 9], [45, 4, 1, 1, 8, 0, 4, 1, 0, 1977, 60, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[44, 6, 5, 1, 8, 3, 4, 0, 0, 1902, 60, 9], [58, 7, 2, 1, 5, 3, 1, 1, 4064, 0, 40, 1], [50, 7, 1, 1, 1, 3, 2, 0, 0, 0, 37, 9], [34, 4, 2, 1, 5, 3, 4, 1, 0, 0, 45, 9], [45, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [33, 7, 5, 1, 5, 3, 1, 1, 0, 0, 30, 6], [61, 7, 2, 1, 5, 3, 4, 1, 0, 0, 40, 0], [35, 4, 5, 1, 1, 3, 4, 1, 0, 0, 40, 9], [71, 2, 1, 1, 5, 3, 4, 0, 0, 0, 6, 9], [44, 4, 1, 1, 8, 3, 2, 1, 0, 0, 35, 9]], \"covered_false\": [[30, 4, 5, 1, 5, 3, 4, 1, 10520, 0, 40, 9], [54, 7, 2, 1, 8, 3, 4, 1, 0, 1902, 50, 9], [66, 6, 2, 1, 6, 3, 4, 1, 0, 2377, 25, 9], [35, 4, 2, 1, 5, 3, 4, 1, 7298, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7298, 0, 48, 9], [31, 4, 1, 1, 8, 3, 4, 0, 13550, 0, 50, 9], [35, 4, 1, 1, 8, 3, 4, 1, 8614, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
    			]
    		}
    	]
    }
    
    seldon model unload income-explainer
    {}
    
    seldon model unload income
    {}
    
    cat ./models/moviesentiment.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: sentiment
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/moviesentiment-sklearn"
      requirements:
      - sklearn
    
    kubectl apply -f ./models/moviesentiment.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/sentiment created
    kubectl wait --for condition=ready --timeout=300s model sentiment -n ${NAMESPACE}
    model.mlops.seldon.io/sentiment condition met
curl --location "http://${SELDON_INFER_HOST}/v2/models/sentiment/infer" \
	--header 'Content-Type: application/json'  \
    --data '{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'
    {
    	"model_name": "sentiment_2",
    	"model_version": "1",
    	"id": "f5c07363-7e9d-4f09-aa30-228c81fdf4a4",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				0
    			]
    		}
    	]
    }
    
    seldon model load -f ./models/moviesentiment.yaml
    {}
    
    seldon model status sentiment -w ModelAvailable
    {}
    
    seldon model infer sentiment \
      '{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'
    {
    	"model_name": "sentiment_2",
    	"model_version": "1",
    	"id": "f5c07363-7e9d-4f09-aa30-228c81fdf4a4",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				0
    			]
    		}
    	]
    }
    
    cat ./models/moviesentiment-explainer.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: sentiment-explainer
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/moviesentiment-sklearn-explainer"
      explainer:
        type: anchor_text
        modelRef: sentiment
    
    kubectl apply -f ./models/moviesentiment-explainer.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/sentiment-explainer created
    kubectl wait --for condition=ready --timeout=300s model sentiment-explainer -n ${NAMESPACE}
    model.mlops.seldon.io/sentiment-explainer condition met
curl --location "http://${SELDON_INFER_HOST}/v2/models/sentiment-explainer/infer" \
	--header 'Content-Type: application/json'  \
    --data '{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'
    seldon model load -f ./models/moviesentiment-explainer.yaml
    {}
    
    seldon model status sentiment-explainer -w ModelAvailable
    {}
    
    seldon model infer sentiment-explainer \
      '{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'
    Error: V2 server error: 500 Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
        await self.app(scope, receive, _send)
      File "/opt/conda/lib/python3.8/site-packages/starlette_exporter/middleware.py", line 307, in __call__
        await self.app(scope, receive, wrapped_send)
      File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/gzip.py", line 24, in __call__
        await responder(scope, receive, send)
      File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/gzip.py", line 44, in __call__
        await self.app(scope, receive, self.send_with_gzip)
      File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
        raise exc
      File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
        await self.app(scope, receive, sender)
      File "/opt/conda/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
        raise e
      File "/opt/conda/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
        await self.app(scope, receive, send)
      File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 706, in __call__
        await route.handle(scope, receive, send)
      File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
        await self.app(scope, receive, send)
      File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
        response = await func(request)
      File "/opt/conda/lib/python3.8/site-packages/mlserver/rest/app.py", line 42, in custom_route_handler
        return await original_route_handler(request)
      File "/opt/conda/lib/python3.8/site-packages/fastapi/routing.py", line 237, in app
        raw_response = await run_endpoint_function(
      File "/opt/conda/lib/python3.8/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
        return await dependant.call(**values)
      File "/opt/conda/lib/python3.8/site-packages/mlserver/rest/endpoints.py", line 99, in infer
        inference_response = await self._data_plane.infer(
      File "/opt/conda/lib/python3.8/site-packages/mlserver/handlers/dataplane.py", line 103, in infer
        prediction = await model.predict(payload)
      File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/runtime.py", line 86, in predict
        output_data = await self._async_explain_impl(input_data, payload.parameters)
      File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/runtime.py", line 119, in _async_explain_impl
        explanation = await loop.run_in_executor(self._executor, explain_call)
      File "/opt/conda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/explainers/black_box_runtime.py", line 62, in _explain_impl
        input_data = input_data[0]
    KeyError: 0
    
    kubectl delete -f ./models/moviesentiment-explainer.yaml -n ${NAMESPACE}
    kubectl delete -f ./models/moviesentiment.yaml -n ${NAMESPACE}
    seldon model unload sentiment-explainer
    {}
    seldon model unload sentiment
    {}
    // scheduler/pkg/metrics/gateway.go
    // The aggregate metrics exist for efficiency, as the summation can be
    // very slow in Prometheus when many pipelines exist.
    const (
	// Histograms do not include the model label, for efficiency
    	pipelineHistogramName = "seldon_pipeline_infer_api_seconds"
    	// We use base infer counters to store core metrics per pipeline
    	pipelineInferCounterName                 = "seldon_pipeline_infer_total"
    	pipelineInferLatencyCounterName          = "seldon_pipeline_infer_seconds_total"
    	pipelineAggregateInferCounterName        = "seldon_pipeline_aggregate_infer_total"
    	pipelineAggregateInferLatencyCounterName = "seldon_pipeline_aggregate_infer_seconds_total"
    )
    pip install tritonclient[all]
    import os
    os.environ["NAMESPACE"] = "seldon-mesh"
    MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    MESH_IP=MESH_IP[0]
    import os
    os.environ['MESH_IP'] = MESH_IP
    MESH_IP
    '172.19.255.1'
    
    cat models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    cat pipelines/iris.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: iris-pipeline
    spec:
      steps:
        - name: iris
      output:
        steps:
        - iris
    
    kubectl apply -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
    kubectl apply -f pipelines/iris.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/iris created
    pipeline.mlops.seldon.io/iris-pipeline created
    
    kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s pipelines iris-pipeline -n ${NAMESPACE}
    model.mlops.seldon.io/iris condition met
    pipeline.mlops.seldon.io/iris-pipeline condition met
    
    import tritonclient.http as httpclient
    import numpy as np
    
    http_triton_client = httpclient.InferenceServerClient(
        url=f"{MESH_IP}:80",
        verbose=False,
    )
    
    print("model ready:", http_triton_client.is_model_ready("iris"))
    print("model metadata:", http_triton_client.get_model_metadata("iris"))
    model ready: True
    model metadata: {'name': 'iris_1', 'versions': [], 'platform': '', 'inputs': [], 'outputs': [], 'parameters': {}}
    
    # Against model
    
    binary_data = False
    
    inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
    inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"), binary_data=binary_data)
    
    outputs = [httpclient.InferRequestedOutput("predict", binary_data=binary_data)]
    
    result = http_triton_client.infer("iris", inputs, outputs=outputs)
    result.as_numpy("predict")
    array([[2]])
    
    # Against pipeline
    
    binary_data = False
    
    inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
    inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"), binary_data=binary_data)
    
    outputs = [httpclient.InferRequestedOutput("predict", binary_data=binary_data)]
    
    result = http_triton_client.infer("iris-pipeline.pipeline", inputs, outputs=outputs)
    result.as_numpy("predict")
    array([[2]])
    
    import tritonclient.grpc as grpcclient
    import numpy as np
    
    grpc_triton_client = grpcclient.InferenceServerClient(
        url=f"{MESH_IP}:80",
        verbose=False,
    )
    model_name = "iris"
    headers = {"seldon-model": model_name}
    
    print("model ready:", grpc_triton_client.is_model_ready(model_name, headers=headers))
    print(grpc_triton_client.get_model_metadata(model_name, headers=headers))
    model ready: True
    name: "iris_1"
    
    model_name = "iris"
    headers = {"seldon-model": model_name}
    
    inputs = [
        grpcclient.InferInput("predict", (1, 4), "FP64"),
    ]
    inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"))
    
    outputs = [grpcclient.InferRequestedOutput("predict")]
    
    result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
    result.as_numpy("predict")
    array([[2]])
    
    model_name = "iris-pipeline.pipeline"
    headers = {"seldon-model": model_name}
    
    inputs = [
        grpcclient.InferInput("predict", (1, 4), "FP64"),
    ]
    inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"))
    
    outputs = [grpcclient.InferRequestedOutput("predict")]
    
    result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
    result.as_numpy("predict")
    array([[2]])
    
    cat models/tfsimple1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    cat pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    kubectl apply -f models/tfsimple1.yaml -n ${NAMESPACE}
    kubectl apply -f pipelines/tfsimple.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/tfsimple1 created
    pipeline.mlops.seldon.io/tfsimple created
    
    kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s pipelines tfsimple -n ${NAMESPACE}
    model.mlops.seldon.io/tfsimple1 condition met
    pipeline.mlops.seldon.io/tfsimple condition met
    
    import tritonclient.http as httpclient
    import numpy as np
    
    http_triton_client = httpclient.InferenceServerClient(
        url=f"{MESH_IP}:80",
        verbose=False,
    )
    
    print("model ready:", http_triton_client.is_model_ready("iris"))
    print("model metadata:", http_triton_client.get_model_metadata("iris"))
    model ready: True
    model metadata: {'name': 'iris_1', 'versions': [], 'platform': '', 'inputs': [], 'outputs': [], 'parameters': {}}
    
    # Against model (no binary data)
    
    binary_data = False
    
    inputs = [
        httpclient.InferInput("INPUT0", (1, 16), "INT32"),
        httpclient.InferInput("INPUT1", (1, 16), "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
    inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
    
    outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
    
    result = http_triton_client.infer("tfsimple1", inputs, outputs=outputs)
    result.as_numpy("OUTPUT0")
    array([[ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
          dtype=int32)
    
    # Against model (with binary data)
    
    binary_data = True
    
    inputs = [
        httpclient.InferInput("INPUT0", (1, 16), "INT32"),
        httpclient.InferInput("INPUT1", (1, 16), "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
    inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
    
    outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
    
    result = http_triton_client.infer("tfsimple1", inputs, outputs=outputs)
    result.as_numpy("OUTPUT0")
    array([[ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
          dtype=int32)
    
    # Against Pipeline (no binary data)
    
    binary_data = False
    
    inputs = [
        httpclient.InferInput("INPUT0", (1, 16), "INT32"),
        httpclient.InferInput("INPUT1", (1, 16), "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
    inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
    
    outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
    
    result = http_triton_client.infer("tfsimple.pipeline", inputs, outputs=outputs)
    result.as_numpy("OUTPUT0")
    array([[ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
          dtype=int32)
    
    ## binary data does not work with http behind pipeline
    
    # import numpy as np
    
    # binary_data = True
    
    # inputs = [
    #     httpclient.InferInput("INPUT0", (1, 16), "INT32"),
    #     httpclient.InferInput("INPUT1", (1, 16), "INT32"),
    # ]
    # inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
    # inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
    
    # outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
    
    # result = http_triton_client.infer("tfsimple.pipeline", inputs, outputs=outputs)
    # result.as_numpy("OUTPUT0")
    import tritonclient.grpc as grpcclient
    import numpy as np
    
    grpc_triton_client = grpcclient.InferenceServerClient(
        url=f"{MESH_IP}:80",
        verbose=False,
    )
    model_name = "tfsimple1"
    headers = {"seldon-model": model_name}
    
    print("model ready:", grpc_triton_client.is_model_ready(model_name, headers=headers))
    print(grpc_triton_client.get_model_metadata(model_name, headers=headers))
    model ready: True
    name: "tfsimple1_1"
    versions: "1"
    platform: "tensorflow_graphdef"
    inputs {
      name: "INPUT0"
      datatype: "INT32"
      shape: -1
      shape: 16
    }
    inputs {
      name: "INPUT1"
      datatype: "INT32"
      shape: -1
      shape: 16
    }
    outputs {
      name: "OUTPUT0"
      datatype: "INT32"
      shape: -1
      shape: 16
    }
    outputs {
      name: "OUTPUT1"
      datatype: "INT32"
      shape: -1
      shape: 16
    }
    
    # Against Model
    
    model_name = "tfsimple1"
    headers = {"seldon-model": model_name}
    
    inputs = [
        grpcclient.InferInput("INPUT0", (1, 16), "INT32"),
        grpcclient.InferInput("INPUT1", (1, 16), "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
    inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
    
    outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]
    
    result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
    result.as_numpy("OUTPUT0")
    array([[ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
          dtype=int32)
    
    # Against Pipeline
    
    model_name = "tfsimple.pipeline"
    headers = {"seldon-model": model_name}
    
    inputs = [
        grpcclient.InferInput("INPUT0", (1, 16), "INT32"),
        grpcclient.InferInput("INPUT1", (1, 16), "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
    inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
    
    outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]
    
    result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
    result.as_numpy("OUTPUT0")
    array([[ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
          dtype=int32)
    
    kubectl delete -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
    kubectl delete -f pipelines/iris.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "iris" deleted
    pipeline.mlops.seldon.io "iris-pipeline" deleted
    
    kubectl delete -f models/tfsimple1.yaml -n ${NAMESPACE}
    kubectl delete -f pipelines/tfsimple.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "tfsimple1" deleted
    pipeline.mlops.seldon.io "tfsimple" deleted
    
    from prometheus_client.parser import text_string_to_metric_families
    import requests
    
    def scrape_metrics(host):
        data = requests.get(f"http://{host}/metrics").text
        return {
            family.name: family for family in text_string_to_metric_families(data)
        }
    
    def print_sample(family, label, value):
        for sample in family.samples:
            if sample.labels[label] == value:
                print(sample)
    
    def get_model_infer_count(host, model_name):
        metrics = scrape_metrics(host)
        family = metrics["seldon_model_infer"]
        print_sample(family, "model", model_name)
    
    def get_pipeline_infer_count(host, pipeline_name):
        metrics = scrape_metrics(host)
        family = metrics["seldon_pipeline_infer"]
        print_sample(family, "pipeline", pipeline_name)
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris --inference-mode grpc -i 100 \
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'
    Success: map[:iris_1::100]
    
    get_model_infer_count(mlserver_metrics_host,"iris")
    Sample(name='seldon_model_infer_total', labels={'code': '200', 'method_type': 'rest', 'model': 'iris', 'model_internal': 'iris_1', 'server': 'mlserver', 'server_replica': '0'}, value=50.0, timestamp=None, exemplar=None)
    Sample(name='seldon_model_infer_total', labels={'code': 'OK', 'method_type': 'grpc', 'model': 'iris', 'model_internal': 'iris_1', 'server': 'mlserver', 'server_replica': '0'}, value=100.0, timestamp=None, exemplar=None)
    seldon model unload iris
    {}
    
    seldon model load -f ./models/tfsimple1.yaml
    seldon model status tfsimple1 -w ModelAvailable | jq -M .
    {}
    {}
    
    seldon model infer tfsimple1 -i 50\
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    Success: map[:tfsimple1_1::50]
    
    seldon model infer tfsimple1 --inference-mode grpc -i 100 \
        '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'
    Success: map[:tfsimple1_1::100]
    
    get_model_infer_count(triton_metrics_host,"tfsimple1")
    Sample(name='seldon_model_infer_total', labels={'code': '200', 'method_type': 'rest', 'model': 'tfsimple1', 'model_internal': 'tfsimple1_1', 'server': 'triton', 'server_replica': '0'}, value=50.0, timestamp=None, exemplar=None)
    Sample(name='seldon_model_infer_total', labels={'code': 'OK', 'method_type': 'grpc', 'model': 'tfsimple1', 'model_internal': 'tfsimple1_1', 'server': 'triton', 'server_replica': '0'}, value=100.0, timestamp=None, exemplar=None)
    
    seldon model unload tfsimple1
    {}
    
    seldon model load -f ./models/tfsimple1.yaml
    seldon model load -f ./models/tfsimple2.yaml
    seldon model status tfsimple1 -w ModelAvailable | jq -M .
    seldon model status tfsimple2 -w ModelAvailable | jq -M .
    seldon pipeline load -f ./pipelines/tfsimples.yaml
    seldon pipeline status tfsimples -w PipelineReady
    {}
    {}
    {}
    {}
    {}
    {"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"cdqji39qa12c739ab3o0", "version":2, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":2, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2022-11-16T19:25:01.255955114Z"}}]}
    seldon pipeline infer tfsimples -i 50 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    Success: map[:tfsimple1_1::50 :tfsimple2_1::50 :tfsimples.pipeline::50]
    get_pipeline_infer_count(pipeline_metrics_host,"tfsimples")
    Sample(name='seldon_pipeline_infer_total', labels={'code': '200', 'method_type': 'rest', 'pipeline': 'tfsimples', 'server': 'pipeline-gateway'}, value=50.0, timestamp=None, exemplar=None)
    seldon model unload tfsimple1
    seldon model unload tfsimple2
    seldon pipeline unload tfsimples
    {}
    {}
    {}
    # samples/models/tfsimple_scaling.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
      minReplicas: 1
      replicas: 1

    Self-hosted Kafka

    Learn how to set up a self-hosted Kafka cluster for Seldon Core in development and learning environments.

You can run Kafka in the same Kubernetes cluster that hosts Seldon Core 2. Seldon recommends using the Strimzi operator for Kafka installation and maintenance. For more details about configuring Kafka with Seldon Core 2, see the Configuration section.

Note: These instructions help you quickly set up a Kafka cluster. For a production-grade installation, consult the Strimzi documentation or use one of the managed Kafka solutions.

    Integrating self-hosted Kafka with Seldon Core 2 includes these steps:

    Installing Kafka in a Kubernetes cluster

    Strimzi provides a Kubernetes Operator to deploy and manage Kafka clusters. First, we need to install the Strimzi Operator in your Kubernetes cluster.

    1. Create a namespace where you want to install Kafka. For example the namespace seldon-mesh:
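For example, to create the seldon-mesh namespace:

kubectl create namespace seldon-mesh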

    2. Install Strimzi.

    3. Install Strimzi Operator.
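A typical way to carry out steps 2 and 3 is via the Strimzi Helm chart (repository URL and chart name are taken from the Strimzi project; pin a chart version for reproducibility):

# Step 2: add the Strimzi Helm chart repository
helm repo add strimzi https://strimzi.io/charts/
helm repo update

# Step 3: install the Strimzi operator in the seldon-mesh namespace
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator --namespace seldon-mesh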

This deploys the Strimzi Operator in the seldon-mesh namespace. After the Strimzi Operator is running, you can create a Kafka cluster by applying a Kafka custom resource.

    1. Apply the Kafka cluster configuration.
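The kafka.yaml manifest applied here can be sketched as follows. The cluster name seldon matches the name checked in the troubleshooting steps below; the Kafka version, listener, and replication settings are illustrative and should be adjusted for your environment:

# kafka.yaml (sketch)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: seldon
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 3.8.0
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      default.replication.factor: 1
      min.insync.replicas: 1
  entityOperator:
    topicOperator: {}
    userOperator: {}

kubectl apply -f kafka.yaml -n seldon-mesh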

    2. Create a YAML file named kafka-nodepool.yaml to create a nodepool for the kafka cluster.
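A sketch of kafka-nodepool.yaml; the strimzi.io/cluster label must match the Kafka cluster name seldon, and the replica count and storage settings are illustrative:

# kafka-nodepool.yaml (sketch)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: seldon
  labels:
    strimzi.io/cluster: seldon
spec:
  replicas: 3
  roles:
    - controller
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 10Gi
        deleteClaim: false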

    1. Apply the Kafka node pool configuration.
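For example:

kubectl apply -f kafka-nodepool.yaml -n seldon-mesh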

    2. Check the status of the Kafka Pods to ensure they are running properly:
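For example:

kubectl get pods -n seldon-mesh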

    Note: It might take a couple of minutes for all the Pods to be ready. To check the status of the Pods in real time use this command: kubectl get pods -w -n seldon-mesh.

    You should see multiple Pods for Kafka, and Strimzi operators running.

    Troubleshooting

    Error The Pod that begins with the name seldon-dataflow-engine does not show the status as Running.

    One of the possible reasons could be that the DNS resolution for the service failed.

    Solution

    1. Check the logs of the Pod <seldon-dataflow-engine>:
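For example, substituting the full Pod name:

kubectl logs <seldon-dataflow-engine-pod-name> -n seldon-mesh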

    2. In the output check if a message reads:

    3. Verify the name in the metadata for the kafka.yaml and kafka-nodepool.yaml. It should read seldon

    Configuring Seldon Core 2

    When the SeldonRuntime is installed in a namespace a ConfigMap is created with the settings for Kafka configuration. Update the ConfigMap only if you need to customize the configurations.

1. Verify that a ConfigMap named seldon-kafka is created in the namespace seldon-mesh:
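For example:

kubectl get configmaps -n seldon-mesh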

      You should have the ConfigMaps for Kafka, Zookeeper, Strimzi operators, and others.

2. View the configuration of the ConfigMap named seldon-kafka.
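For example:

kubectl get configmap seldon-kafka -n seldon-mesh -o yaml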

      You should see an output similar to this:

After you have integrated Seldon Core 2 with Kafka, you need to set up an ingress controller, which adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.

    Customizing the settings (optional)

    To customize the settings you can add and modify the Kafka configuration using Helm, for example to add compression for producers.

    1. Create a YAML file to specify the compression configuration for Seldon Core 2 runtime. For example, create the values-runtime-kafka-compression.yaml file. Use your preferred text editor to create and save the file with the following content:
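A sketch of values-runtime-kafka-compression.yaml; the exact value paths depend on the version of the Seldon Core 2 runtime chart you are using, so treat the nesting below as an assumption and check the chart's default values:

# values-runtime-kafka-compression.yaml (sketch)
kafka:
  producer:
    compression.type: gzip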

2. Change to the directory that contains the values-runtime-kafka-compression.yaml file and then install the Seldon Core 2 runtime in the namespace seldon-mesh.
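For example (the seldon-charts repository alias and the seldon-core-v2-runtime chart name are assumptions; use the chart source configured in your environment):

helm upgrade --install seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
  -f values-runtime-kafka-compression.yaml \
  --namespace seldon-mesh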

    Configuring topic and consumer isolation (optional)

    If you are using a shared Kafka cluster with other applications, it is advisable to isolate topic names and consumer group IDs from other cluster users to prevent naming conflicts. This can be achieved by configuring the following two settings:

    • topicPrefix: set a prefix for all topics

    • consumerGroupIdPrefix: set a prefix for all consumer groups

    Here's an example of how to configure topic name and consumer group ID isolation during a Helm installation for an application named myorg:
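The following is a sketch of such an installation; the chart and repository names, and the assumption that these settings are nested under the kafka section of the values, should be checked against your chart version:

helm upgrade --install seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh \
  --set kafka.topicPrefix=myorg \
  --set kafka.consumerGroupIdPrefix=myorg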

    Next Steps

After you have installed Seldon Core 2 and Kafka using Helm, complete the remaining setup steps, such as configuring an ingress controller.

    Model Performance Metrics

    Learn how to run performance tests for Seldon Core 2 deployments, including load testing, benchmarking, and analyzing inference latency and throughput metrics.

    This section describes how a user can run performance tests to understand the limits of a particular SCv2 deployment.

The base directory is tests/k6.

    Driver

k6 is used to drive requests for load, unload, and infer workloads. It is recommended to run the load test within the same cluster that has SCv2 installed, as it requires internal access to some services that are not automatically exposed to the outside world. Furthermore, having the driver within the same cluster minimises link latency to the SCv2 entrypoint, so the measured inference latencies are more representative of the actual overhead of the system.

    Tests

    • Envoy Tests synchronous inference requests via envoy

    To run: make deploy-envoy-test

    • Agent Tests inference requests direct to a specific agent, defaults to triton-0 or mlserver-0

To run: make deploy-rproxy-test or make deploy-rproxy-mlserver-test

    • Server Tests inference requests direct to a specific server (bypassing agent), defaults to triton-0 or mlserver-0

To run: make deploy-server-test or make deploy-server-mlserver-test

• Pipeline gateway (HTTP-Kafka gateway) Tests inference requests to a one-node pipeline via HTTP and gRPC

    To run: make deploy-kpipeline-test

    • Model gateway (Kafka-HTTP gateway) Tests inference requests to a model via kafka

To run: make deploy-kmodel-test

    Results

One way to view results is to check the log of the Pod that executed the Kubernetes Job.

Results can also be persisted to a Google Cloud Storage bucket; a service account key named k6-sa-key in the same namespace is required.

Users can also look at the metrics exposed in Prometheus while the test is underway.

    Building k6 image

If you are modifying the actual scenario of the test:

    • export DOCKERHUB_USERNAME=mydockerhubaccount

    • build the k6 image via make build-push

• in the same shell environment, deployed jobs will use this custom-built Docker image

    Modifying tests

    Users can modify settings of the tests in tests/k6/configs/k8s/base/k6.yaml. This will apply to all subsequent tests that are deployed using the above process.

    Settings

Some settings that can be changed:

    • k6 args

For a full list, check the k6 documentation on command-line arguments.

    • Environment variables

  • for MODEL_TYPE, choose from: tfsimple, pytorch_cifar10, tfmnist, mlflow_wine, iris

    Pipeline Config

    Learn how to create and manage ML pipelines in Seldon Core using Kubernetes custom resources, including model chaining and tensor mapping.

    Pipelines allow one to connect flows of inference data transformed by Model components. A directed acyclic graph (DAG) of steps can be defined to join Models together. Each Model will need to be capable of receiving a V2 inference request and respond with a V2 inference response. An example Pipeline is shown below:
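A sketch consistent with the description that follows (the pipeline name is hypothetical; the step and tensor names match the tfsimple models used elsewhere in this documentation):

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimples-join
spec:
  steps:
    - name: tfsimple1
    - name: tfsimple2
    - name: tfsimple3
      inputs:
      - tfsimple1.outputs.OUTPUT0
      - tfsimple2.outputs.OUTPUT1
      tensorMap:
        tfsimple1.outputs.OUTPUT0: INPUT0
        tfsimple2.outputs.OUTPUT1: INPUT1
  output:
    steps:
    - tfsimple3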

    The steps list shows three models: tfsimple1, tfsimple2 and tfsimple3. These three models each take two tensors called INPUT0 and INPUT1 of integers. The models produce two outputs OUTPUT0 (the sum of the inputs) and OUTPUT1 (subtraction of the second input from the first).

    tfsimple1 and tfsimple2 take as inputs the input to the Pipeline: the default assumption when no explicit inputs are defined. tfsimple3 takes one V2 tensor input from each of the outputs of tfsimple1 and tfsimple2. As the outputs of tfsimple1 and tfsimple2 have tensors named OUTPUT0 and OUTPUT1 their names need to be changed to respect the expected input tensors and this is done with a tensorMap component providing this tensor renaming. This is only required if your models can not be directly chained together.

    The output of the Pipeline is the output from the tfsimple3 model.

    Support for Cyclic Pipelines

Seldon Core 2 supports cyclic pipelines, enabling the creation of feedback loops within the inference graph. However, cyclic pipelines should be used carefully, as incorrect configurations can lead to unintended behavior.

One risk is joining messages from different iterations (i.e., a message from iteration t might be joined with messages from t-1, t-2, ..., 1). If a feedback message re-enters the pipeline within the join window and reaches a step already holding messages from a previous iteration, Kafka Streams may join messages across iterations. This can trigger unintended message propagation. For more details on how Kafka Streams handles joins and the implications for feedback loops, refer to the Kafka Streams documentation on join semantics.

    Seldon Core 2 provides a maxStepRevisits parameter in the pipeline manifest. This parameter limits the number of times a step can be revisited within a single pipeline execution. If the limit is reached, the pipeline execution will terminate, returning an error. This feature is useful in cyclic pipelines, where infinite loops might occur (e.g., in agentic workflows where control flow is determined by an LLM). It helps safeguard against unintended infinite loops. By default, the maxStepRevisits is set to 0 (i.e., no cycles), but you can adjust it according to your use case.

    To enable a cyclic pipeline, set the allowCycles flag in your pipeline manifest:
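A sketch showing where the flags sit in the manifest (the pipeline and step names are hypothetical, and the step wiring for an actual loop is omitted); the placement of both fields under spec is assumed:

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: counter
spec:
  allowCycles: true      # opt in to cyclic execution
  maxStepRevisits: 20    # fail the request if any step is revisited more than 20 times
  steps:
    - name: increment
  output:
    steps:
    - increment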

    Detailed Specification

    The full GoLang specification for a Pipeline is shown below:

    Ingress Controller

    Learn how to configure Istio as an ingress controller for Seldon Core, including traffic management and security policies.

    An ingress controller functions as a reverse proxy and load balancer, implementing a Kubernetes Ingress. It adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.

    Seldon Core 2 works seamlessly with any service mesh or ingress controller, offering flexibility in your deployment setup. This guide provides detailed instructions for installing and configuring Istio with Seldon Core 2.

    Istio

Istio implements the Kubernetes ingress resource to expose a service and make it accessible from outside the cluster. You can install Istio in either a self-hosted Kubernetes cluster or a managed Kubernetes service provided by a cloud provider that is running Seldon Core 2.

    Prerequisites

• Install Seldon Core 2.

• Ensure that you install a version of Istio that is compatible with your Kubernetes cluster version. For detailed information on supported versions, refer to the Istio compatibility matrix.

    Installing Istio ingress controller

    Installing Istio ingress controller in a Kubernetes cluster running Seldon Core 2 involves these tasks:

    Install Istio

    1. Add the Istio Helm charts repository and update it:

    2. Create the istio-system namespace where Istio components are installed:

    3. Install the base component:

    4. Install Istiod, the Istio control plane:
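A typical command sequence for steps 1-4, following the upstream Istio Helm installation guide (repository URL and chart names as published by the Istio project; adjust versions to your cluster):

# 1. Add the Istio Helm charts repository and update it
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

# 2. Create the istio-system namespace
kubectl create namespace istio-system

# 3. Install the base chart (CRDs and cluster-wide resources)
helm install istio-base istio/base -n istio-system

# 4. Install Istiod, the Istio control plane
helm install istiod istio/istiod -n istio-system --wait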

    Install Istio Ingress Gateway

    1. Install Istio Ingress Gateway:

    2. Verify that Istio Ingress Gateway is installed:

      This should return details of the Istio Ingress Gateway, including the external IP address.

    3. Verify that all Istio Pods are running:

      The output is similar to:
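Typical commands for the three steps above (the istio/gateway chart and the istio-ingressgateway release name are assumptions based on the upstream Istio Helm charts; the gateway service takes the name of the release):

# 1. Install the Istio ingress gateway
helm install istio-ingressgateway istio/gateway -n istio-system

# 2. Verify the gateway service and note its external IP address
kubectl get svc istio-ingressgateway -n istio-system

# 3. Verify that all Istio Pods are running
kubectl get pods -n istio-system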

    Expose Seldon mesh service

It is important to expose the seldon-mesh service to enable communication between deployed machine learning models and external clients or services. The Seldon Core 2 inference API is exposed through the seldon-mesh service in the seldon-mesh namespace. If you install Core 2 in multiple namespaces, you need to expose the seldon-mesh service in each of them.

1. Verify that the seldon-mesh service is running, for example in the namespace seldon-mesh.
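For example:

kubectl get svc seldon-mesh -n seldon-mesh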

      When the services are running you should see something similar to this:

    2. Create a YAML file to create a VirtualService named iris-route in the namespace seldon-mesh. For example, create the seldon-mesh-vs.yaml file. Use your preferred text editor to create and save the file with the following content:
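The following is a sketch of seldon-mesh-vs.yaml; the gateway reference and the catch-all host are assumptions, so adjust them to the Gateway configured for your cluster:

# seldon-mesh-vs.yaml (sketch)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: iris-route
  namespace: seldon-mesh
spec:
  hosts:
  - "*"
  gateways:
  - istio-system/seldon-gateway   # hypothetical Gateway; use the gateway configured in your cluster
  http:
  - route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
        port:
          number: 80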

    Next Steps

    Additional Resources

    Pipelines

    This section covers various aspects of optimizing pipeline performance in Seldon Core 2, from testing methodologies to Core 2 configuration. Each subsection provides detailed guidance on different aspects of pipeline performance tuning:

    Testing Pipelines

    Learn how to effectively test and optimize pipeline performance:

    • Understanding pipeline latency components

    • Identifying and reducing bottlenecks

    • Balancing model replicas and resources

Core 2 Configuration

Explore how to configure Core 2 components for optimal pipeline performance:

    • Understanding Core 2 data processing components

    • Optimizing Kafka integration

    • Configuring Core 2 services

Scalability of Pipelines

Understand how Core 2 components scale with the number of deployed pipelines and models:

    • Dynamic scaling of dataflow engine, model gateway, and pipeline gateway

    • Loading and unloading of models and pipelines

    • Assignment of pipelines and models to replicas

    Each of these aspects plays a crucial role in achieving optimal pipeline performance. We recommend starting with testing individual models in your pipeline, then using those insights to inform your Core 2 configuration and overall pipeline optimization strategies.

    https://github.com/SeldonIO/seldon-core/blob/v2/samples/experiments/addmul10.yaml
    https://github.com/SeldonIO/seldon-core/blob/v2/samples/experiments/ab.yaml

    Cyclic Pipeline

    Learn how to deploy a cyclic pipeline using Core 2. In this example, you'll build a simple counter that begins at a user-defined starting value and increments by one until it reaches 10. If the starting value is already greater than 10, the pipeline terminates immediately without running.

    Before you begin

1. Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.

    Tracing

    This guide walks you through setting up Jaeger Tracing for Seldon Core v2 on Kubernetes. By the end of this guide, you will be able to visualize inference traces through your Core 2 components.

    Prerequisites

    • Set up and connect to a Kubernetes cluster running version 1.27 or later. For instructions on connecting to your Kubernetes cluster, refer to the documentation provided by your cloud provider.

    Dataflow with Kafka

    Explore how Seldon Core 2 uses data flow paradigm and Kafka-based streaming to improve ML model deployment with better scalability, fault tolerance, and data observability.

Seldon Core 2 is designed around the data flow paradigm. Here we explain what that means and some of the rationale behind this choice.

    Seldon Core v1

The initial release of Seldon Core introduced the concept of an inference graph, which can be thought of as a sequence of operations applied to an inference request. Here is how it may look:

In reality, though, this is not how Seldon Core v1 is implemented. Instead, a Seldon deployment consists of a range of independent services that host models, transformations, detectors, and explainers, and a central orchestrator that knows the inference graph topology and makes service calls in the correct order, passing data between requests and responses as necessary. Here is how the picture looks under the hood:

    !kubectl get servers -n seldon-mesh
    NAME              READY   REPLICAS   LOADED MODELS   AGE
    mlserver          True    1          0               156d
    mlserver-custom   True    1          0               38d
    triton            True    1          0               156d
    !cat  ../../../samples/quickstart/servers/mlserver-custom.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-custom
    spec:
      serverConfig: mlserver
      capabilities:
      - income-classifier-deps
      podSpec:
        containers:
        - image: seldonio/mlserver:1.6.0
          name: mlserver
    !kubectl apply  -f ../../../samples/quickstart/servers/mlserver-custom.yaml -n seldon-mesh
    server.mlops.seldon.io/mlserver-custom unchanged
    !gcloud storage ls --recursive gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier
    gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/:
    gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model-settings.json
    gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model.joblib
    
    
    !gsutil cat gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model-settings.json 
    {
        "name": "income",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {
            "uri": "./model.joblib",
            "version": "v0.1.0"
        }
    }
    !cat ../../../samples/quickstart/models/sklearn-income-classifier.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-classifier
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier"
      requirements:
      - income-classifier-deps
      memory: 100Ki
    !kubectl apply -f ../../../samples/quickstart/models/sklearn-income-classifier.yaml -n seldon-mesh
    model.mlops.seldon.io/income-classifier created
    MESH_IP = !kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    MESH_IP = MESH_IP[0]
    MESH_IP
    '34.32.149.48'
    endpoint = f"http://{MESH_IP}/v2/models/income-classifier/infer"
    headers = {
        "Seldon-Model": "income-classifier",
    }
    inference_request = {
      "inputs": [
        {
          "name": "income",
          "datatype": "INT64",
          "shape": [1, 12],
          "data": [53, 4, 0, 2, 8, 4, 2, 0, 0, 0, 60, 9]
        }
      ]
    }
    import requests
    response = requests.post(endpoint, headers=headers, json=inference_request)
    response.json()
    {'model_name': 'income-classifier_1',
     'model_version': '1',
     'id': '626ebe8e-bc95-433f-8f5f-ef296625622a',
     'parameters': {},
     'outputs': [{'name': 'predict',
       'shape': [1, 1],
       'datatype': 'INT64',
       'parameters': {'content_type': 'np'},
       'data': [0]}]}
    import re
    import numpy as np
    
    # Extracts numerical values from a formatted text and outputs a vector of numerical values.
    def extract_numerical_values(input_text):
    
        # Find key-value pairs in text
        pattern = r'"[^"]+":\s*"([^"]+)"'
        matches = re.findall(pattern, input_text)
        
        # Extract numerical values
        numerical_values = []
        for value in matches:
            cleaned_value = value.replace(",", "")
            if cleaned_value.isdigit():  # Integer
                numerical_values.append(int(cleaned_value))
            else:
                try:  
                    numerical_values.append(float(cleaned_value))
                except ValueError:
                    pass  
        
        # Return array of numerical values
        return np.array(numerical_values)
    input_text = '''
    "Age": "47",
    "Workclass": "4",
    "Education": "1",
    "Marital Status": "1",
    "Occupation": "1",
    "Relationship": "0",
    "Race": "4",
    "Sex": "1",
    "Capital Gain": "0",
    "Capital Loss": "0",
    "Hours per week": "68",
    "Country": "9",
    "Name": "John Doe"
    '''
    
    extract_numerical_values(input_text)
    array([47,  4,  1,  1,  1,  0,  4,  1,  0,  0, 68,  9])
    !gcloud storage ls --recursive gs://seldon-models/scv2/samples/preprocessor
    gs://seldon-models/scv2/samples/preprocessor/:
    gs://seldon-models/scv2/samples/preprocessor/model-settings.json
    gs://seldon-models/scv2/samples/preprocessor/model.py
    gs://seldon-models/scv2/samples/preprocessor/preprocessor.yaml
    !cat ../../../samples/quickstart/models/preprocessor/preprocessor.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
        name: preprocessor
    spec:
        storageUri: "gs://seldon-models/scv2/samples/preprocessor"
        requirements:
        - income-classifier-deps
    !kubectl apply -f ../../../samples/quickstart/models/preprocessor/preprocessor.yaml -n seldon-mesh
    model.mlops.seldon.io/preprocessor created
    endpoint_pp = f"http://{MESH_IP}/v2/models/preprocessor/infer"
    headers_pp = {
        "Seldon-Model": "preprocessor",
        }
    text_inference_request = {
        "inputs": [
            {
                "name": "text_input", 
                "shape": [1], 
                "datatype": "BYTES", 
                "data": [input_text]
            }
        ]
    }
    import requests
    response = requests.post(endpoint_pp, headers=headers_pp, json=text_inference_request)
    response.json()
    {'model_name': 'preprocessor_1',
     'model_version': '1',
     'id': 'b26e49d5-2a4c-488b-8dff-0df850fbed3d',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 12],
       'datatype': 'INT64',
       'parameters': {'content_type': 'np'},
       'data': [47, 4, 1, 1, 1, 0, 4, 1, 0, 0, 68, 9]}]}
    !cat ../../../samples/quickstart/pipelines/income-classifier-app.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: income-classifier-app
    spec:
      steps:
        - name: preprocessor
        - name: income-classifier
          inputs:
          - preprocessor
      output:
        steps:
        - income-classifier
    !kubectl apply -f ../../../samples/quickstart/pipelines/income-classifier-app.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/income-classifier-app created
    pipeline_endpoint = f"http://{MESH_IP}/v2/models/income-classifier-app/infer"
    pipeline_headers = {
        "Seldon-Model": "income-classifier-app.pipeline"
    }
    pipeline_response = requests.post(
        pipeline_endpoint, json=text_inference_request, headers=pipeline_headers
    )
    pipeline_response.json()
    {'model_name': '',
     'outputs': [{'data': [0],
       'name': 'predict',
       'shape': [1, 1],
       'datatype': 'INT64',
       'parameters': {'content_type': 'np'}}]}
    !kubectl delete -f ../../../samples/quickstart/pipelines/income-classifier-app.yaml -n seldon-mesh
    !kubectl delete -f ../../../samples/quickstart/models/preprocessor/preprocessor.yaml -n seldon-mesh
    !kubectl delete -f ../../../samples/quickstart/models/sklearn-income-classifier.yaml -n seldon-mesh
    !kubectl delete -f ../../../samples/quickstart/servers/mlserver-custom.yaml -n seldon-mesh
    pipeline.mlops.seldon.io "income-classifier-app" deleted
    model.mlops.seldon.io "preprocessor" deleted
    model.mlops.seldon.io "income-classifier" deleted
    server.mlops.seldon.io "mlserver-custom" deleted
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: addmul10
    spec:
      default: pipeline-add10
      resourceType: pipeline
      candidates:
      - name: pipeline-add10
        weight: 50
      - name: pipeline-mul10
        weight: 50
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-iris
    spec:
      candidates:
      - name: iris
        weight: 50
      - name: iris2
        weight: 50
    
    
    Find the IP address of the Seldon Core 2 instance running with Istio:
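For example (assuming the Istio ingress gateway service from the installation above):

kubectl get svc istio-ingressgateway -n istio-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'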

    Make a note of the IP address that is displayed in the output. This is the IP address that you require to test the installation.

    Create a virtual service to expose the seldon-mesh service.

    When the virtual service is created, you should see this:

      # tests/k6/configs/k8s/base/k6.yaml
      args: [
        "--no-teardown",
        "--summary-export",
        "results/base.json",
        "--out",
        "csv=results/base.gz",
        "-u",
        "5",
        "-i",
        "100000",
        "-d",
        "120m",
        "scenarios/infer_constant_vu.js",
        ]
      # # infer_constant_rate
      # args: [
      #   "--no-teardown",
      #   "--summary-export",
      #   "results/base.json",
      #   "--out",
      #   "csv=results/base.gz",
      #   "scenarios/infer_constant_rate.js",
      #   ]
      # # k8s-test-script
      # args: [
      #   "--summary-export",
      #   "results/base.json",
      #   "--out",
      #   "csv=results/base.gz",
      #   "scenarios/k8s-test-script.js",
      #   ]
      # # core2_qa_control_plane_ops
      # args: [
      #   "--no-teardown",
      #   "--verbose",
      #   "--summary-export",
      #   "results/base.json",
      #   "--out",
      #   "csv=results/base.gz",
      #   "-u",
      #   "5",
      #   "-i",
      #   "10000",
      #   "-d",
      #   "9h",
      #   "scenarios/core2_qa_control_plane_ops.js",
      #   ]
      - name: INFER_HTTP_ITERATIONS
        value: "1"
      - name: INFER_GRPC_ITERATIONS
        value: "1"
      - name: MODELNAME_PREFIX
        value: "tfsimplea,pytorch-cifar10a,tfmnista,mlflow-winea,irisa"
      - name: MODEL_TYPE
        value: "tfsimple,pytorch_cifar10,tfmnist,mlflow_wine,iris"
      # Specify MODEL_MEMORY_BYTES using unit-of-measure suffixes (k, M, G, T)
      # rather than plain numbers without units. If you supply "naked numbers",
      # the Seldon operator will convert the number for you, but it will also
      # take ownership of the field (as FieldManager), so the next time you run
      # the scenario, creating/updating the model CR will fail.
      - name: MODEL_MEMORY_BYTES
        value: "400k,8M,43M,200k,3M"
      - name: MAX_MEM_UPDATE_FRACTION
        value: "0.1"
      - name: MAX_NUM_MODELS
        value: "800,100,25,100,100"
        # value: "0,0,25,100,100"
      #
      # MAX_NUM_MODELS_HEADROOM is a variable used by control-plane tests.
      # It's the approximate number of models that can be created over
      # MAX_NUM_MODELS during the experiment. In the worst-case scenario
      # (very unlikely) the HEADROOM values may temporarily exceed the values
      # specified here by the number of VUs, because each VU checks the
      # headroom constraint independently before deciding on the available
      # operations (there is no communication/sync between VUs).
      # - name: MAX_NUM_MODELS_HEADROOM
      #   value: "20,5,0,20,30"
      #
      # MAX_MODEL_REPLICAS is used by control-plane tests. It controls the
      # maximum number of replicas that may be requested when
      # creating/updating models of a given type.
      # - name: MAX_MODEL_REPLICAS
      #   value: "2,2,0,2,2"
      #
      - name: INFER_BATCH_SIZE
        value: "1,1,1,1,1"
      # MODEL_CREATE_UPDATE_DELETE_BIAS defines the probability ratios between
      # the operations, for control-plane tests. For example, "1, 4, 3"
      # makes an Update four times more likely than a Create, and a Delete 3
      # times more likely than the Create.
      # - name: MODEL_CREATE_UPDATE_DELETE_BIAS
      #   value: "1,3,1"
      - name: WARMUP
        value: "false"
    // tests/k6/components/model.js
      import { dump as yamlDump } from "https://cdn.jsdelivr.net/npm/[email protected]/dist/js-yaml.mjs";
      import { getConfig } from '../components/settings.js'
    
      const tfsimple_string = "tfsimple_string"
      const tfsimple = "tfsimple"
      const iris = "iris"  // mlserver
      const pytorch_cifar10 = "pytorch_cifar10"
      const tfmnist = "tfmnist"
      const tfresnet152 = "tfresnet152"
      const onnx_gpt2 = "onnx_gpt2"
      const mlflow_wine = "mlflow_wine" // mlserver
      const add10 = "add10" // https://github.com/SeldonIO/triton-python-examples/tree/master/add10
      const sentiment = "sentiment" // mlserver
    kubectl apply -f seldon-mesh-vs.yaml
    virtualservice.networking.istio.io/iris-route created
    helm repo add istio https://istio-release.storage.googleapis.com/charts
    helm repo update
    kubectl create namespace istio-system
    helm install istio-base istio/base -n istio-system
    helm install istiod istio/istiod -n istio-system --wait
    helm install istio-ingressgateway istio/gateway -n istio-system
    kubectl get svc istio-ingressgateway -n istio-system
    kubectl get pods -n istio-system
    NAME                          READY   STATUS    RESTARTS   AGE
    istiod-xxxxxxx-xxxxx          1/1     Running   0          2m
    istio-ingressgateway-xxxxx    1/1     Running   0          2m
    kubectl get svc -n seldon-mesh
    mlserver-0               ClusterIP      None             <none>          9000/TCP,9500/TCP,9005/TCP                                                                  43m
    seldon-mesh              LoadBalancer   34.118.225.130   34.90.213.15    80:32228/TCP,9003:31265/TCP                                                                 45m
    seldon-pipelinegateway   ClusterIP      None             <none>          9010/TCP,9011/TCP                                                                           45m
    seldon-scheduler         LoadBalancer   34.118.225.138   35.204.34.162   9002:32099/TCP,9004:32100/TCP,9044:30342/TCP,9005:30473/TCP,9055:32732/TCP,9008:32716/TCP   45m
    triton-0                 ClusterIP      None             <none>          9000/TCP,9500/TCP,9005/TCP 
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: iris-route
      namespace: seldon-mesh
    spec:
      gateways:
        - istio-system/seldon-gateway
      hosts:
        - "*"
      http:
        - name: iris-http
          match:
            - uri:
                prefix: /v2
          route:
            - destination:
                host: seldon-mesh.seldon-mesh.svc.cluster.local
    ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
    echo "Seldon Core 2: http://$ISTIO_INGRESS"
    

    Create a YAML file to specify the initial configuration.

    Note: This configuration sets up a Kafka cluster with version 3.9.0. Ensure that you review the supported versions of Kafka and update the version in the kafka.yaml file as needed. For more configuration examples, see the strimzi-kafka-operator examples.

    Use your preferred text editor to create and save the file as kafka.yaml with the following content:

  • Check the name of the Kafka services in the namespace:

  • Restart the Pod:

  • Install Kafka
    Configure Seldon Core 2
    Install an Ingress Controller
    Installing a Service mesh

    Ensure that you are performing these steps in the directory where you have downloaded the samples.

  • Get the IP address of the Seldon Core 2 instance running with Istio:

  • Make a note of the IP address that is displayed in the output. Replace <INGRESS_IP> with your service mesh's ingress IP address in the following commands.

    Models

    Start by implementing the first model: a simple counter.

    This model produces two output tensors. The first contains the incremented number, while the second is an empty tensor labeled either continue or stop. This second tensor acts as a trigger, directing the data flow through either the feedback loop or the output path. For more information on triggering tensors, see the intro to pipelines page.

    Next, define the second model — an identity model:

    The identity model simply passes the input tensors through to the output while introducing a delay. This delay is crucial for preventing infinite loops in the pipeline, which can occur due to the join interval behavior in Kafka Streams. For further details, see Kafka documentation.

    Pipeline

    This counter application pipeline consists of three models: the counter model, an identity model for the feedback loop, and another identity model for the output. The structure of the pipeline is illustrated as follows:

    Models deployment

    To deploy the pipeline, you need to load each model into the cluster. The model-settings.json configuration for the counter model is as follows:

    For the identity feedback loop model, reuse the model-settings.json file and configure it to include a 1-millisecond delay:

    The one-millisecond delay is crucial to prevent infinite loops in the pipeline. It aligns with the join window applied to all input types for the counter model, as well as the join window configured for the identity model, which is specified in the pipeline definition.

    Similarly, for the identity output model, reuse the same model-settings.json file without introducing any delay.

    The manifest files for the three models are the following:

    To deploy the models, use the following command:

    Pipeline deployment

    After the models are deployed, proceed to deploy the pipeline. The pipeline manifest file is defined as follows:

    Note: The joinWindowMs parameter is set to 1 millisecond for both the identity loop and identity output models. This setting is essential to prevent messages from different iterations from being joined (e.g., a message from iteration t being joined with messages from iterations t-1, t-2, ..., 1). Additionally, we limit the number of step revisits to 100, which is the maximum number of times the pipeline can revisit a step during execution. While our pipeline behaves deterministically and is guaranteed to terminate, this parameter is especially useful in cyclic pipelines where a terminal state might not be reached (e.g., agentic workflows where control flow is determined by an LLM). It helps safeguard against infinite loops.

    To deploy the pipeline, use the following command:

    Testing the pipeline

    To send a request to the pipeline, use the following command:

    This request initiates the pipeline with an input value of 0. The pipeline increments this value step by step until it reaches 10, at which point it stops. The response includes the final counter value, 10, along with a message indicating that the pipeline has terminated.

    Cleanup

    To clean up the models and the pipeline, use the following commands:

    installed Seldon Core 2
    Install kubectl, the Kubernetes command-line tool.
  • Install Helm, the package manager for Kubernetes.

  • Install Seldon Core 2

  • Install cert-manager in the namespace cert-manager.

  • To set up Jaeger Tracing for Seldon Core 2 on Kubernetes and visualize inference traces of the Seldon Core 2 components, you need to do the following:

    1. Create a namespace

    2. Install Jaeger Operator

    3. Deploy a Jaeger instance

    4. Configure Core 2

    Create a namespace

    Create a dedicated namespace to install the Jaeger Operator and tracing resources:

    Install Jaeger Operator

    The Jaeger Operator manages Jaeger instances in the Kubernetes cluster. Use the Helm chart for Jaeger v2.

    1. Add the Jaeger Helm repository:

    2. Create a minimal tracing-values.yaml:

    3. Install or upgrade the Jaeger Operator in the tracing namespace:

    4. Validate that the Jaeger Operator Pod is running:

    Output is similar to:

    Deploy a minimal Jaeger instance

    Install a simple Jaeger custom resource in the namespace seldon-mesh, where Seldon Core 2 is running.

    This CR is suitable for local development, demos, and quick-start scenarios. It is not recommended for production because all components and trace data are ephemeral.

    1. Create a manifest file named jaeger-simplest.yaml with these contents:

    2. Apply the manifest:

    3. Verify that the Jaeger all-in-one pod is running:

    Output is similar to:

    This simplest Jaeger CR does the following:

    • All-in-one pod: Deploys a single pod running the collector, agent, query service, and UI, using in-memory storage.

    • Core 2 integration: receives spans from Seldon Core 2 components and exposes a UI for viewing traces.

    Configure Seldon Core 2

    To enable tracing, configure the OpenTelemetry exporter endpoint in the SeldonRuntime resource so that traces are sent to the Jaeger collector service created by the simplest Jaeger Custom Resource. The Seldon Runtime helm chart is located here.

    1. Find the SeldonRuntime Custom Resource that needs to be updated using: kubectl get seldonruntimes -n seldon-mesh

    2. Patch your Custom Resource to include tracingConfig under spec.config using:

    Output is similar to:

    3. Check the updated resource using: kubectl get seldonruntime seldon -n seldon-mesh -o yaml

    Output is similar to:

    4. Restart the following Core 2 component Pods so they pick up the new tracing configuration from the seldon-tracing ConfigMap in the seldon-mesh namespace.

    • seldon-dataflow-engine

    • seldon-pipeline-gateway

    • seldon-model-gateway

    • seldon-scheduler

    • Servers

    After the restart, these components read the updated tracing config and start emitting traces to Jaeger.
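
    A hedged sketch of how these restarts could be performed with kubectl is shown below; the Deployment and StatefulSet names are the defaults seen elsewhere in this guide, so verify them first with kubectl get deploy,sts -n seldon-mesh.

    # Restart Core 2 components so they pick up the new tracing ConfigMap
    kubectl rollout restart deployment/seldon-dataflow-engine -n seldon-mesh
    kubectl rollout restart deployment/seldon-pipelinegateway -n seldon-mesh
    kubectl rollout restart deployment/seldon-modelgateway -n seldon-mesh
    kubectl rollout restart statefulset/seldon-scheduler -n seldon-mesh
    # Server pods (e.g. mlserver, triton) run as StatefulSets as well
    kubectl rollout restart statefulset/mlserver statefulset/triton -n seldon-mesh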

    Generate traffic

    To visualize traces, send requests to your models or pipelines deployed in Seldon Core 2. Each inference request should produce a trace in the Jaeger UI that shows the path through Core 2 components such as the gateways, the dataflow engine, and the server agents.

    Access the Jaeger UI

    1. Port-forward the Jaeger query service to your local machine:

    2. Open the Jaeger UI in your browser:

    You can now explore traces emitted by Seldon Core 2 components.

    An example Jaeger trace is shown below:


    While this is a convenient way of implementing an evaluation graph with microservices, it has a few problems. The orchestrator becomes a bottleneck and a single point of failure. It also hides all the data transformations that need to happen to translate one service's response into another service's request. Data tracing and lineage become difficult. All in all, while the Seldon platform is all about processing data, the under-the-hood implementation was still focused on the order of operations and not on the data itself.

    Data flow

    The realisation of this disparity led to a new approach towards inference graph evaluation in v2, based on the data flow paradigm. Data flow is a well-known concept in software engineering, dating back to the 1960s. In contrast to services, which model programs as control flow and focus on the order of operations, data flow models software systems as a series of connections that modify incoming data, focusing on the data flowing through the system. The particular flavor of the data flow paradigm used by v2 is known as flow-based programming (FBP). FBP defines software applications as a set of processes which exchange data via connections that are external to those processes. Connections are made via named ports, which promotes data coupling between components of the system.

    Data flow design makes data in software the top priority. That is one of the key messages of the so-called "data-centric AI" idea, which is becoming increasingly popular within the ML community. Data is a key component of a successful ML project. Data needs to be discovered, described, cleaned, understood, monitored and verified. Consequently, there is a growing demand for data-centric platforms and solutions. Making Seldon Core data-centric was one of the key goals of the Seldon Core 2 design.

    Seldon Core 2

    In the context of Seldon Core, applying the FBP design approach means that the evaluation is implemented in the same shape as the inference graph itself. So instead of routing everything through a centralized orchestrator, the evaluation happens in the same graph-like manner:

    As far as implementation goes, Seldon Core 2 runs on Kafka. An inference request is put onto a pipeline input topic, which triggers an evaluation. Each part of the inference graph is a service running in its own container, fronted by a model gateway. The model gateway listens to a corresponding input Kafka topic, reads data from it, calls the service, and puts the received response onto an output Kafka topic. There is also a pipeline gateway that allows you to interact with Seldon Core in a synchronous manner.

    This approach gives SCv2 several important features. Firstly, Seldon Core natively supports both synchronous and asynchronous modes of operation. Asynchronicity is achieved via streaming: input data can be sent to an input topic in Kafka, and after the evaluation the output topic will contain the inference result. For those looking to use it in the v1 style, a service API is provided.
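
    As a rough illustrative sketch of the asynchronous mode (not a verbatim recipe): assuming kcat is available and that pipeline topics follow a "<topicPrefix>.<namespace>.pipeline.<pipeline-name>.inputs|outputs" naming pattern with the default prefix seldon (check the actual topic names in your cluster before relying on this), a request could be produced and its result consumed as follows. The pipeline name mypipeline is a placeholder; the synchronous path via the pipeline gateway is shown in the curl examples elsewhere in this document.

    BOOTSTRAP=seldon-kafka-bootstrap.seldon-mesh:9092

    # Produce an Open Inference Protocol payload onto the (assumed) pipeline input topic
    echo '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP32","data":[1,2,3,4]}]}' | \
      kcat -P -b "$BOOTSTRAP" -t seldon.seldon-mesh.pipeline.mypipeline.inputs

    # Consume the inference result from the (assumed) pipeline output topic
    kcat -C -b "$BOOTSTRAP" -t seldon.seldon-mesh.pipeline.mypipeline.outputs -o end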

    Secondly, there is no single point of failure. Even if one or more nodes in the graph go down, the data will still be sitting on the streams waiting to be processed, and the evaluation resumes whenever the failed node comes back up.

    Thirdly, data flow means intermediate data can be accessed at any arbitrary step of the graph, inspected and collected as necessary. Data lineage is possible throughout, which opens up opportunities for advanced monitoring and explainability use cases. This is a key feature for effective error surfacing in production environments as it allows:

    • Adding context from different parts of the graph to better understand a particular output

    • Reducing false positive rates of alerts as different slices of the data can be investigated

    • Enabling reproducibility of results, as fine-grained lineage of the computation and the associated data transformations is tracked by design

    Finally, the inference graph can now be extended by adding new nodes at arbitrary places, all without affecting the pipeline execution. This kind of flexibility was not possible with v1. It also allows multiple pipelines to share common nodes, thereby optimising resource usage.

    References

    More details and information on data-centric AI and data flow paradigm can be found in these resources:

    • Data-centric AI Resource Hub

    • Stanford MLSys seminar "What can Data-Centric AI Learn from Data and ML Engineering?"

    • A paper that explores data flow in ML deployment context

    • Introduction to flow-based programming from its creator, J.P. Morrison

    • Pathways: Asynchronous Distributed Dataflow for ML, a paper from Google on the design and implementation of a data flow based orchestration layer for accelerators

    • Better understanding of data requires tracking its history and context; see the Confluent blog post on this topic

    Managed Kafka

    Guide to integrating managed Kafka services (Confluent Cloud, Amazon MSK, Azure Event Hub) with Seldon Core 2, including security configurations and authentication setup.

    Seldon recommends a managed Kafka service for production installations. You can integrate your managed Kafka with Seldon Core 2 and secure the connection.

    Some of the Managed Kafka services that are tested are:

    • Confluent Cloud (security: SASL/PLAIN)

    • Confluent Cloud (security: SASL/OAUTHBEARER)

    Exposing Metrics for HPA

    Learn how to implement request-per-second (RPS) based autoscaling in Seldon Core 2 using Kubernetes HPA and Prometheus metrics.

    Given that Seldon Core 2 is predominantly used for serving ML in Kubernetes, it is possible to leverage the HorizontalPodAutoscaler (HPA) to define scaling logic that automatically scales Kubernetes resources up and down. This requires exposing metrics so that they can be used by HPA. In this tutorial, we explain how to expose a metric (requests per second) using Prometheus and the Prometheus Adapter, so that it can be used to autoscale Models or Servers using HPA.

    The following workflow will require:

    • Having a Seldon Core 2 installation that publishes metrics to Prometheus (the default). In the following, we assume that Prometheus is already installed and configured in the seldon-monitoring namespace.
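
    As a quick sanity check (assuming the monitoring stack lives in the seldon-monitoring namespace, as above), you can verify that Prometheus is up before proceeding:

    # List the monitoring pods; a running Prometheus pod should appear here
    kubectl get pods -n seldon-monitoring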

    Autoscaling Servers

    Learn how to leverage Core 2's native autoscaling functionality for Servers

    Core 2 runs with long-lived server replicas, each able to host multiple models (through Multi-Model Serving, or MMS). The server replicas can be autoscaled natively by Core 2 in response to dynamic changes in the requested number of model replicas, allowing users to seamlessly optimize the infrastructure cost associated with their deployments.

    This document outlines the autoscaling policies and mechanisms that are available for autoscaling server replicas. These policies are designed to ensure that the server replicas are scaled up or down in response to changes in the number of replicas requested for each model. In other words if a given model is scaled up, the system will scale up the server replicas in order to host the new model replicas. Similarly, if a given model is scaled down, the system may scale down the number of replicas of the server hosting the model, depending on other models that are loaded on the same server replica.

    Note: Autoscaling of servers is required in the case of Multi-Model Serving as the models are dynamically loaded and unloaded onto these server replicas. In this case Core 2 would autoscale server replicas according to changes to the model replicas that are required. This is in contrast to single-model autoscaling approach explained

    https://github.com/SeldonIO/seldon-core/blob/v2/samples/experiments/sklearn-mirror.yaml
    https://github.com/SeldonIO/seldon-core/blob/v2/samples/experiments/ab-default-model.yaml
    kubectl get svc -n seldon-mesh
    kubectl delete pod <seldon-dataflow-engine> -n seldon-mesh 
    kubectl create namespace seldon-mesh || echo "namespace seldon-mesh exists"
    helm repo add strimzi https://strimzi.io/charts/
    helm repo update
    helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator --namespace seldon-mesh
    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: seldon
      namespace: seldon-mesh
      annotations:
        strimzi.io/node-pools: enabled
        strimzi.io/kraft: enabled
    spec:
      kafka:
        replicas: 3
        version: 3.9.0
        listeners:
          - name: plain
            port: 9092
            tls: false
            type: internal
          - name: tls
            port: 9093
            tls: true
            type: internal
        config:
          processMode: kraft
          auto.create.topics.enable: true
          default.replication.factor: 1
          inter.broker.protocol.version: 3.7
          min.insync.replicas: 1
          offsets.topic.replication.factor: 1
          transaction.state.log.min.isr: 1
          transaction.state.log.replication.factor: 1
      entityOperator: null
    kubectl apply -f kafka.yaml -n seldon-mesh
    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaNodePool
    metadata:
      name: kafka
      namespace: seldon-mesh
      labels:
        strimzi.io/cluster: seldon
    spec:
      replicas: 3
      roles:
        - broker
        - controller
      resources:
        requests:
          cpu: '500m'
          memory: '2Gi'
        limits:
          memory: '2Gi'
      template:
        pod:
          tmpDirSizeLimit: 1Gi
      storage:
        type: jbod
        volumes:
          - id: 0
            type: ephemeral
            sizeLimit: 500Mi
            kraftMetadata: shared
          - id: 1
            type: persistent-claim
            size: 10Gi
            deleteClaim: false
    kubectl apply -f kafka-nodepool.yaml -n seldon-mesh
    kubectl get pods -n seldon-mesh
    ```bash
    NAME                                            READY   STATUS    RESTARTS      AGE
    hodometer-5489f768bf-9xnmd                      1/1     Running   0             25m
    mlserver-0                                      3/3     Running   0             24m
    seldon-dataflow-engine-75f9bf6d8f-2blgt         1/1     Running   5 (23m ago)   25m
    seldon-envoy-7c764cc88-xg24l                    1/1     Running   0             25m
    seldon-kafka-0                                  1/1     Running   0             21m
    seldon-kafka-1                                  1/1     Running   0             21m
    seldon-kafka-2                                  1/1     Running   0             21m
    seldon-modelgateway-54d457794-x4nzq             1/1     Running   0             25m
    seldon-pipelinegateway-6957c5f9dc-6blx6         1/1     Running   0             25m
    seldon-scheduler-0                              1/1     Running   0             25m
    seldon-v2-controller-manager-7b5df98677-4jbpp   1/1     Running   0             25m
    strimzi-cluster-operator-66b5ff8bbb-qnr4l       1/1     Running   0             23m
    triton-0                                        3/3     Running   0             24m
    ```
    kubectl logs <seldon-dataflow-engine> -n seldon-mesh
    WARN [main] org.apache.kafka.clients.ClientUtils : Couldn't resolve server seldon-kafka-bootstrap.seldon-mesh:9092 from bootstrap.servers as DNS resolution failed for seldon-kafka-bootstrap.seldon-mesh
    kubectl get configmaps -n seldon-mesh
    NAME                       DATA   AGE
    kube-root-ca.crt           1      38m
    seldon-agent               1      30m
    seldon-kafka               1      30m
    seldon-kafka-0             6      26m
    seldon-kafka-1             6      26m
    seldon-kafka-2             6      26m
    seldon-manager-config      1      30m
    seldon-tracing             4      30m
    strimzi-cluster-operator   1      28m
    kubectl get configmap seldon-kafka -n seldon-mesh -o yaml
    apiVersion: v1
    data:
      kafka.json: '{"bootstrap.servers":"seldon-kafka-bootstrap.seldon-mesh:9092","consumer":{"auto.offset.reset":"earliest","message.max.bytes":"1000000000","session.timeout.ms":"6000","topic.metadata.propagation.max.ms":"300000"},"producer":{"linger.ms":"0","message.max.bytes":"1000000000"},"topicPrefix":"seldon"}'
    kind: ConfigMap
    metadata:
      creationTimestamp: "2024-12-05T07:12:57Z"
      name: seldon-kafka
      namespace: seldon-mesh
      ownerReferences:
      - apiVersion: mlops.seldon.io/v1alpha1
        blockOwnerDeletion: true
        controller: true
        kind: SeldonRuntime
        name: seldon
        uid: 9e724536-2487-487b-9250-8bcd57fc52bb
      resourceVersion: "778"
      uid: 5c041e69-f36b-4f14-8f0d-c8790003cb3e
    helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
    --namespace seldon-mesh \
    -f values-runtime-kafka-compression.yaml \
    --install
    helm upgrade --install seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
    --namespace seldon-mesh \
    --set controller.clusterwide=true \
    --set kafka.topicPrefix=myorg \
    --set kafka.consumerGroupIdPrefix=myorg
    curl -k http://<INGRESS_IP>:80/v2/models/counter-pipeline/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: counter-pipeline.pipeline" \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "counter-pipeline.inputs",
            "datatype": "INT32",
            "shape": [1],
            "data": [0]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            10
          ],
          "name": "output",
          "shape": [
            1,
            1
          ],
          "datatype": "INT32",
          "parameters": {
            "content_type": "np"
          }
        },
        {
          "data": [
            ""
          ],
          "name": "stop",
          "shape": [
            1,
            1
          ],
          "datatype": "BYTES",
          "parameters": {
            "content_type": "str"
          }
        }
      ]
    }
    seldon pipeline infer counter-pipeline --inference-host <INGRESS_IP>:80\
      '{"inputs":[{"name":"counter-pipeline.inputs","shape":[1],"datatype":"INT32","data":[0]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            10
          ],
          "name": "output",
          "shape": [
            1,
            1
          ],
          "datatype": "INT32",
          "parameters": {
            "content_type": "np"
          }
        },
        {
          "data": [
            ""
          ],
          "name": "stop",
          "shape": [
            1,
            1
          ],
          "datatype": "BYTES",
          "parameters": {
            "content_type": "str"
          }
        }
      ]
    }
    ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
    echo "Seldon Core 2: http://$ISTIO_INGRESS"
    from mlserver.model import MLModel
    from mlserver.codecs import NumpyCodec, StringCodec
    from mlserver.types import InferenceRequest, InferenceResponse
    from mlserver.logging import logger
    
    
    class Counter(MLModel):
        async def load(self) -> bool:
            self.ready = True
            return self.ready
    
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            x = NumpyCodec.decode_input(payload.inputs[0]) + 1
            message = "continue" if x.item() < 10 else "stop"
            return InferenceResponse(
                model_name=self.name,
                model_version=self.version,
                outputs=[
                    NumpyCodec.encode_output(
                        name="output",
                        payload=x
                    ),
                    StringCodec.encode_output(
                        name=message,
                        payload=[""]
                    ),
                ]
            )
    import asyncio
    from mlserver.logging import logger
    from mlserver import MLModel, ModelSettings
    from mlserver.types import (
        InferenceRequest, InferenceResponse, ResponseOutput
    )
    
    
    class IdentityModel(MLModel):
        def __init__(self, settings: ModelSettings):
            super().__init__(settings)
            self.params = settings.parameters
            self.extra = self.params.extra if self.params is not None else None
            # Guard against a missing "extra" section in the model settings
            self.delay = self.extra.get("delay", 0) if self.extra is not None else 0
            
    
        async def load(self) -> bool:
            self.ready = True
            return self.ready
    
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            if self.delay:
                await asyncio.sleep(self.delay)
            
            return InferenceResponse(
                model_name=self.name,
                model_version=self.version,
                outputs=[
                    ResponseOutput(
                        name=request_input.name,
                        shape=request_input.shape,
                        datatype=request_input.datatype,
                        parameters=request_input.parameters,
                        data=request_input.data
                    ) for request_input in payload.inputs
                ]
            )
    {
        "name": "counter",
        "implementation": "model.Counter",
        "parameters": {
            "version": "v0.1.0"
        }
    }
    {
        "name": "identity-loop",
        "implementation": "model.IdentityModel",
        "parameters": {
            "version": "v0.1.0",
            "extra": {
                "delay": 0.001
            }
        }
    }
    {
        "name": "identity-output",
        "implementation": "model.IdentityModel",
        "parameters": {
            "version": "v0.1.0"
        }
    }
    cat ./models/counter.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: counter
    spec:
      storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/counter"
      requirements:
      - mlserver
    cat ./models/identity-loop.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: identity-loop
    spec:
      storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-loop"
      requirements:
      - mlserver
    cat ./models/identity-output.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: identity-output
    spec:  
      storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-output"
      requirements:
      - mlserver
    kubectl apply -f - --namespace=seldon-mesh <<EOF
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: counter
    spec:
      storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/counter"
      requirements:
        - mlserver
    EOF
    kubectl apply -f - --namespace=seldon-mesh <<EOF
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: identity-loop
    spec:
      storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-loop"
      requirements:
        - mlserver
    EOF
    kubectl apply -f - --namespace=seldon-mesh <<EOF
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: identity-output
    spec:
      storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-output"
      requirements:
        - mlserver
    EOF
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    
    model.mlops.seldon.io/counter condition met
    model.mlops.seldon.io/identity-loop condition met
    model.mlops.seldon.io/identity-output condition met
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: counter-pipeline
    spec:
      allowCycles: true
      maxStepRevisits: 100
      steps:
      - name: counter
        inputsJoinType: any
        inputs:
        - counter-pipeline.inputs
        - identity-loop.outputs
      - name: identity-output
        joinWindowMs: 1
        inputs:
        - counter.outputs
        triggers:
        - counter.outputs.stop
      - name: identity-loop
        joinWindowMs: 1
        inputs:
        - counter.outputs.output
        triggers:
        - counter.outputs.continue
      output:
        steps:
        - identity-output.outputs
    kubectl create -f counter-pipeline.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/counter-pipeline created
    kubectl delete pipeline counter-pipeline -n seldon-mesh
    pipeline.mlops.seldon.io "counter-pipeline" deleted
    kubectl delete model identity-loop -n seldon-mesh
    kubectl delete model identity-output -n seldon-mesh
    kubectl delete model counter -n seldon-mesh
    model.mlops.seldon.io "identity-loop" deleted
    model.mlops.seldon.io "identity-output" deleted
    model.mlops.seldon.io "counter" deleted
    kubectl create namespace tracing
    helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
    helm repo update
    rbac:
      clusterRole: true
      create: true
      pspEnabled: false
    helm upgrade tracing jaegertracing/jaeger-operator \
      --version 2.57.0 \
      -f tracing-values.yaml \
      -n tracing \
      --install
    kubectl get pods -n tracing
    NAME                                       READY   STATUS    RESTARTS   AGE
    tracing-jaeger-operator-549b79b848-h4p4d   1/1     Running   0          96s
    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: simplest
      namespace: seldon-mesh
    kubectl apply -f jaeger-simplest.yaml
    kubectl get pods -n seldon-mesh | grep simplest
    NAME                       READY  STATUS    RESTARTS   AGE
    simplest-8686f5d96-4ptb4   1/1    Running   0          45s
    kubectl patch seldonruntime seldon -n seldon-mesh \
      --type merge \
      -p '{"spec":{"config":{"kafkaConfig":{"bootstrap.servers":"seldon-kafka-bootstrap.seldon-mesh:9092","consumer":{"auto.offset.reset":"earliest"},"topics":{"numPartitions":4}},"scalingConfig":{"servers":{}},"tracingConfig":{"otelExporterEndpoint":"simplest-collector.seldon-mesh:4317"}}}}'
    seldonruntime.mlops.seldon.io/seldon patched
    spec:
      config:
        agentConfig:
          rclone: {}
        kafkaConfig:
          bootstrap.servers: seldon-kafka-bootstrap.seldon-mesh:9092
          consumer:
            auto.offset.reset: earliest
          topics:
            numPartitions: 4
        scalingConfig:
          servers: {}
        serviceConfig: {}
        tracingConfig:
          otelExporterEndpoint: simplest-collector.seldon-mesh:4317
    kubectl port-forward svc/simplest-query -n seldon-mesh 16686:16686
    http://localhost:16686
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: pipeline
    spec:
      allowCycles: true
      maxStepRevisits: 100
      ...
    type PipelineSpec struct {
    	// External inputs to this pipeline, optional
    	Input *PipelineInput `json:"input,omitempty"`
    
    	// The steps of this inference graph pipeline
    	Steps []PipelineStep `json:"steps"`
    
    	// Synchronous output from this pipeline, optional
    	Output *PipelineOutput `json:"output,omitempty"`
    
    	// Dataflow specs
    	Dataflow *DataflowSpec `json:"dataflow,omitempty"`
    
    	// Allow cyclic pipeline
    	AllowCycles bool `json:"allowCycles,omitempty"`
    
    	// Maximum number of times a step can be revisited
    	MaxStepRevisits uint32 `json:"maxStepRevisits,omitempty"` 
    }
    
    type DataflowSpec struct {
    	// Flag to indicate whether the kafka input/output topics
    	// should be cleaned up when the model is deleted
    	// Default false
    	CleanTopicsOnDelete bool `json:"cleanTopicsOnDelete,omitempty"`
    }
    
    // +kubebuilder:validation:Enum=inner;outer;any
    type JoinType string
    
    const (
    	// data must be available from all inputs
    	JoinTypeInner JoinType = "inner"
    	// data will include any data from any inputs at end of window
    	JoinTypeOuter JoinType = "outer"
    	// first data input that arrives will be forwarded
    	JoinTypeAny JoinType = "any"
    )
    
    type PipelineStep struct {
    	// Name of the step
    	Name string `json:"name"`
    
    	// Previous step to receive data from
    	Inputs []string `json:"inputs,omitempty"`
    
    	// msecs to wait for messages from multiple inputs to arrive before joining the inputs
    	JoinWindowMs *uint32 `json:"joinWindowMs,omitempty"`
    
    	// Map of tensor name conversions to use e.g. output1 -> input1
    	TensorMap map[string]string `json:"tensorMap,omitempty"`
    
    	// Triggers required to activate step
    	Triggers []string `json:"triggers,omitempty"`
    
    	// +kubebuilder:default=inner
    	InputsJoinType *JoinType `json:"inputsJoinType,omitempty"`
    
    	TriggersJoinType *JoinType `json:"triggersJoinType,omitempty"`
    
    	// Batch size of request required before data will be sent to this step
    	Batch *PipelineBatch `json:"batch,omitempty"`
    }
    
    type PipelineBatch struct {
    	Size     *uint32 `json:"size,omitempty"`
    	WindowMs *uint32 `json:"windowMs,omitempty"`
    	Rolling  bool    `json:"rolling,omitempty"`
    }
    
    type PipelineInput struct {
    	// Previous external pipeline steps to receive data from
    	ExternalInputs []string `json:"externalInputs,omitempty"`
    
    	// Triggers required to activate inputs
    	ExternalTriggers []string `json:"externalTriggers,omitempty"`
    
    	// msecs to wait for messages from multiple inputs to arrive before joining the inputs
    	JoinWindowMs *uint32 `json:"joinWindowMs,omitempty"`
    
    	// +kubebuilder:default=inner
    	JoinType *JoinType `json:"joinType,omitempty"`
    
    	// +kubebuilder:default=inner
    	TriggersJoinType *JoinType `json:"triggersJoinType,omitempty"`
    
    	// Map of tensor name conversions to use e.g. output1 -> input1
    	TensorMap map[string]string `json:"tensorMap,omitempty"`
    }
    
    type PipelineOutput struct {
    	// Previous step to receive data from
    	Steps []string `json:"steps,omitempty"`
    
    	// msecs to wait for messages from multiple inputs to arrive before joining the inputs
    	JoinWindowMs uint32 `json:"joinWindowMs,omitempty"`
    
    	// +kubebuilder:default=inner
    	StepsJoin *JoinType `json:"stepsJoin,omitempty"`
    
    	// Map of tensor name conversions to use e.g. output1 -> input1
    	TensorMap map[string]string `json:"tensorMap,omitempty"`
    }
    https://github.com/SeldonIO/seldon-core/blob/v2/samples/experiments/addmul10-mirror.yaml
    where Server and Model replicas are independently scaled using HPA that targets the same metric.

    To enable autoscaling of server replicas, the following requirements need to be met:

    1. Setting minReplicas and maxReplicas in the Server CR. This will define the minimum and maximum number of server replicas that can be created.

    2. Setting the autoscaling.autoscalingServerEnabled value to true (default) during installation of the Core 2 seldon-core-v2-setup helm chart. If not installing via helm, setting the ENABLE_SERVER_AUTOSCALING environment variable to true in the seldon-scheduler podSpec (via either SeldonConfig or a SeldonRuntime podSpec override) will have the same effect. This will enable the autoscaling of server replicas.
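
    As a sketch of the helm-based route (release and chart names follow the ones used elsewhere in this document; adjust the namespace to wherever the operator is installed):

    # Enable native server autoscaling at install/upgrade time (it defaults to true)
    helm upgrade --install seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
      --namespace seldon-system \
      --set autoscaling.autoscalingServerEnabled=true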

    An example of a Server CR with autoscaling enabled is shown below:

    Note: Not setting minReplicas and/or maxReplicas will also effectively disable autoscaling of server replicas. In this case, the user will need to manually scale the server replicas by setting the replicas field in the Server CR. This allows external autoscaling mechanisms to be used e.g. HPA.

    Server Scaling Logic

    Scale Up

    When we want to scale up the number of replicas for a model, the associated servers might not have enough capacity (replicas) available. In this case we need to scale up the server replicas to match the number required by our models. There is currently only one policy for scaling up server replicas: Model Replica Count. This policy scales up the server replicas to match the number of model replicas that are required; in other words, if a model is scaled up, the system scales up the server replicas to host the new model replicas. During the scale-up process, the system creates new server replicas with the same configuration as the existing ones (server configuration, resources, etc.) and adds them to the existing server replicas to host the new model replicas.

    There is a period of time during which the new server replicas are being created and the new model replicas are being loaded onto them. During this period, the existing server replicas keep serving load so that there is no downtime: the new model replicas are partially scheduled onto the new server replicas, which are gradually loaded while the existing replicas continue to serve traffic. Check the Partial Scheduling document for more details.

    Scale Down

    Once we have scaled down the number of replicas for a model, some of the corresponding server replicas might be left unused (depending on whether those replicas are hosting other models or not). If that is the case, the extra server pods might incur unnecessary infrastructure cost (especially if they have expensive resources such as GPUs attached). Scaling down servers in sync with models is not straightforward in the case of Multi-Model Serving. Scaling down one model does not necessarily mean that we also need to scale down the corresponding Server replica as this replica might be still serving load for other models. Therefore we define heuristics that can be used to scale down servers if we think that they are not properly used, described in the policies below.

    Note: Scaling down the number of replicas for an inference server does not necessarily remove the specific replica that we want to remove. As Servers are currently deployed as StatefulSets, scaling down the number of replicas removes the pod with the largest index.

    Upon scaling down Servers, the system will rebalance. Models from a draining server replica will be rescheduled after some wait time. This draining process is done without incurring downtime as models are being rescheduled onto other server replicas before the draining server replica is removed.

    Policies

    There are two possible policies we use to define the scale down of Servers:

    1. Empty Server Replica: In the simplest case we can remove a server replica if it does not host any models. This guarantees that there is no load on a particular server replica before removing it. This policy works best in the case of single model serving where the server replicas are only hosting a single model. In this case, if the model is scaled down, the server replica will be empty and can be removed.

    However in the case of MMS, only reducing the number of server replicas when one of the replicas no longer hosts any models can lead to a suboptimal packing of models onto server replicas. This is because the system will not automatically pack models onto the smaller set of replicas. This can lead to more server replicas being used than necessary. This can be mitigated by the lightly loaded server replicas policy.

    2. Lightly Loaded Server Replicas (Experimental):

    Warning: This policy is experimental and is not enabled by default. It can be enabled by setting autoscaling.serverPackingEnabled to true and autoscaling.serverPackingPercentage to a value between 0 and 100. This policy is still under development and might in some cases increase latencies, so it's worth testing ahead of time to observe behavior for a given setup.
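
    If you do want to experiment with the policy, the opt-in could look roughly like this (the values are illustrative; chart and release names follow the rest of this document):

    # Opt in to the experimental packing policy
    helm upgrade --install seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
      --namespace seldon-system \
      --set autoscaling.serverPackingEnabled=true \
      --set autoscaling.serverPackingPercentage=50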

    Using the above policy with MMS enabled, different model replicas can end up hosted on different server replicas, and as these models are scaled up and down the system can end up in a situation where the models are not consolidated onto an optimal number of servers. For illustration, take the case of 3 models: A, B and C. We have 1 server S with 2 replicas, S1 and S2, that can host these 3 models. Assuming that initially A and B have 1 replica each and C has 2 replicas, the assignment is:

    Initial assignment:

    • S1: A1, C1

    • S2: B1, C2

    Now if the user unloads model C, the assignment is:

    • S1: A1

    • S2: B1

    There is an argument that this might not be optimal, and with MMS the assignment could instead be:

    • S1: A1, B1

    • S2: removed

    As the system evolves, this imbalance can grow and cause the serving infrastructure to be less optimized. The behavior above is not actually limited to autoscaling, but autoscaling aggravates the issue, causing more imbalance over time. The imbalance can be mitigated by making the following observation: if the maximum number of replicas of any given model (logically assigned to a server) is less than the number of replicas of this server, then we can pack the hosted models onto a smaller set of replicas. Note that in Core 2 a server replica can host only one replica of a given model.

    In other words, consider the following example: models A and B have 2 replicas each and server S has 3 replicas. The following assignment is potentially not optimal:

    • S1: A1, B1

    • S2: A2

    • S3: B2

    In this case we could trigger the removal of S3 for this server, which packs the models more appropriately:

    • S1: A1, B1

    • S2: A2, B2

    • S3: removed

    While this heuristic packs models onto a smaller set of replicas, which allows the servers to be scaled down, there is still the risk that the packing could increase latencies or trigger a later scale-up. Core 2 ensures consistent behavior without oscillating between states. The user can also reduce the number of packing events by setting autoscaling.serverPackingPercentage to a lower value.

    Currently, Core 2 triggers the packing logic only when a model replica is being removed, either from a model scale-down or a model deletion. In the future, the logic might be triggered more frequently to improve model packing onto a smaller set of replicas.

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: sklearn-mirror
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 100
      mirror:
        name: iris2
        percent: 100
    
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-sample
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 50
      - name: iris2
        weight: 50
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: addmul10-mirror
    spec:
      default: pipeline-add10
      resourceType: pipeline
      candidates:
      - name: pipeline-add10
        weight: 100
      mirror:
        name: pipeline-mul10
        percent: 100
    
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver
      namespace: seldon
    spec:
      replicas: 2
      minReplicas: 1
      maxReplicas: 4
      serverConfig: mlserver
    Amazon MSK (security: mTLS)
  • Amazon MSK (security: SASL/SCRAM)

  • Azure Event Hub (security: SASL/PLAIN)

  • Integrating managed Kafka services

    These instructions outline the necessary configurations to integrate managed Kafka services with Seldon Core 2.

    1. Example configurations

    2. Configuring with Seldon Core 2

    Securing managed Kafka services

    You can secure Seldon Core 2 integration with managed Kafka services using:

    • Kafka Encryption

    • Kafka Authentication

    Kafka Encryption (TLS)

    In production settings, always set up TLS encryption with Kafka. This ensures that neither the credentials nor the payloads are transported in plaintext.

    Note: TLS encryption here involves only single-sided TLS. This means that the contents are encrypted and sent to the server, but the client does not send any form of certificate. Therefore, it does not take care of authenticating the client. Client authentication can be configured through mutual TLS (mTLS) or a SASL mechanism, which are covered in the Kafka Authentication section.

    When TLS is enabled, the client needs to know the root CA certificate used to create the server's certificate. This is used to validate the certificate sent back by the Kafka server.

    1. Obtain the server's root CA certificate, encoded as PEM, and save it as ca.crt. It is important that the file is named ca.crt; otherwise, Seldon Core 2 may not be able to find the certificate. Within the cluster, you provide the server's root CA certificate through a secret, for example a secret named kafka-broker-tls containing the certificate.
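
    A minimal sketch of creating such a secret (the names mirror the example above; the key must be ca.crt):

    # Provide the broker's root CA certificate to Seldon Core 2
    kubectl create secret generic kafka-broker-tls \
      --namespace seldon-mesh \
      --from-file=ca.crt=./ca.crt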

    Kafka Authentication

    In production environments, Kafka clusters often require authentication, especially when using managed Kafka solutions. Therefore, when installing Seldon Core 2 components, it is crucial to provide the correct credentials for a secure connection to Kafka.

    The type of authentication used with Kafka varies depending on the setup but typically includes one of the following:

    • Simple Authentication and Security Layer (SASL): Requires a username and password.

    • Mutual TLS (mTLS): Involves using SSL certificates as credentials.

    • OAuth 2.0: Uses the client credential flow to acquire a JWT token.

    These credentials are stored as Kubernetes secrets within the cluster. When setting up Seldon Core 2, you must create the appropriate secret in the correct format and update the components-values.yaml and install-values files accordingly.

    When you use SASL as the authentication mechanism for Kafka, the credentials consist of a username and password pair. The password is supplied through a secret.

    Note:

    • Ensure that the field used for the password within the secret is named password. Otherwise, Seldon Core 2 may not be able to find the correct password.

    • This password must be present in seldon-logs namespace and every namespace containing Seldon Core 2 runtime.

    1. Create a password for Seldon Core 2 in the namespace seldon-mesh.
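
    A minimal sketch, assuming a secret named kafka-sasl-secret (illustrative); the key must be named password, as noted above:

    # Store the Kafka SASL password as a Kubernetes secret
    kubectl create secret generic kafka-sasl-secret \
      --namespace seldon-mesh \
      --from-literal=password='<kafka-password>'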

    Values in Seldon Core 2

    In Seldon Core 2 you need to specify these values in your Helm values file:

    • security.kafka.sasl.mechanism - SASL security mechanism, e.g. SCRAM-SHA-512

    • security.kafka.sasl.client.username - Kafka username

    • security.kafka.sasl.client.secret - Created secret with password

    The resulting set of values to include in values.yaml is similar to:
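
    A sketch of what this values block could look like, using only the keys named above (the mechanism, username, and secret names are placeholders to adapt to your provider):

    # Write the SASL-related values into values.yaml
    cat > values.yaml <<EOF
    security:
      kafka:
        sasl:
          mechanism: SCRAM-SHA-512
          client:
            username: <kafka-username>
            secret: kafka-sasl-secret
        ssl:
          client:
            brokerValidationSecret: kafka-broker-tls
    EOF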

    The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use well known Certificate Authority such as Let's Encrypt.

    When you use OAuth 2.0 as the authentication mechanism for Kafka, the credentials consist of a Client ID and Client Secret, which are used with your Identity Provider to obtain JWT tokens for authenticating with Kafka brokers.

    1. Create a Kubernetes secret kafka-oauth.yaml file.

    2. Store the secret in the seldon-mesh namespace to configure with Seldon Core 2.

      This secret must be present in seldon-logs namespace and every namespace containing Seldon Core 2 runtime.

    When you use mTLS as the authentication mechanism, Kafka uses a set of certificates to authenticate the client:

    • A client certificate, referred to as tls.crt.

    • A client key, referred to as tls.key.

    Example configurations for managed Kafka services

    Here are some examples of creating secrets for managed Kafka services such as Azure Event Hub, Confluent Cloud (SASL), and Confluent Cloud (OAuth 2.0).

    Prerequisites:

    • You must use at least the Standard tier for your Event Hub namespace because the Basic tier does not support the Kafka protocol.

    • Seldon Core 2 creates two Kafka topics for each model and pipeline, plus one global topic for errors. This results in a total number of topics calculated as: 2 x (number of models + number of pipelines) + 1. This topic count is likely to exceed the limit of the Standard tier in Azure Event Hub. For more information, see quota information.
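
      For example, deploying 10 models and 5 pipelines would require 2 x (10 + 5) + 1 = 31 topics.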

    Creating a namespace and obtaining the connection string

    These are the steps that you need to perform in Azure Portal.

    1. Create an Azure Event Hub namespace. Follow the Azure Event Hubs documentation to create one. Note: You do not need to create individual Event Hubs (topics), as Seldon Core 2 automatically creates all necessary topics.

    2. Obtain the connection string for Kafka integration. To connect to Azure Event Hub using the Kafka API, you need to obtain the Kafka endpoint and the connection string. For more information, see the Azure Event Hubs documentation.

      Note: Ensure you get the Connection string at the namespace level, as it is needed to dynamically create new topics. The format of the Connection string should be:

    Creating secrets for Seldon Core 2

    These are the steps that you need to perform in the Kubernetes cluster that runs Seldon Core 2 to store the SASL password.

    1. Create a secret named azure-kafka-secret for Seldon Core 2 in the namespace seldon. In the following command, make sure to replace <password> with a password of your choice and <namespace> with the namespace from Azure Event Hub.

    2. Create a secret named azure-kafka-secret for Seldon Core 2 in the namespace seldon-system. In the following command, make sure to replace <password> with a password of your choice and <namespace> with the namespace from Azure Event Hub.
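
    A minimal sketch of both commands (the key must be named password; substitute <password> as described above):

    # Secret for the Core 2 runtime namespace
    kubectl create secret generic azure-kafka-secret \
      --namespace seldon \
      --from-literal=password='<password>'

    # Secret for the operator namespace
    kubectl create secret generic azure-kafka-secret \
      --namespace seldon-system \
      --from-literal=password='<password>'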

    Creating API Keys

    These are the steps that you need to perform in Confluent Cloud.

    1. Navigate to Clients > New client, choose a client (for example, Go), and generate a new Kafka cluster API key. For more information, see the Confluent Cloud documentation.

    Confluent generates a configuration file with the details.

    2. Save the values of the Key and Secret.

    Confluent Cloud managed Kafka supports OAuth 2.0 to authenticate your Kafka clients. See Confluent Cloud documentation for further details.

    Configuring Identity Provider

In the Confluent Cloud Console, navigate to Account & Access / Identity providers and complete these steps:

1. Register your Identity Provider. See the Confluent Cloud documentation for further details.

2. Add a new identity pool to your newly registered Identity Provider. See the Confluent Cloud documentation for further details.

    Configuring Seldon Core 2

To integrate Kafka with Seldon Core 2, complete the following steps.

    1. Update the initial configuration.

Note: In these configurations you may need to:

• Tweak the values for replicationFactor and numPartitions to best suit your cluster configuration.

• Set the value of username to $ConnectionString. This is a literal value, not a variable.

• Replace <namespace> with the namespace of your Azure Event Hub.

    Update the initial configuration for Seldon Core 2 in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:

Note: In these configurations you may need to tweak the values for replicationFactor and numPartitions to best suit your cluster configuration.

1. To enable Kafka Encryption (TLS), you need to reference the secret that you created in the security.kafka.ssl.client.secret field of the Helm chart values. The resulting set of values to include in components-values.yaml is similar to:

1. Change to the directory that contains the components-values.yaml file and then install the Seldon Core 2 operator in the namespace seldon-system.

After you have integrated Seldon Core 2 with Kafka, you need to install an Ingress Controller, which adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.

• Installing and configuring the Prometheus Adapter, which allows Prometheus queries on relevant metrics to be published as k8s custom metrics

  • Configuring HPA manifests to scale Models

  • Each Kubernetes cluster supports only one active custom metrics provider. If your cluster already uses a custom metrics provider different from prometheus-adapter, it will need to be removed before being able to scale Core 2 models and servers via HPA. The Kubernetes community is actively exploring solutions for allowing multiple custom metrics providers to coexist.

    Installing and configuring the Prometheus Adapter

The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom or external metrics. Those can then be accessed by HPA in order to make scaling decisions.

    To install through helm:

    These commands install prometheus-adapter as a helm release named hpa-metrics in the same namespace where Prometheus is installed, and point to its service URL (without the port).

    The URL is not fully qualified as it references a Prometheus instance running in the same namespace. If you are using a separately-managed Prometheus instance, please update the URL accordingly.

If you are running Prometheus on a different port than the default 9090, you can also pass --set prometheus.port=[custom_port]. You may inspect all the options available as helm values by running helm show values prometheus-community/prometheus-adapter.

Please check that the metricsRelistInterval helm value (which defaults to 1m) works well in your setup, and update it otherwise. This value needs to be larger than or equal to your Prometheus scrape interval. The corresponding prometheus-adapter command-line argument is --metrics-relist-interval. If the relist interval is set incorrectly, some of the custom metrics will be intermittently reported as missing.
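For instance, assuming the default 1m Prometheus scrape interval, you can set the relist interval explicitly at install time; prometheus.url and metricsRelistInterval are the chart values referred to above.

```bash
# Install or upgrade the adapter with an explicit relist interval.
helm upgrade --install hpa-metrics prometheus-community/prometheus-adapter \
  -n seldon-monitoring \
  --set prometheus.url='http://seldon-monitoring-prometheus' \
  --set metricsRelistInterval=1m
```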

    We now need to configure the adapter to look for the correct prometheus metrics and compute per-model RPS values. On install, the adapter has created a ConfigMap in the same namespace as itself, named [helm_release_name]-prometheus-adapter. In our case, it will be hpa-metrics-prometheus-adapter.

    Overwrite the ConfigMap as shown in the following manifest, after applying any required customizations.

    Change the name if you've chosen a different value for the prometheus-adapter helm release name. Change the namespace to match the namespace where prometheus-adapter is installed.

    In this example, a single rule is defined to fetch the seldon_model_infer_total metric from Prometheus, compute its per second change rate based on data within a 2 minute sliding window, and expose this to Kubernetes as the infer_rps metric, with aggregations available at model, server, inference server pod and namespace level.

When HPA requests the infer_rps metric via the custom metrics API for a specific model, prometheus-adapter issues a Prometheus query in line with what is defined in its config.

    For the configuration in our example, the query for a model named irisa0 in namespace seldon-mesh would be:

    You may want to modify the query in the example to match the one that you typically use in your monitoring setup for RPS metrics. The example calls rate() with a 2 minute sliding window. Values scraped at the beginning and end of the 2 minute window before query time are used to compute the RPS.

    It is important to sanity-check the query by executing it against your Prometheus instance. To do so, pick an existing model CR in your Seldon Core 2 install, and send some inference requests towards it. Then, wait for a period equal to at least twice the Prometheus scrape interval (Prometheus default 1 minute), so that two values from the series are captured and a rate can be computed. Finally, you can modify the model name and namespace in the query above to match the model you've picked and execute the query.
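One way to do this, assuming the in-cluster Prometheus service referenced by the prometheus.url value above and a local port-forward, is to call the Prometheus HTTP API directly:

```bash
# Forward the Prometheus API port locally (service name and namespace assumed
# from the prometheus-adapter install command in this guide).
kubectl port-forward svc/seldon-monitoring-prometheus 9090:9090 -n seldon-monitoring &

# Run the example RPS query; adjust the model and namespace labels to match
# the model you picked.
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (model) (rate(seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[2m]))' | jq
```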

    If the query result is empty, please adjust it until it consistently returns the expected metric values. Pay special attention to the window size (2 minutes in the example): if it is smaller than twice the Prometheus scrape interval, the query may return no results. A compromise needs to be reached to set the window size large enough to reject noise but also small enough to make the result responsive to quick changes in load.

    Update the metricsQuery in the prometheus-adapter ConfigMap to match any query changes you have made during tests.

    A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and Pipelines is available here, and those may be used when customizing the configuration.

    Customizing prometheus-adapter rule definitions

    The rule definition can be broken down in four parts:

    Discovery

    Discovery (the seriesQuery and seriesFilters keys) controls what Prometheus metrics are considered for exposure via the k8s custom metrics API.

As an alternative to the example above, all the Seldon Prometheus metrics of the form seldon_model.*_total could be considered, followed by excluding metrics pre-aggregated across all models (.*_aggregate_.*) as well as the cumulative infer time per model (.*_seconds_total):

For RPS, we are only interested in the model inference count (seldon_model_infer_total).

    Association

    Association (the resources key) controls the Kubernetes resources that a particular metric can be attached to or aggregated over.

    The resources key defines an association between certain labels from the Prometheus metric and k8s resources. For example, on line 17, "model": {group: "mlops.seldon.io", resource: "model"} lets prometheus-adapter know that, for the selected Prometheus metrics, the value of the "model" label represents the name of a k8s model.mlops.seldon.io CR.

    One k8s custom metric is generated for each k8s resource associated with a prometheus metric. In this way, it becomes possible to request the k8s custom metric values for models.mlops.seldon.io/iris or for servers.mlops.seldon.io/mlserver.

    The labels that do not refer to a namespace resource generate "namespaced" custom metrics (the label values refer to resources which are part of a namespace) -- this distinction becomes important when needing to fetch the metrics via kubectl, and in understanding how certain Prometheus query template placeholders are replaced.

    Naming

    Naming (the name key) configures the naming of the k8s custom metric.

    In the example ConfigMap, this is configured to take the Prometheus metric named seldon_model_infer_total and expose custom metric endpoints named infer_rps, which when called return the result of a query over the Prometheus metric. Instead of a literal match, one could also use regex group capture expressions, which can then be referenced in the custom metric name:

    Querying

    Querying (the metricsQuery key) defines how a request for a specific k8s custom metric gets converted into a Prometheus query.

    The query can make use of the following placeholders:

    For a complete reference for how prometheus-adapter can be configured via the ConfigMap, please consult the docs here.

    Once you have applied any necessary customizations, replace the default prometheus-adapter config with the new one, and restart the deployment (this restart is required so that prometheus-adapter picks up the new config):

    Testing the install using the custom metrics API

    In order to test that the prometheus adapter config works and everything is set up correctly, you can issue raw kubectl requests against the custom metrics API

    Note: If no inference requests were issued towards any model in the Seldon install, the metrics configured above will not be available in prometheus, and thus will also not appear when checking via the commands below. Therefore, please first run some inference requests towards a sample model to ensure that the metrics are available — this is only required for the testing of the install.

    Listing the available metrics:

    For namespaced metrics, the general template for fetching is:

    For example:

    • Fetching model RPS metric for a specific (namespace, model) pair (seldon-mesh, irisa0):

    • Fetching model RPS metric aggregated at the (namespace, server) level (seldon-mesh, mlserver):

    • Fetching model RPS metric aggregated at the (namespace, pod) level (seldon-mesh, mlserver-0):

    • Fetching the same metric aggregated at namespace level (seldon-mesh):

    Once metrics are exposed properly, users can use HPA to trigger autoscaling of Kubernetes resources, including custom resources such as Seldon Core's Models and Servers. Implementing autoscaling with HPA for Models, or for Models and Servers together is explained in the following pages.

    HPA
    Prometheus Adapter

    Batch Inference examples (local)

    This example runs you through a series of batch inference requests made to both models and pipelines running on Seldon Core locally.

    Deprecated: The MLServer CLI infer feature is experimental and will be removed in future work.

    Setup

If you haven't already, you'll need to clone the Seldon Core repository and run it locally before you run through this example.

Note: By default, the CLI will expect your inference endpoint to be at 0.0.0.0:9000. If you have customized this, you'll need to redirect the CLI.

    Deploy Models and Pipelines

    First, let's jump in to the samples folder where we'll find some sample models and pipelines we can use:

    Deploy the Iris Model

    Let's take a look at a sample model before we deploy it:

The above manifest will deploy a simple sci-kit learn model based on the iris dataset.

    Let's now deploy that model using the Seldon CLI:

    Deploy the Iris Pipeline

Now that we've deployed our iris model, let's create a pipeline around the model.

    We see that this pipeline only has one step, which is to call the iris model we deployed earlier. We can create the pipeline by running:

    Deploy the Tensorflow Model

To demonstrate batch inference requests to different types of models, we'll also deploy a simple tensorflow model:

    The tensorflow model takes two arrays as inputs and returns two arrays as outputs. The first output is the addition of the two inputs and the second output is the value of (first input - second input).
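For example, if the first input is [5, 3] and the second input is [2, 1], the outputs are [7, 4] and [3, 2] respectively.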

    Let's deploy the model:

    Deploy the Tensorflow Pipeline

    Just as we did for the scikit-learn model, we'll deploy a simple pipeline for our tensorflow model:

    Inspect the pipeline manifest:

    and deploy it:

    Check Model and Pipeline Status

    Once we've deployed a model or pipeline to Seldon Core, we can list them and check their status by running:

    and

Your models and pipelines should be showing a state of ModelAvailable and PipelineReady respectively.

    Test Predictions

Before we run a large batch job of predictions through our models and pipelines, let's quickly check that they work with a single standalone inference request. We can do this using the seldon model infer command.

    Scikit-learn Model

The prediction request body needs to be an Open Inference Protocol compatible payload and also match the expected inputs for the model you've deployed. In this case, the iris model expects data of shape [1, 4] and of type FP32.

    You'll notice that the prediction results for this request come back on outputs[0].data.

    Scikit-learn Pipeline

    Tensorflow Model

You'll notice that the inputs for our tensorflow model look different from the ones we sent to the iris model. This time, we're sending two arrays of shape [1,16]. When sending an inference request, we can optionally choose which outputs we want back by including an {"outputs":...} object.

    Tensorflow Pipeline

    Running the Scikit-Learn Batch Job

    In the samples folder there is a batch request input file: batch-inputs/iris-input.txt. It contains 100 input payloads for our iris model. Let's take a look at the first line in that file:

To run a batch inference job we'll use the MLServer CLI. If you don't already have it installed you can install it using:

    Iris Model

    The inference job can be executed by running the following command:

The mlserver batch component will take your input file batch-inputs/iris-input.txt, distribute those payloads across 5 different workers (--workers 5), collect the responses, and write them to a file /tmp/iris-output.txt. For a full set of options check out the MLServer CLI Reference.

    Checking the Output

    We can check the inference responses by looking at the contents of the output file:

    Iris Pipeline

    We can run the same batch job for our iris pipeline and store the outputs in a different file:

    Checking the Output

    We can check the inference responses by looking at the contents of the output file:

    Running the Tensorflow Batch Job

    The samples folder contains an example batch input for the tensorflow model, just as it did for the scikit-learn model. You can find it at batch-inputs/tfsimple-input.txt. Let's take a look at the first inference request in the file:

    Tensorflow Model

    As before, we can run the inference batch job using the mlserver infer command:

    Checking the Output

    We can check the inference responses by looking at the contents of the output file:

    You should get the following response:

    Tensorflow Pipeline

    Checking the Output

    We can check the inference responses by looking at the contents of the output file:

    Cleaning Up

    Now that we've run our batch examples, let's remove the models and pipelines we created:

    And finally let's spin down our local instance of Seldon Core:

    Usage Metrics

    Learn how to monitor Seldon Core usage metrics, including request rates, latency, and resource utilization for models and pipelines.

    There are various interesting system metrics about how Seldon Core 2 is used. These metrics can be recorded anonymously and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.

    When provided, these metrics are used to understand the adoption of Seldon Core 2 and how you interact with it. For example, knowing how many clusters Seldon Core 2 is running on, if it is used in Kubernetes or for local development, and how many users are benefitting from features such as multi-model serving.

    Architecture

    Hodometer is not an integral part of Seldon Core 2, but rather an independent component which connects to the public APIs of the Seldon Core 2 scheduler. If deployed in Kubernetes, it requests some basic information from the Kubernetes API.

Recorded metrics are sent to Seldon and, optionally, to any additional endpoints you define.

    Privacy

    Hodometer was explicitly designed with privacy of user information and transparency of implementation in mind.

    It does not record any sensitive or identifying information. For example, it has no knowledge of IP addresses, model names, or user information. All information sent to Seldon is anonymised with a completely random cluster identifier.

Hodometer supports different information levels, so you have full control over what metrics are provided to Seldon, if any.

For transparency, the implementation is fully open-source and designed to be easy to read. The full source code is available in the hodometer sub-project, with metrics defined in code. See the list of metrics below for an equivalent table of metrics.

    Performance

    Metrics are collected as periodic snapshots a few times per day. They are lightweight to collect, coming mostly from the Seldon Core v2 scheduler, and are heavily aggregated. As such, they should have minimal impact on CPU, memory, and network consumption.

    Hodometer does not store anything it records, so does not have any persistent storage. As a result, it should not be considered a replacement for tools like Prometheus.

    Configuration

    Metrics levels

    Hodometer supports 3 different metrics levels:

    Level
    Description

    Alternatively, usage metrics can be completely disabled. To do so, simply remove any existing deployment of Hodometer or disable it in the installation for your environment, discussed below.

    Options

    The following environment variables control the behaviour of Hodometer, regardless of the environment it is installed in.

    Flag
    Format
    Example
    Description

    Kubernetes

    Hodometer is installed as a separate deployment, by default in the same namespace as the rest of the Seldon components.

    Helm

If you install Seldon Core v2 by Helm chart, there are values corresponding to the key environment variables discussed above. These Helm values and their equivalents are provided below:

    Helm value
    Environment variable

    If you do not want usage metrics to be recorded, you can disable Hodometer via the hodometer.disable Helm value when installing the runtime Helm chart. The following command disables collection of usage metrics in fresh installations and also serves to remove Hodometer from an existing installation:

Note: It is good practice to set Helm values in a values file. These can be applied by using the -f <filename> switch when running Helm.
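For example, the hodometer.disable setting can be kept in a small values file instead of a --set flag; the release name and chart path below are illustrative and should match your own install.

```bash
# Keep the setting in a values file ...
cat <<EOF > runtime-values.yaml
hodometer:
  disable: true
EOF

# ... and apply it with -f when installing or upgrading the runtime chart.
helm upgrade --install seldon-v2-runtime k8s/helm-charts/seldon-core-v2-runtime \
  --namespace seldon-mesh \
  -f runtime-values.yaml
```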

    Extra publish URLs

    Hodometer can be instructed to publish metrics not only to Seldon, but also to any extra endpoints you specify. This is controlled by the EXTRA_PUBLISH_URLS environment variable, which expects a comma-separated list of HTTP-compatible URLs.

    You might choose to use this for your own usage monitoring. For example, you could capture these metrics and expose them to Prometheus or another monitoring system using your own service.
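For example, you could set the hodometer.extraPublishUrls Helm value listed below to add a single internal collector; the endpoint, release name, and chart path are illustrative.

```bash
# Point Hodometer at an additional, internal metrics collector.
helm upgrade --install seldon-v2-runtime k8s/helm-charts/seldon-core-v2-runtime \
  --namespace seldon-mesh \
  --set hodometer.extraPublishUrls='http://my-metrics-collector:8000'
```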

    Metrics are recorded in MixPanel-compatible format, which employs a highly flexible JSON schema.

For an example of how to define your own metrics listener, see the receiver Go package in the hodometer sub-project.

    List of metrics

    Metric name
    Level
    Format
    Notes

    Production income classifier with drift, outlier and explanations

    To run this notebook you need the inference data. This can be acquired in two ways:

• Run make train, or

    • gsutil cp -R gs://seldon-models/scv2/examples/income/infer-data .

    Pipeline with model, drift detector and outlier detector

    Show predictions from reference set. Should not be drift or outliers.

    Show predictions from drift data. Should be drift and probably not outliers.

    Show predictions from outlier data. Should be outliers and probably not drift.

    Explanations

    Cleanup

    Helm Configuration

    Learn how to install and configure Seldon Core using Helm charts, including component setup and customization options.

    Seldon Core 2 provides a highly configurable deployment framework that allows you to fine-tune various components using Helm configuration options. These options offer control over deployment behavior, resource management, logging, autoscaling, and model lifecycle policies to optimize the performance and scalability of machine learning deployments.

    This section details the key Helm configuration parameters for Envoy, Autoscaling, Server Prestop, and Model Control Plane, ensuring that you can customize deployment workflows and enhance operational efficiency.

    • Envoy: Manage pre-stop behaviors and configure access logging to track request-level interactions.

    • Autoscaling (Experimental): Fine-tune dynamic scaling policies for efficient resource allocation based on real-time inference workloads.

    • Servers: Define grace periods for controlled shutdowns and optimize model control plane parameters for efficient model loading, unloading, and error handling.

    • Logging: Define log levels for the different components of the system.

    Envoy

    Prestop

    Key
    Chart
    Description
    Default

    Access Log

    Key
    Chart
    Description
    Default

    Autoscaling

    Native autoscaling (experimental)

    Key
    Chart
    Description
    Default

    Server

    Prestop

    Key
    Chart
    Description
    Default

    Model Control Plane

    Key
    Chart
    Description
    Default

    Logging

    Component Log Level

    Key
    Chart
    Description
    Default

    Notes:

• We set the Kafka client library log level from the log level that is passed to the component, which could be different from the level expected by librdkafka (syslog level). In this case we attempt to map the log level value to the best match.

    https://github.com/SeldonIO/seldon-core/blob/v2/operator/config/serverconfigs/mlserver.yaml
    kubectl create secret generic kafka-broker-tls -n seldon-mesh --from-file ./ca.crt
    security:
      kafka:
        ssl:
          secret:
            brokerValidationSecret: kafka-broker-tls
     helm upgrade seldon-core-v2-components seldon-charts/seldon-core-v2-setup \
     --version 2.8.0 \
     -f components-values.yaml \
     --namespace seldon-system \
     --install
    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/models.mlops.seldon.io/irisa0/infer_rps
    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/servers.mlops.seldon.io/mlserver/infer_rps
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install --set prometheus.url='http://seldon-monitoring-prometheus' hpa-metrics prometheus-community/prometheus-adapter -n seldon-monitoring
    prometheus-adapter.config.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: hpa-metrics-prometheus-adapter
      namespace: seldon-monitoring
    data:
      config.yaml: |-
        "rules":
        -
          "seriesQuery": |
             {__name__="seldon_model_infer_total",namespace!=""}
          "resources":
            "overrides":
              "model": {group: "mlops.seldon.io", resource: "model"}
              "server": {group: "mlops.seldon.io", resource: "server"}
              "pod": {resource: "pod"}
              "namespace": {resource: "namespace"}
          "name":
            "matches": "seldon_model_infer_total"
            "as": "infer_rps"
          "metricsQuery": |
            sum by (<<.GroupBy>>) (
              rate (
                <<.Series>>{<<.LabelMatchers>>}[2m]
              )
            )
    sum by (model) (
      rate (
        seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[2m]
      )
    )
```yaml
"seriesQuery": |
   {__name__=~"^seldon_model.*_total",namespace!=""}
"seriesFilters":
  - "isNot": "^seldon_.*_seconds_total"
  - "isNot": "^seldon_.*_aggregate_.*"
...
```
    "name":
      "matches": "^seldon_model_(.*)_total"
      "as": "${1}_rps"
    - .Series is replaced by the discovered prometheus metric name (e.g. `seldon_model_infer_total`)
    - .LabelMatchers, when requesting a namespaced metric for resource `X` with name `x` in namespace `n`, is replaced by `X=~"x",namespace="n"`. For example, `model=~"iris0", namespace="seldon-mesh"`. When requesting the namespace resource itself, only the `namespace="n"` is kept.
    - .GroupBy is replaced by the resource type of the requested metric (e.g. `model`, `server`, `pod` or `namespace`).
    # Replace default prometheus adapter config
    kubectl replace -f prometheus-adapter.config.yaml
    # Restart prometheus-adapter pods
    kubectl rollout restart deployment hpa-metrics-prometheus-adapter -n seldon-monitoring
    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq .
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/[NAMESPACE]/[API_RESOURCE_NAME]/[CR_NAME]/[METRIC_NAME]"
    import numpy as np
    import json
    import requests
    with open('./infer-data/test.npy', 'rb') as f:
        x_ref = np.load(f)
        x_h1 = np.load(f)
        y_ref = np.load(f)
        x_outlier = np.load(f)
    reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
    url = "http://0.0.0.0:9000/v2/models/model/infer"
    def infer(resourceName: str, batchSz: int, requestType: str):
        if requestType == "outlier":
            rows = x_outlier[0:0+batchSz]
        elif requestType == "drift":
            rows = x_h1[0:0+batchSz]
        else:
            rows = x_ref[0:0+batchSz]
        reqJson["inputs"][0]["data"] = rows.flatten().tolist()
        reqJson["inputs"][0]["shape"] = [batchSz, rows.shape[1]]
        headers = {"Content-Type": "application/json", "seldon-model":resourceName}
        response_raw = requests.post(url, json=reqJson, headers=headers)
        print(response_raw)
        print(response_raw.json())

Hodometer configuration options:

| Flag | Format | Example | Description |
| --- | --- | --- | --- |
| SCHEDULER_HOST | string | seldon-scheduler | Hostname for Seldon Core v2 scheduler |
| SCHEDULER_PORT | integer | 9004 | Port for Seldon Core v2 scheduler |
| METRICS_LEVEL | string | feature | Level of detail for recorded metrics; one of feature, resource, or cluster |
| EXTRA_PUBLISH_URLS | comma-separated list of URLs | http://<my-endpoint-1>:8000,http://<my-endpoint-2>:8000 | Additional endpoints to publish metrics to |
| LOG_LEVEL | string | info | Level of detail for application logs |

Metrics levels:

| Level | Description |
| --- | --- |
| Cluster | Basic information about the Seldon Core v2 installation |
| Resource | High-level information about which Seldon Core v2 resources are used |
| Feature | More detailed information about how resources are used and whether or not certain feature flags are enabled |

Helm values and their corresponding environment variables:

| Helm value | Environment variable |
| --- | --- |
| hodometer.metricsLevel | METRICS_LEVEL |
| hodometer.extraPublishUrls | EXTRA_PUBLISH_URLS |
| hodometer.logLevel | LOG_LEVEL |

List of metrics:

| Metric name | Level | Format | Notes |
| --- | --- | --- | --- |
| cluster_id | cluster | UUID | A random identifier for this cluster for de-duplication |
| seldon_core_version | cluster | Version number | E.g. 1.2.3 |
| is_global_installation | cluster | Boolean | Whether installation is global or namespaced |
| is_kubernetes | cluster | Boolean | Whether or not the installation is in Kubernetes |
| kubernetes_version | cluster | Version number | Kubernetes server version, if inside Kubernetes |
| node_count | cluster | Integer | Number of nodes in the cluster, if inside Kubernetes |
| model_count | resource | Integer | Number of Model resources |
| pipeline_count | resource | Integer | Number of Pipeline resources |
| experiment_count | resource | Integer | Number of Experiment resources |
| server_count | resource | Integer | Number of Server resources |
| server_replica_count | resource | Integer | Total number of Server resource replicas |
| multimodel_enabled_count | feature | Integer | Number of Server resources with multi-model serving enabled |
| overcommit_enabled_count | feature | Integer | Number of Server resources with overcommitting enabled |
| gpu_enabled_count | feature | Integer | Number of Server resources with GPUs attached |
| inference_server_name | feature | String | Name of inference server, e.g. MLServer or Triton |
| server_cpu_cores_sum | feature | Float | Total of CPU limits across all Server resource replicas, in cores |
| server_memory_gb_sum | feature | Float | Total of memory limits across all Server resource replicas, in GiB |

    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/pods/mlserver-0/infer_rps
    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/metrics/infer_rps
    cd samples/
    cat models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
      memory: 100Ki
    seldon model load -f models/sklearn-iris-gs.yaml
    cat pipelines/iris.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: iris-pipeline
    spec:
      steps:
        - name: iris
      output:
        steps:
        - iris
    seldon pipeline load -f pipelines/iris.yaml
    cat models/tfsimple1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    seldon model load -f models/tfsimple1.yaml
    cat pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    seldon pipeline load -f pipelines/tfsimple.yaml
    seldon model list
    seldon pipeline list
    seldon model infer iris '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq
    {
      "model_name": "iris_1",
      "model_version": "1",
      "id": "a67233c2-2f8c-4fbc-a87e-4e4d3d034c9f",
      "parameters": {
        "content_type": null,
        "headers": null
      },
      "outputs": [
        {
          "name": "predict",
          "shape": [
            1
          ],
          "datatype": "INT64",
          "parameters": null,
          "data": [
            2
          ]
        }
      ]
    }
    
    seldon pipeline infer iris-pipeline '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' |  jq
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2
          ],
          "name": "predict",
          "shape": [
            1
          ],
          "datatype": "INT64"
        }
      ]
    }
    
    seldon model infer tfsimple1 '{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq
    {
      "model_name": "tfsimple1_1",
      "model_version": "1",
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            1,
            16
          ],
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ]
        }
      ]
    }
seldon pipeline infer tfsimple '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    cat batch-inputs/iris-input.txt | head -n 1 | jq
    {
      "inputs": [
        {
          "name": "predict",
          "data": [
            0.38606369295833043,
            0.006894049558299753,
            0.6104082981607108,
            0.3958954239450676
          ],
          "datatype": "FP64",
          "shape": [
            1,
            4
          ]
        }
      ]
    }
    
    pip install mlserver
    mlserver infer -u localhost:9000 -m iris -i batch-inputs/iris-input.txt -o /tmp/iris-output.txt --workers 5
    2023-01-22 18:24:17,272 [mlserver] INFO - Using asyncio event-loop policy: uvloop
    2023-01-22 18:24:17,273 [mlserver] INFO - server url: localhost:9000
    2023-01-22 18:24:17,273 [mlserver] INFO - model name: iris
    2023-01-22 18:24:17,273 [mlserver] INFO - request headers: {}
    2023-01-22 18:24:17,273 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
    2023-01-22 18:24:17,273 [mlserver] INFO - output file path: /tmp/iris-output.txt
    2023-01-22 18:24:17,273 [mlserver] INFO - workers: 5
    2023-01-22 18:24:17,273 [mlserver] INFO - retries: 3
    2023-01-22 18:24:17,273 [mlserver] INFO - batch interval: 0.0
    2023-01-22 18:24:17,274 [mlserver] INFO - batch jitter: 0.0
    2023-01-22 18:24:17,274 [mlserver] INFO - connection timeout: 60
    2023-01-22 18:24:17,274 [mlserver] INFO - micro-batch size: 1
    2023-01-22 18:24:17,420 [mlserver] INFO - Finalizer: processed instances: 100
    2023-01-22 18:24:17,421 [mlserver] INFO - Total processed instances: 100
    2023-01-22 18:24:17,421 [mlserver] INFO - Time taken: 0.15 seconds
    
    cat /tmp/iris-output.txt | head -n 1 | jq
    mlserver infer -u localhost:9000 -m iris-pipeline.pipeline -i batch-inputs/iris-input.txt -o /tmp/iris-pipeline-output.txt --workers 5
    2023-01-22 18:25:18,651 [mlserver] INFO - Using asyncio event-loop policy: uvloop
    2023-01-22 18:25:18,653 [mlserver] INFO - server url: localhost:9000
    2023-01-22 18:25:18,653 [mlserver] INFO - model name: iris-pipeline.pipeline
    2023-01-22 18:25:18,653 [mlserver] INFO - request headers: {}
    2023-01-22 18:25:18,653 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
    2023-01-22 18:25:18,653 [mlserver] INFO - output file path: /tmp/iris-pipeline-output.txt
    2023-01-22 18:25:18,653 [mlserver] INFO - workers: 5
    2023-01-22 18:25:18,653 [mlserver] INFO - retries: 3
    2023-01-22 18:25:18,653 [mlserver] INFO - batch interval: 0.0
    2023-01-22 18:25:18,653 [mlserver] INFO - batch jitter: 0.0
    2023-01-22 18:25:18,653 [mlserver] INFO - connection timeout: 60
    2023-01-22 18:25:18,653 [mlserver] INFO - micro-batch size: 1
    2023-01-22 18:25:18,963 [mlserver] INFO - Finalizer: processed instances: 100
    2023-01-22 18:25:18,963 [mlserver] INFO - Total processed instances: 100
    2023-01-22 18:25:18,963 [mlserver] INFO - Time taken: 0.31 seconds
    cat /tmp/iris-pipeline-output.txt | head -n 1 | jq
    cat batch-inputs/tfsimple-input.txt | head -n 1 | jq
    {
      "inputs": [
        {
          "name": "INPUT0",
          "data": [
            75,
            39,
            9,
            44,
            32,
            97,
            99,
            40,
            13,
            27,
            25,
            36,
            18,
            77,
            62,
            60
          ],
          "datatype": "INT32",
          "shape": [
            1,
            16
          ]
        },
        {
          "name": "INPUT1",
          "data": [
            39,
            7,
            14,
            58,
            13,
            88,
            98,
            66,
            97,
            57,
            49,
            3,
            49,
            63,
            37,
            12
          ],
          "datatype": "INT32",
          "shape": [
            1,
            16
          ]
        }
      ]
    }
    mlserver infer -u localhost:9000 -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-output.txt --workers 10
    
    2023-01-23 14:56:10,870 [mlserver] INFO - Using asyncio event-loop policy: uvloop
    2023-01-23 14:56:10,872 [mlserver] INFO - server url: localhost:9000
    2023-01-23 14:56:10,872 [mlserver] INFO - model name: tfsimple1
    2023-01-23 14:56:10,872 [mlserver] INFO - request headers: {}
    2023-01-23 14:56:10,872 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
    2023-01-23 14:56:10,872 [mlserver] INFO - output file path: /tmp/tfsimple-output.txt
    2023-01-23 14:56:10,872 [mlserver] INFO - workers: 10
    2023-01-23 14:56:10,872 [mlserver] INFO - retries: 3
    2023-01-23 14:56:10,872 [mlserver] INFO - batch interval: 0.0
    2023-01-23 14:56:10,872 [mlserver] INFO - batch jitter: 0.0
    2023-01-23 14:56:10,872 [mlserver] INFO - connection timeout: 60
    2023-01-23 14:56:10,872 [mlserver] INFO - micro-batch size: 1
    2023-01-23 14:56:11,077 [mlserver] INFO - Finalizer: processed instances: 100
    2023-01-23 14:56:11,077 [mlserver] INFO - Total processed instances: 100
    2023-01-23 14:56:11,078 [mlserver] INFO - Time taken: 0.21 seconds
    cat /tmp/tfsimple-output.txt | head -n 1 | jq
    {
      "model_name": "tfsimple1_1",
      "model_version": "1",
      "id": "54e6c237-8356-4c3c-96b5-2dca4596dbe9",
      "parameters": {
        "batch_index": 0,
        "inference_id": "54e6c237-8356-4c3c-96b5-2dca4596dbe9"
      },
      "outputs": [
        {
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32",
          "parameters": {},
          "data": [
            114,
            46,
            23,
            102,
            45,
            185,
            197,
            106,
            110,
            84,
            74,
            39,
            67,
            140,
            99,
            72
          ]
        },
        {
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32",
          "parameters": {},
          "data": [
            36,
            32,
            -5,
            -14,
            19,
            9,
            1,
            -26,
            -84,
            -30,
            -24,
            33,
            -31,
            14,
            25,
            48
          ]
        }
      ]
    }
mlserver infer -u localhost:9000 -m tfsimple.pipeline -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-pipeline-output.txt --workers 10
    2023-01-23 14:56:10,870 [mlserver] INFO - Using asyncio event-loop policy: uvloop
    2023-01-23 14:56:10,872 [mlserver] INFO - server url: localhost:9000
2023-01-23 14:56:10,872 [mlserver] INFO - model name: tfsimple.pipeline
    2023-01-23 14:56:10,872 [mlserver] INFO - request headers: {}
    2023-01-23 14:56:10,872 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
    2023-01-23 14:56:10,872 [mlserver] INFO - output file path: /tmp/tfsimple-pipeline-output.txt
    2023-01-23 14:56:10,872 [mlserver] INFO - workers: 10
    2023-01-23 14:56:10,872 [mlserver] INFO - retries: 3
    2023-01-23 14:56:10,872 [mlserver] INFO - batch interval: 0.0
    2023-01-23 14:56:10,872 [mlserver] INFO - batch jitter: 0.0
    2023-01-23 14:56:10,872 [mlserver] INFO - connection timeout: 60
    2023-01-23 14:56:10,872 [mlserver] INFO - micro-batch size: 1
    2023-01-23 14:56:11,077 [mlserver] INFO - Finalizer: processed instances: 100
    2023-01-23 14:56:11,077 [mlserver] INFO - Total processed instances: 100
    2023-01-23 14:56:11,078 [mlserver] INFO - Time taken: 0.25 seconds
    
    cat /tmp/tfsimple-pipeline-output.txt | head -n 1 | jq
    seldon model unload iris
    seldon model unload tfsimple1
    seldon pipeline unload iris-pipeline
    seldon pipeline unload tfsimple
    cd ../ && make undeploy-local
    helm install seldon-v2-runtime k8s/helm-charts/seldon-core-v2-runtime \
      --namespace seldon-mesh \
      --set hodometer.disable=true
    cat ../../models/income-preprocess.yaml
    echo "---"
    cat ../../models/income.yaml
    echo "---"
    cat ../../models/income-drift.yaml
    echo "---"
    cat ../../models/income-outlier.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-preprocess
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/preprocessor"
      requirements:
      - sklearn
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
      requirements:
      - sklearn
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-drift
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/drift-detector"
      requirements:
        - mlserver
        - alibi-detect
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-outlier
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/outlier-detector"
      requirements:
        - mlserver
        - alibi-detect
    
    kubectl apply -f ../../models/income-preprocess.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/income.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/income-drift.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/income-outlier.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/income-preprocess created
    model.mlops.seldon.io/income created
    model.mlops.seldon.io/income-drift created
    model.mlops.seldon.io/income-outlier created
    kubectl wait --for condition=ready --timeout=300s model income-preprocess -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model income -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model income-drift -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model income-outlier -n ${NAMESPACE}
    model.mlops.seldon.io/income-preprocess condition met
    model.mlops.seldon.io/income condition met
    model.mlops.seldon.io/income-drift condition met
    model.mlops.seldon.io/income-outlier condition met
    seldon model load -f ../../models/income-preprocess.yaml
    seldon model load -f ../../models/income.yaml
    seldon model load -f ../../models/income-drift.yaml
    seldon model load -f ../../models/income-outlier.yaml
    {}
    {}
    {}
    {}
    seldon model status income-preprocess -w ModelAvailable | jq .
    seldon model status income -w ModelAvailable | jq .
    seldon model status income-drift -w ModelAvailable | jq .
    seldon model status income-outlier -w ModelAvailable | jq .
    {}
    {}
    {}
    {}
    cat ../../pipelines/income.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: income-production
    spec:
      steps:
        - name: income
        - name: income-preprocess
        - name: income-outlier
          inputs:
          - income-preprocess
        - name: income-drift
          batch:
            size: 20
      output:
        steps:
        - income
        - income-outlier.outputs.is_outlier
    
    kubectl apply -f ../../pipelines/income.yaml -n ${NAMESPACE}
pipeline.mlops.seldon.io/income-production created
kubectl wait --for condition=ready --timeout=300s pipelines income-production -n ${NAMESPACE}
pipeline.mlops.seldon.io/income-production condition met
    seldon pipeline load -f ../../pipelines/income.yaml
    seldon pipeline status income-production -w PipelineReady | jq -M .
    {
      "pipelineName": "income-production",
      "versions": [
        {
          "pipeline": {
            "name": "income-production",
            "uid": "cifej8iufmbc73e5int0",
            "version": 1,
            "steps": [
              {
                "name": "income"
              },
              {
                "name": "income-drift",
                "batch": {
                  "size": 20
                }
              },
              {
                "name": "income-outlier",
                "inputs": [
                  "income-preprocess.outputs"
                ]
              },
              {
                "name": "income-preprocess"
              }
            ],
            "output": {
              "steps": [
                "income.outputs",
                "income-outlier.outputs.is_outlier"
              ]
            },
            "kubernetesMeta": {}
          },
          "state": {
            "pipelineVersion": 1,
            "status": "PipelineReady",
            "reason": "created pipeline",
            "lastChangeTimestamp": "2023-06-30T14:41:38.343754921Z",
            "modelsReady": true
          }
        }
      ]
    }
    
    batchSz=20
    print(y_ref[0:batchSz])
    infer("income-production.pipeline",batchSz,"normal")
    [0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
    <Response [200]>
    {'model_name': '', 'outputs': [{'data': [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
    
    seldon pipeline inspect income-production.income-drift.outputs.is_drift
    seldon.default.model.income-drift.outputs	cifej9gfh5ss738i5br0	{"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
    
    batchSz=20
    print(y_ref[0:batchSz])
    infer("income-production.pipeline",batchSz,"drift")
    [0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
    <Response [200]>
    {'model_name': '', 'outputs': [{'data': [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
    
    seldon pipeline inspect income-production.income-drift.outputs.is_drift
    seldon.default.model.income-drift.outputs	cifejaofh5ss738i5brg	{"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["1"]}}
    
    batchSz=20
    print(y_ref[0:batchSz])
    infer("income-production.pipeline",batchSz,"outlier")
    [0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
    <Response [200]>
    {'model_name': '', 'outputs': [{'data': [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
    
    seldon pipeline inspect income-production.income-drift.outputs.is_drift
    seldon.default.model.income-drift.outputs	cifejb8fh5ss738i5bs0	{"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
    
    cat ../../models/income-explainer.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-explainer
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
      explainer:
        type: anchor_tabular
        modelRef: income
    
    kubectl apply -f ../../models/income-explainer.yaml -n ${NAMESPACE}
model.mlops.seldon.io/income-explainer created
kubectl wait --for condition=ready --timeout=300s model income-explainer -n ${NAMESPACE}
model.mlops.seldon.io/income-explainer condition met
    seldon model load -f ../../models/income-explainer.yaml
    {}
    seldon model status income-explainer -w ModelAvailable | jq .
    {}
    batchSz=1
    print(y_ref[0:batchSz])
    infer("income-explainer",batchSz,"normal")
    [0]
    <Response [200]>
    {'model_name': 'income-explainer_1', 'model_version': '1', 'id': 'cdd68ba5-c569-4930-886f-fbdc26e24866', 'parameters': {}, 'outputs': [{'name': 'explanation', 'shape': [1, 1], 'datatype': 'BYTES', 'parameters': {'content_type': 'str'}, 'data': ['{"meta": {"name": "AnchorTabular", "type": ["blackbox"], "explanations": ["local"], "params": {"seed": 1, "disc_perc": [25, 50, 75], "threshold": 0.95, "delta": 0.1, "tau": 0.15, "batch_size": 100, "coverage_samples": 10000, "beam_size": 1, "stop_on_first": false, "max_anchor_size": null, "min_samples_start": 100, "n_covered_ex": 10, "binary_cache_size": 10000, "cache_margin": 1000, "verbose": false, "verbose_every": 1, "kwargs": {}}, "version": "0.9.1"}, "data": {"anchor": ["Marital Status = Never-Married", "Relationship = Own-child", "Capital Gain <= 0.00"], "precision": 0.9942028985507246, "coverage": 0.0657, "raw": {"feature": [3, 5, 8], "mean": [0.7914951989026063, 0.9400749063670412, 0.9942028985507246], "precision": [0.7914951989026063, 0.9400749063670412, 0.9942028985507246], "coverage": [0.3043, 0.069, 0.0657], "examples": [{"covered_true": [[30, 0, 1, 1, 0, 1, 1, 0, 0, 0, 50, 2], [49, 4, 2, 1, 6, 0, 4, 1, 0, 0, 60, 9], [39, 2, 5, 1, 5, 0, 4, 1, 0, 0, 40, 9], [33, 4, 2, 1, 5, 0, 4, 1, 0, 0, 40, 9], [63, 4, 1, 1, 8, 1, 4, 0, 0, 0, 40, 9], [23, 4, 1, 1, 7, 1, 4, 1, 0, 0, 66, 8], [45, 4, 1, 1, 8, 0, 1, 1, 0, 0, 40, 1], [54, 4, 1, 1, 8, 4, 4, 1, 0, 0, 45, 9], [32, 6, 1, 1, 8, 4, 2, 0, 0, 0, 30, 9], [40, 5, 1, 1, 2, 0, 4, 1, 0, 0, 40, 9]], "covered_false": [[57, 4, 5, 1, 5, 0, 4, 1, 0, 1977, 45, 9], [53, 0, 5, 1, 0, 1, 4, 0, 8614, 0, 35, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [53, 4, 5, 1, 8, 0, 4, 1, 0, 1977, 55, 9], [35, 4, 1, 1, 8, 0, 4, 1, 7688, 0, 50, 9], [32, 4, 1, 1, 5, 1, 4, 1, 0, 0, 40, 9], [42, 4, 1, 1, 5, 0, 4, 1, 99999, 0, 40, 9], [32, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 50, 9], [53, 7, 5, 1, 8, 0, 4, 1, 0, 0, 42, 9], [52, 1, 1, 1, 8, 0, 4, 1, 0, 0, 45, 9]], "uncovered_true": [], "uncovered_false": []}, {"covered_true": [[52, 7, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [27, 4, 1, 1, 8, 3, 4, 1, 0, 0, 40, 9], [28, 4, 1, 1, 6, 3, 4, 1, 0, 0, 60, 9], [46, 6, 5, 1, 2, 3, 4, 1, 0, 0, 50, 9], [53, 2, 5, 1, 5, 3, 2, 0, 0, 1669, 35, 9], [27, 4, 5, 1, 8, 3, 4, 0, 0, 0, 40, 9], [25, 4, 1, 1, 8, 3, 4, 0, 0, 0, 40, 9], [29, 6, 5, 1, 2, 3, 4, 1, 0, 0, 30, 9], [64, 0, 1, 1, 0, 3, 4, 1, 0, 0, 50, 9], [63, 0, 5, 1, 0, 3, 4, 1, 0, 0, 30, 9]], "covered_false": [[50, 5, 1, 1, 8, 3, 4, 1, 15024, 0, 60, 9], [45, 6, 1, 1, 6, 3, 4, 1, 14084, 0, 45, 9], [37, 4, 1, 1, 8, 3, 4, 1, 15024, 0, 40, 9], [33, 4, 1, 1, 8, 3, 4, 1, 15024, 0, 60, 9], [41, 6, 5, 1, 8, 3, 4, 1, 7298, 0, 70, 9], [42, 6, 1, 1, 2, 3, 4, 1, 15024, 0, 60, 9]], "uncovered_true": [], "uncovered_false": []}, {"covered_true": [[41, 4, 1, 1, 1, 3, 4, 1, 0, 0, 40, 9], [55, 2, 5, 1, 8, 3, 4, 1, 0, 0, 50, 9], [35, 4, 5, 1, 5, 3, 4, 0, 0, 0, 32, 9], [31, 4, 1, 1, 2, 3, 4, 1, 0, 0, 40, 9], [47, 4, 1, 1, 1, 3, 4, 1, 0, 0, 40, 9], [33, 4, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [58, 0, 1, 1, 0, 3, 4, 0, 0, 0, 50, 9], [44, 6, 1, 1, 2, 3, 4, 1, 0, 0, 90, 9], [30, 4, 1, 1, 6, 3, 4, 1, 0, 0, 40, 9], [25, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9]], "covered_false": [], "uncovered_true": [], "uncovered_false": []}], "all_precision": 0, "num_preds": 1000000, "success": true, "names": ["Marital Status = Never-Married", "Relationship = Own-child", "Capital Gain <= 0.00"], "prediction": [0], "instance": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], "instances": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 
40.0, 9.0]]}}}']}]}
    
kubectl delete -f ../../pipelines/income.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/income-preprocess.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/income.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/income-drift.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/income-outlier.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/income-explainer.yaml -n ${NAMESPACE}
    seldon pipeline unload income-production
    seldon model unload income-preprocess
    seldon model unload income
    seldon model unload income-drift
    seldon model unload income-outlier
    seldon model unload income-explainer
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: ServerConfig
    metadata:
      name: mlserver
    spec:
      podSpec:
        terminationGracePeriodSeconds: 120
        serviceAccountName: seldon-server
        containers:
        - image: rclone:latest
          imagePullPolicy: IfNotPresent
          name: rclone
          command:
            - rclone
          args:
            - rcd
            - --rc-no-auth
            - --config=/rclone/rclone.conf
            - --rc-addr=0.0.0.0:5572
            - --max-buffer-memory=$(MAX_BUFFER_MEMORY)
          env:
            - name: MAX_BUFFER_MEMORY
              value: "64M"
          ports:
          - containerPort: 5572
            name: rclone
            protocol: TCP
          lifecycle:
            preStop:
              httpGet:
                port: 9007
                path: terminate
          resources:
            limits:
              memory: '256M'
            requests:
              cpu: "200m"
              memory: '100M'
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            tcpSocket:
              port: 5572
            timeoutSeconds: 1
          livenessProbe:
            failureThreshold: 1
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            exec:
              command:
                - rclone
                - rc
                - rc/noop
            timeoutSeconds: 1
          volumeMounts:
          - mountPath: /mnt/agent
            name: mlserver-models
        - image: agent:latest
          imagePullPolicy: IfNotPresent
          command:
            - /bin/agent
          args:
            - --tracing-config-path=/mnt/tracing/tracing.json
          name: agent
          env:
          - name: SELDON_SERVER_CAPABILITIES
            value: "mlserver,alibi-detect,alibi-explain,huggingface,lightgbm,mlflow,python,sklearn,spark-mlib,xgboost"
          - name: SELDON_OVERCOMMIT_PERCENTAGE
            value: "10"
          - name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD
            value: "30"
          - name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD
            value: "600"
          - name: SELDON_SCALING_STATS_PERIOD_SECONDS
            value: "20"
          - name: SELDON_SERVER_HTTP_PORT
            value: "9000"
          - name: SELDON_SERVER_GRPC_PORT
            value: "9500"
          - name: SELDON_REVERSE_PROXY_HTTP_PORT
            value: "9001"
          - name: SELDON_REVERSE_PROXY_GRPC_PORT
            value: "9501"
          - name: SELDON_SCHEDULER_HOST
            value: "seldon-scheduler"
          - name: SELDON_SCHEDULER_PORT
            value: "9005"
          - name: SELDON_SCHEDULER_TLS_PORT
            value: "9055"
          - name: SELDON_METRICS_PORT
            value: "9006"
          - name: SELDON_DRAINER_PORT
            value: "9007"
          - name: SELDON_READINESS_PORT
            value: "9008"
          - name: AGENT_TLS_SECRET_NAME
            value: ""
          - name: AGENT_TLS_FOLDER_PATH
            value: ""
          - name: SELDON_SERVER_TYPE
            value: "mlserver"
          - name: SELDON_ENVOY_HOST
            value: "seldon-mesh"
          - name: SELDON_ENVOY_PORT
            value: "80"
          - name: SELDON_LOG_LEVEL
            value: "warn"
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: MEMORY_REQUEST
            valueFrom:
              resourceFieldRef:
                containerName: mlserver
                resource: requests.memory
          - name: SELDON_USE_DEPLOYMENTS_FOR_SERVERS
            value: "false"
          ports:
          - containerPort: 9501
            name: grpc
            protocol: TCP
          - containerPort: 9001
            name: http
            protocol: TCP
          - containerPort: 9006
            name: metrics
            protocol: TCP
          - containerPort: 9008
            name: readiness-port
          lifecycle:
            preStop:
              httpGet:
                port: 9007
                path: terminate
          readinessProbe:
            httpGet:
              path: /ready
              port: 9008
            failureThreshold: 1
            periodSeconds: 5
          startupProbe:
            httpGet:
              path: /ready
              port: 9008
            failureThreshold: 60
            periodSeconds: 15
          resources:
            requests:
              cpu: "500m"
              memory: '500M'
          volumeMounts:
          - mountPath: /mnt/agent
            name: mlserver-models
          - name: config-volume
            mountPath: /mnt/config
          - name: tracing-config-volume
            mountPath: /mnt/tracing
        - image: mlserver:latest
          imagePullPolicy: IfNotPresent
          env:
          - name: MLSERVER_HTTP_PORT
            value: "9000"
          - name: MLSERVER_GRPC_PORT
            value: "9500"
          - name: MLSERVER_MODELS_DIR
            value: "/mnt/agent/models"
          - name: MLSERVER_PARALLEL_WORKERS
            value: "1"
          - name: MLSERVER_LOAD_MODELS_AT_STARTUP
            value: "false"
          - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
            value: "1048576000" # 100MB (100 * 1024 * 1024)
          resources:
            requests:
              cpu: 1
              memory: '1G'
          lifecycle:
            preStop:
              httpGet:
                port: 9007
                path: terminate
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: server-http
          readinessProbe:
            httpGet:
              path: /v2/health/live
              port: server-http
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe:
            httpGet:
              path: /v2/health/live
              port: server-http
            failureThreshold: 10
            periodSeconds: 10
          name: mlserver
          ports:
          - containerPort: 9500
            name: server-grpc
            protocol: TCP
          - containerPort: 9000
            name: server-http
            protocol: TCP
          - containerPort: 8082
            name: server-metrics
          volumeMounts:
          - mountPath: /mnt/agent
            name: mlserver-models
            readOnly: true
          - mountPath: /mnt/certs
            name: downstream-ca-certs
            readOnly: true
        securityContext:
          fsGroup: 2000
          runAsUser: 1000
          runAsNonRoot: true
        volumes:
        - name: config-volume
          configMap:
            name: seldon-agent
        - name: tracing-config-volume
          configMap:
            name: seldon-tracing
        - name: downstream-ca-certs
          secret:
            secretName: seldon-downstream-server
            optional: true
      volumeClaimTemplates:
      - name: mlserver-models
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi
    


The client ID, client secret, and token endpoint URL should come from your identity provider, such as Keycloak or Azure AD.

    Values in Seldon Core 2

    In Seldon Core 2 you need to specify these values:

    • security.kafka.sasl.mechanism - set to OAUTHBEARER

    • security.kafka.sasl.client.secret - Created secret with client credentials

    • security.kafka.ssl.client.brokerValidationSecret - Certificate Authority of Kafka brokers

    The resulting set of values in components-values.yaml is similar to:

The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use a well-known Certificate Authority such as Let's Encrypt.

    A root certificate, referred to as ca.crt.

These certificates are expected to be encoded as PEM certificates and are provided through a secret, which can be created in the namespace seldon:

This secret must be present in the seldon-logs namespace and in every namespace containing a Seldon Core 2 runtime.

Ensure that the fields used within the secret follow the same naming convention: tls.crt, tls.key and ca.crt. Otherwise, Seldon Core 2 may not be able to find the correct set of certificates.
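For reference, a declarative equivalent of such a secret might look like the following sketch (placeholder PEM contents; the kubectl create secret command shown later on this page produces the same tls.crt, tls.key and ca.crt keys from files):

apiVersion: v1
kind: Secret
metadata:
  name: kafka-client-tls
  namespace: seldon
type: Opaque
stringData:
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    ...
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    ...
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    ...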

Reference these certificates within the corresponding Helm values for the Seldon Core 2 installation.

Values for Seldon Core 2

In Seldon Core 2 you need to specify these values:

    • security.kafka.ssl.client.secret - Secret name containing client certificates

    • security.kafka.ssl.client.brokerValidationSecret - Certificate Authority of Kafka Brokers

    The resulting set of values in values.yaml is similar to:

The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use a well-known Certificate Authority such as Let's Encrypt.

, and bootstrap.servers from the configuration file.

Creating secrets for Seldon Core 2

These are the steps that you need to perform in the Kubernetes cluster that runs Seldon Core 2 to store the SASL password.

1. Create a secret named confluent-kafka-sasl for Seldon Core 2 in the namespace seldon. In the following command make sure to replace <password> with the value of the Secret that you generated in Confluent Cloud.

    Obtain these details from Confluent Cloud:
    • Cluster ID: Cluster Overview → Cluster Settings → General → Identification

    • Identity Pool ID: Accounts & access → Identity providers → .

  • Obtain these details from your identity provider, such as Keycloak or Azure AD.

    • Client ID

    • Client secret

    • Token Endpoint URL

  • If you are using Azure AD you may need to set scope: api://<client id>/.default.

    Creating Kubernetes secret

1. Create a Kubernetes secret to store the required client credentials. For example, create a kafka-secret.yaml file by replacing the values of <client id>, <client secret>, <token endpoint url>, <cluster id>, <identity pool id> with the values that you obtained from Confluent Cloud and your identity provider.

2. Apply the secret named confluent-kafka-oauth in the seldon namespace so that it can be used to configure Seldon Core 2.

  This secret must be present in the seldon-logs namespace and in every namespace containing a Seldon Core 2 runtime.

  • replace <username> with the value of the Key that you generated in Confluent Cloud.
  • replace <confluent-endpoints> with the value of bootstrap.server that you generated in Confluent Cloud.

  • Update the initial configuration for Seldon Core 2 Operator in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:

    replace <confluent-endpoints> with the value of bootstrap.server that you generated in Confluent Cloud.

    Update the initial configuration for Seldon Core 2 Operator in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:


agent.modelInferenceLagThreshold (components chart): Queue lag threshold to trigger scaling up of a model replica. Default: 30

agent.modelInactiveSecondsThreshold (components chart): Period with no activity after which to trigger scaling down of a model replica. Default: 600

autoscaling.serverPackingEnabled (components chart): Whether packing of models onto fewer servers is enabled. Default: false

autoscaling.serverPackingPercentage (components chart): Percentage of events where packing is allowed. Higher values represent more aggressive packing. Only used when serverPackingEnabled is set. Range is from 0.0 to 1.0. Default: 0.0

agent.maxUnloadElapsedTimeMinutes (components chart): Max time allowed for one model unload command for a model on a particular server replica to take. Lower values allow errors to be exposed faster. Default: 15

agent.maxUnloadRetryCount (components chart): Max number of retries for an unsuccessful unload command for a model on a particular server replica. Lower values allow control plane commands to fail faster. Default: 5

agent.unloadGracePeriodSeconds (components chart): A period guarding against race conditions between Envoy actually applying the cluster change to remove a route and proceeding with the model replica unloading command. Default: 2

dataflow.logLevelKafka (components chart): check klogging level

scheduler.logLevel (components chart): check logrus log level

modelgateway.logLevel (components chart): check logrus log level

pipelinegateway.logLevel (components chart): check logrus log level

hodometer.logLevel (components chart): check logrus log level

serverConfig.rclone.logLevel (components chart): check rclone log-level

serverConfig.agent.logLevel (components chart): check logrus log level

envoy.preStopSleepPeriodSeconds (components chart): Sleep after calling the prestop command. Default: 30

envoy.terminationGracePeriodSeconds (components chart): Grace period to wait for prestop to finish for Envoy pods. Default: 120

envoy.enableAccesslog (components chart): Whether to enable logging of requests. Default: true

envoy.accesslogPath (components chart): Path on disk to store the logfile. Only used when enableAccesslog is set. Default: /tmp/envoy-accesslog.txt

envoy.includeSuccessfulRequests (components chart): Whether to include successful requests. If set to false, only failed requests are logged. Only used when enableAccesslog is set. Default: false

autoscaling.autoscalingModelEnabled (components chart): Enable native autoscaling for Models. This is orthogonal to external autoscaling services e.g. HPA. Default: false

autoscaling.autoscalingServerEnabled (components chart): Enable native autoscaling for Servers. This is orthogonal to external autoscaling services e.g. HPA. Default: true

agent.scalingStatsPeriodSeconds (components chart): Sampling rate for metrics used for autoscaling. Default: 20

serverConfig.terminationGracePeriodSeconds (components chart): Grace period to wait for the prestop process to finish for this particular Server pod. Default: 120

agent.overcommitPercentage (components chart): Overcommit percentage (of memory) allowed. Range is from 0 to 100. Default: 10

agent.maxLoadElapsedTimeMinutes (components chart): Max time allowed for one model load command for a model on a particular server replica to take. Lower values allow errors to be exposed faster. Default: 120

agent.maxLoadRetryCount (components chart): Max number of retries for an unsuccessful load command for a model on a particular server replica. Lower values allow control plane commands to fail faster. Default: 5

logging.logLevel (components chart): Components-wide setting for the logging level, used if individual component levels are not set. Options are: debug, info, error. Default: info

controller.logLevel (components chart): check zap log level

dataflow.logLevel (components chart): check klogging level
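Assuming these parameters map directly onto nested keys in the components chart values file (for example the components-values.yaml used elsewhere on this page), a few of them could be overridden as in the following sketch (values are illustrative):

logging:
  logLevel: info
agent:
  overcommitPercentage: 10
  scalingStatsPeriodSeconds: 20
autoscaling:
  autoscalingServerEnabled: true
  serverPackingEnabled: false
envoy:
  enableAccesslog: true
  accesslogPath: /tmp/envoy-accesslog.txt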

    Model Autoscaling with HPA

    Configuring HPA manifests

Once metrics for custom autoscaling are configured, Kubernetes resources, including Models, can be autoscaled using HPA by applying an HPA manifest that targets the chosen scaling metric.

Consider a model named irisa0 with the following manifest. Note that we don't set minReplicas/maxReplicas, in order to disable the Seldon inference-lag-based autoscaling so that it doesn't interact with HPA (separate minReplicas/maxReplicas settings will be configured on the HPA side).

    Scalability of Pipelines

    Core 2 supports full horizontal scaling for the dataflow engine, model gateway, and pipeline gateway. Each service automatically distributes pipelines or models across replicas using consistent hashing, so you don’t need to manually assign workloads.

    This guide explains how scaling works, what configuration controls it, and what happens when replicas or pipelines/models change.

    1. How scaling works (at a glance)

    Inference Server

    Learn how to perform model inference in Seldon Core using REST and gRPC protocols, including request/response formats and client examples.

    This section will discuss how to make inference calls against your Seldon models or pipelines.

You can make synchronous inference requests via REST or gRPC, or asynchronous requests via Kafka topics. The content of your request should be an Open Inference Protocol (V2) payload:

    • REST payloads will generally be in the JSON v2 protocol format.

    • gRPC and Kafka payloads must be in the Protobuf v2 protocol format.

    Server Live

    get

The “server live” API indicates if the inference server is able to receive and respond to metadata and inference requests. The “server live” API can be used directly to implement the Kubernetes livenessProbe.

    Server Ready

    get

    The “server ready” health API indicates if all the models are ready for inferencing. The “server ready” health API can be used directly to implement the Kubernetes readinessProbe.

GET /v2/health/ready

Responses: 200 OK (no content)

    Model Ready

    get

The “model ready” health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies. The model readiness endpoint reports that an individual model is loaded and ready to serve. It is intended only to give customers visibility into the model’s state, and is not intended to be used as a Kubernetes readiness probe for the MLServer container. Using a model-specific health endpoint as the container readiness probe can cause a deadlock with the current implementation, because:

    • the Seldon agent does not begin the model download until the Pod’s IP is visible in endpoints;

    • the IP of the Pod is only published after the Pod is Ready, or all internal readiness checks have passed;

    • the MLServer container only becomes Ready once the model is loaded.

    This would result in the agent never downloading the model and the Pod never becoming Ready. For container-level readiness checks we recommend the server-level readiness endpoints. These indicate that the MLServer process is up and accepting health checks, and do not deadlock the agent/model loading flow.
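For instance, a container-level readiness probe along the following lines points at a server-level endpoint instead of a model-specific one (a minimal sketch; the port name and timings are illustrative, and the bundled ServerConfig shown earlier on this page uses /v2/health/live for the same purpose):

readinessProbe:
  httpGet:
    path: /v2/health/ready
    port: server-http
  initialDelaySeconds: 5
  periodSeconds: 5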

    Server Metadata

    get

    The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint.

GET /v2

Responses: 200 OK (application/json), 400 Bad Request (application/json)

    Model Metadata

    get

    The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

Path parameters: model_name (string, required), model_version (string, required)

    Model Metadata

    get

    The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

Path parameters: model_name (string, required)

Responses: 200 OK (application/json), 404 Not Found (application/json)

    Model Inference

    post

    An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

Path parameters: model_name (string, required), model_version (string, required)

    Model Inference

    post

    An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

Path parameters: model_name (string, required)

Body
    apiVersion: v1
    kind: Secret
    metadata:
      name: confluent-kafka-oauth
    type: Opaque
    stringData:
      method: OIDC
      client_id: <client id>
      client_secret: <client secret>
      token_endpoint_url: <token endpoint url>
      extensions: logicalCluster=<cluster id>,identityPoolId=<identity pool id>
      scope: ""
    kubectl create secret generic kafka-sasl-secret --from-literal password=<kafka-password> -n seldon-mesh
    security:
      kafka:
        protocol: SASL_SSL
        sasl:
          mechanism: SCRAM-SHA-512
          client:
            username: <kafka-username>      # TODO: Replace with your Kafka username
            secret: kafka-sasl-secret       # NOTE: Secret name from previous step
        ssl:
          client:
            secret:                                   # NOTE: Leave empty
            brokerValidationSecret: kafka-broker-tls  # NOTE: Optional
    apiVersion: v1
    kind: Secret
    metadata:
      name: kafka-oauth
    type: Opaque
    stringData:
      method: OIDC
      client_id: <client id>
      client_secret: <client secret>
      token_endpoint_url: <token endpoint url>
      extensions: ""
      scope: ""
    kubectl apply -f kafka-oauth.yaml -n seldon-mesh
    Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXX
kubectl create secret generic azure-kafka-secret --from-literal password="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXX" -n seldon
kubectl create secret generic azure-kafka-secret --from-literal password="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXX" -n seldon-system
    controller:
      clusterwide: true
    
    dataflow:
      resources:
        cpu: 500m
    
    envoy:
      service:
        type: ClusterIP
    
    kafka:
      bootstrap: <namespace>.servicebus.windows.net:9093
      topics:
        replicationFactor: 3
        numPartitions: 4    
    security:
      kafka:
        protocol: SASL_SSL
        sasl:
          mechanism: "PLAIN"
          client:
            username: $ConnectionString
            secret: azure-kafka-secret
        ssl:
          client:
            secret:
            brokerValidationSecret:
            
    opentelemetry:
      enable: false
    
    scheduler:
      service:
        type: ClusterIP
    
    serverConfig:
      mlserver:
        resources:
          cpu: 1
          memory: 2Gi
    
      triton:
        resources:
          cpu: 1
          memory: 2Gi
    
    serviceGRPCPrefix: "http2-"
    security:
      kafka:
        protocol: SASL_SSL
        sasl:
          mechanism: OAUTHBEARER
          client:
              secret: kafka-oauth                     # NOTE: Secret name from earlier step
        ssl:
          client:
            secret:                                   # NOTE: Leave empty
            brokerValidationSecret: kafka-broker-tls  # NOTE: Optional
    kubectl create secret generic kafka-client-tls -n seldon \
      --from-file ./tls.crt \
      --from-file ./tls.key \
      --from-file ./ca.crt
    security:
      kafka:
        protocol: SSL
        ssl:
          client:
            secret: kafka-client-tls                  # NOTE: Secret name from earlier step
            brokerValidationSecret: kafka-broker-tls  # NOTE: Optional
    kubectl create secret generic confluent-kafka-sasl --from-literal password="<password>" -n seldon
    controller:
      clusterwide: true
    
    dataflow:
      resources:
        cpu: 500m
    
    envoy:
      service:
        type: ClusterIP
    
    kafka:
      bootstrap: <confluent-endpoints>
      topics:
        replicationFactor: 3
        numPartitions: 4
      consumer:
        messageMaxBytes: 8388608
      producer:
        messageMaxBytes: 8388608
    
    security:
      kafka:
        protocol: SASL_SSL
        sasl:
          mechanism: "PLAIN"
          client:
            username: <username>
            secret: confluent-kafka-sasl
        ssl:
          client:
            secret:
            brokerValidationSecret:
    
    opentelemetry:
      enable: false
    
    scheduler:
      service:
        type: ClusterIP
    
    serverConfig:
      mlserver:
        resources:
          cpu: 1
          memory: 2Gi
    
      triton:
        resources:
          cpu: 1
          memory: 2Gi
    
    serviceGRPCPrefix: "http2-"
    controller:
      clusterwide: true
    
    dataflow:
      resources:
        cpu: 500m
    
    envoy:
      service:
        type: ClusterIP
    
    kafka:
      bootstrap: <confluent-endpoints>
      topics:
        replicationFactor: 3
        numPartitions: 4
      consumer:
        messageMaxBytes: 8388608
      producer:
        messageMaxBytes: 8388608
    
    security:
      kafka:
        protocol: SASL_SSL
        sasl:
          mechanism: OAUTHBEARER
          client:
            secret: confluent-kafka-oauth
        ssl:
          client:
            secret:
            brokerValidationSecret:
    
    opentelemetry:
      enable: false
    
    scheduler:
      service:
        type: ClusterIP
    
    serverConfig:
      mlserver:
        resources:
          cpu: 1
          memory: 2Gi
    
      triton:
        resources:
          cpu: 1
          memory: 2Gi
    
    serviceGRPCPrefix: "http2-"
    https://github.com/SeldonIO/seldon-core/blob/v2/operator/config/seldonconfigs/default.yaml
    https://github.com/SeldonIO/seldon-core/blob/v2/samples/pipelines/tfsimples-join.yaml
    You must also explicitly define a value for spec.replicas. This is the key modified by HPA to increase the number of replicas, and if not present in the manifest it will result in HPA not working until the Model CR is modified to have spec.replicas defined.

Let's scale this model when it is deployed on a server named mlserver, with a target of 3 RPS per replica (a higher per-replica RPS triggers scale-up, a lower one triggers scale-down):

    If a Model gets scaled up slightly before its corresponding Server, the model is currently marked with the condition ModelReady "Status: False" with a "ScheduleFailed" message until new Server replicas become available. However, the existing replicas of that model remain available and will continue to serve inference load.

    The Object metric allows for two target value types: AverageValue and Value. Of the two, only AverageValue is supported for the current Seldon Core 2 setup. The Value target type is typically used for metrics describing the utilization of a resource and would not be suitable for RPS-based scaling.

    HPA metrics of type Object

    The example HPA manifests use metrics of type "Object" that fetch the data used in scaling decisions by querying k8s metrics associated with a particular k8s object. The endpoints that HPA uses for fetching those metrics are the same ones that were tested in the previous section using kubectl get --raw .... Because you have configured the Prometheus Adapter to expose those k8s metrics based on queries to Prometheus, a mapping exists between the information contained in the HPA Object metric definition and the actual query that is executed against Prometheus. This section aims to give more details on how this mapping works.

    In our example, the metric.name:infer_rps gets mapped to the seldon_model_infer_total metric on the prometheus side, based on the configuration in the name section of the Prometheus Adapter ConfigMap. The prometheus metric name is then used to fill in the <<.Series>> template in the query (metricsQuery in the same ConfigMap).
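As an illustration of this mapping (not the exact ConfigMap from your metrics setup; the label-to-resource overrides and the 2-minute rate window are assumptions for the example), a Prometheus Adapter rule of roughly this shape would expose seldon_model_infer_total as infer_rps:

rules:
- seriesQuery: 'seldon_model_infer_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      model: {group: "mlops.seldon.io", resource: "model"}
  name:
    matches: "seldon_model_infer_total"   # the Prometheus metric name, filling <<.Series>>
    as: "infer_rps"                       # the metric name referenced from the HPA manifest
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'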

    Then, the information provided in the describedObject is used within the Prometheus query to select the right aggregations of the metric. For the RPS metric used to scale the Model (and the Server because of the 1-1 mapping), it makes sense to compute the aggregate RPS across all the replicas of a given model, so the describedObject references a specific Model CR.

    However, in the general case, the describedObject does not need to be a Model. Any k8s object listed in the resources section of the Prometheus Adapter ConfigMap may be used. The Prometheus label associated with the object kind fills in the <<.GroupBy>> template, while the name gets used as part of the <<.LabelMatchers>>. For example:

    • If the described object is { kind: Namespace, name: seldon-mesh }, then the Prometheus query template configured in our example would be transformed into:

    • If the described object is not a namespace (for example, { kind: Pod, name: mlserver-0 }), then the query will be passed the label describing the object, alongside an additional label identifying the namespace where the HPA manifest resides:

    The target section establishes the thresholds used in scaling decisions. For RPS, the AverageValue target type refers to the threshold per replica RPS above which the number of the scaleTargetRef (Model or Server) replicas should be increased. The target number of replicas is being computed by HPA according to the following formula:

$\texttt{targetReplicas} = \frac{\texttt{infer\_rps}}{\texttt{averageValue}}$

    As an example, if averageValue=50 and infer_rps=150, the targetReplicas would be 3.

    Importantly, computing the target number of replicas does not require knowing the number of active pods currently associated with the Server or Model. This is what allows both the Model and the Server to be targeted by two separate HPA manifests. Otherwise, both HPA CRs would attempt to take ownership of the same set of pods, and transition into a failure state.

    This is also why the Value target type is not currently supported. In this case, HPA first computes an utilizationRatio:

$\texttt{utilizationRatio} = \frac{\texttt{custom\_metric\_value}}{\texttt{threshold\_value}}$

    As an example, if threshold_value=100 and custom_metric_value=200, the utilizationRatio would be 2. HPA deduces from this that the number of active pods associated with the scaleTargetRef object should be doubled, and expects that once that target is achieved, the custom_metric_value will become equal to the threshold_value (utilizationRatio=1). However, by using the number of active pods, the HPA CRs for both the Model and the Server also try to take exclusive ownership of the same set of pods, and fail.

Each HPA CR has its own timer on which it samples the specified custom metrics. This timer starts when the CR is created, with sampling of the metric being done at regular intervals (by default, every 15 seconds). When showing the HPA CR information via kubectl get, a column of the output will display the current metric value per replica and the target average value in the format [per-replica metric value]/[target]. This information is updated in accordance with the sampling rate of each HPA resource.

    Next Steps: Autoscaling Servers

Once Model autoscaling is set up (either through HPA or by Seldon Core), users will need to configure Server autoscaling. You can use Seldon Core's native autoscaling functionality for Servers.
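In Helm terms, this is controlled by the autoscaling.autoscalingServerEnabled value of the components chart listed in the configuration reference earlier on this page, for example:

autoscaling:
  autoscalingServerEnabled: true   # Seldon Core 2 native Server autoscaling (enabled by default)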

Otherwise, if you want to scale Servers using HPA as well - this only works in a setup where all Models and Servers have a 1-1 mapping - you will also need to set up HPA manifests for Servers, as sketched below.
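A sketch of what such a manifest could look like, assuming a Server named mlserver in a 1-1 mapping with the irisa0 Model from the earlier example, and reusing the same Model-scoped infer_rps metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-server-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: 3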

    Advanced settings

    • Filtering metrics by additional labels on the prometheus metric - The prometheus metric from which the model RPS is computed has the following labels managed by Seldon Core 2:

  If you want the scaling metric to be computed based on a subset of the Prometheus time series with particular label values (labels either managed by Seldon Core 2 or added automatically within your infrastructure), you can add this as a selector to the HPA metric config. This is shown in the following example, which scales only based on the RPS of REST requests as opposed to REST + gRPC:

    • Customize scale-up / scale-down rate & properties by using scaling policies as described in the HPA scaling policies docs

    • For more resources, please consult the HPA docs and the HPA walkthrough

    Cluster operation guidelines when using HPA-based scaling

    When deploying HPA-based scaling for Seldon Core 2 models and servers as part of a production deployment, it is important to understand the exact interactions between HPA-triggered actions and Seldon Core 2 scheduling, as well as potential pitfalls in choosing particular HPA configurations.

Using the default scaling policy, HPA is relatively aggressive on scale-up (responding quickly to increases in load), with a maximum replica increase of either 4 every 15 seconds or 100% of existing replicas within the same period (whichever is higher). In contrast, scaling down is more gradual, with HPA only scaling down to the maximum number of replicas recommended within the most recent 5-minute rolling window, in order to avoid flapping. Those parameters can be customized via scaling policies.

When using custom metrics such as RPS, the actual number of replicas added during scale-up or removed during scale-down will depend not only on the maximums imposed by the policy, but also on the configured target (averageValue RPS per replica) and on how quickly the inferencing load varies in your cluster. All three need to be considered jointly in order to deliver both efficient use of resources and SLA compliance.

    Customizing per-replica RPS targets and replica limits

    Naturally, the first thing to consider is an estimated peak inference load (including some margins) for each of the models in the cluster. If the minimum number of model replicas needed to serve that load without breaching latency SLAs is known, it should be set as spec.maxReplicas, with the HPA target.averageValue set to peak_infer_RPS/maxReplicas.

    If maxReplicas is not already known, an open-loop load test with a slowly ramping up request rate should be done on the target model (one replica, no scaling). This would allow you to determine the RPS (inference request throughput) when latency SLAs are breached or (depending on the desired operation point) when latency starts increasing. You would then set the HPA target.averageValue taking some margin below this saturation RPS, and compute spec.maxReplicas as peak_infer_RPS/target.averageValue. The margin taken below the saturation point is very important, because scaling-up cannot be instant (it requires spinning up new pods, downloading model artifacts, etc.). In the period until the new replicas become available, any load increases will still need to be absorbed by the existing replicas.
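As a worked example with illustrative numbers: if the single-replica load test shows latency starting to degrade at around 60 RPS and you take a margin down to target.averageValue = 50, then an estimated peak load of 150 RPS gives

$\texttt{maxReplicas} = \lceil \texttt{peak\_infer\_RPS} / \texttt{target.averageValue} \rceil = \lceil 150 / 50 \rceil = 3$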

    If there are multiple models which typically experience peak load in a correlated manner, you need to ensure that sufficient cluster resources are available for k8s to concurrently schedule the maximum number of server pods, with each pod holding one model replica. This can be ensured by using either Cluster Autoscaler or, when running workloads in the cloud, any provider-specific cluster autoscaling services.

    It is important for the cluster to have sufficient resources for creating the total number of desired server replicas set by the HPA CRs across all the models at a given time.

    Not having sufficient cluster resources to serve the number of replicas configured by HPA at a given moment, in particular under aggressive scale-up HPA policies, may result in breaches of SLAs. This is discussed in more detail in the following section.

A similar approach should be taken for setting minReplicas, in relation to the estimated RPS in the low-load regime. However, it's useful to balance lower resource usage against the immediate availability of replicas for inference rate increases from that lowest load point. If low-load regimes only occur for short periods of time, and especially when combined with a high rate of increase in RPS when moving out of the low-load regime, it might be worth setting the minReplicas floor higher in order to ensure SLAs are met at all times.

    Configuring Scaling Parameters

    The following elements are important to take into account when setting the HPA policies for models:

    • The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.

      • Say you configure a scale-up stabilization window of one minute. This means that of all the HPA replica recommendations in the last 60-second window (4 samples of the custom metric at the default sampling rate), only the smallest will be applied (see the sketch after this list for how such a window is configured).

      • Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.

    • The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.

      • It is useful to consider whether the replica scale-up rate configured via the policy is able to keep up with this RPS increase rate.

      • Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment, as you drain the "blue" deployment and transition to the "green" one.
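Both the stabilization windows and the scale-up rate limits mentioned above are expressed through the behavior section of the HPA spec. A minimal sketch follows (the values mirror the default policy described earlier and the one-minute scale-up window from the example; they are illustrative, not recommendations):

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # apply the smallest recommendation from the last minute
      policies:
      - type: Pods
        value: 4
        periodSeconds: 15
      - type: Percent
        value: 100
        periodSeconds: 15
      selectPolicy: Max                 # use the more permissive of the two policies
    scaleDown:
      stabilizationWindowSeconds: 300   # the default 5-minute rolling window, to avoid flapping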

    HPA Setup
GET /v2/models/{model_name}/versions/{model_version}
GET /v2/models/{model_name}

Responses: 200 OK (application/json), 404 Not Found (application/json)
    apiVersion: mlops.seldon.io/v1alpha1
    kind: SeldonConfig
    metadata:
      name: default
    spec:
      components:
      - name: seldon-dataflow-engine
        replicas: 1
        podSpec:
          containers:
          - env:
            - name: SELDON_UPSTREAM_HOST
              value: seldon-scheduler
            - name: SELDON_UPSTREAM_PORT
              value: "9008"
            - name: OTEL_JAVAAGENT_ENABLED
              valueFrom:
                configMapKeyRef:
                  key: OTEL_JAVAAGENT_ENABLED
                  name: seldon-tracing
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              valueFrom:
                configMapKeyRef:
                  key: OTEL_EXPORTER_OTLP_ENDPOINT
                  name: seldon-tracing
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              valueFrom:
                configMapKeyRef:
                  key: OTEL_EXPORTER_OTLP_PROTOCOL
                  name: seldon-tracing
            - name: SELDON_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            image: seldonio/seldon-dataflow-engine:latest
            imagePullPolicy: Always
            name: dataflow-engine
            resources:
              limits:
                memory: 3G
              requests:
                cpu: 100m
                memory: 1G
            ports:
              - containerPort: 8000
                name: health
            startupProbe:
              failureThreshold: 10
              httpGet:
                path: /startup
                port: health
              initialDelaySeconds: 10
              periodSeconds: 10
            readinessProbe:
              failureThreshold: 3
              httpGet:
                path: /ready
                port: health
              periodSeconds: 5
            livenessProbe:
              failureThreshold: 3
              httpGet:
                path: /live
                port: health
              periodSeconds: 5
            volumeMounts:
              - mountPath: /mnt/schema-registry
                name: kafka-schema-volume
                readOnly: true
          serviceAccountName: seldon-scheduler
          terminationGracePeriodSeconds: 5
          volumes:
            - secret:
                secretName: confluent-schema
                optional: true
              name: kafka-schema-volume
      - name: seldon-envoy
        replicas: 1
        annotations:
            "prometheus.io/path": "/stats/prometheus"
            "prometheus.io/port": "9003"
            "prometheus.io/scrape": "true"
        podSpec:
          containers:
          - image: seldonio/seldon-envoy:latest
            imagePullPolicy: Always
            name: envoy
            ports:
            - containerPort: 9000
              name: http
            - containerPort: 9003
              name: envoy-stats
            resources:
              limits:
                memory: 128Mi
              requests:
                cpu: 100m
                memory: 128Mi
            readinessProbe:
              httpGet:
                path: /ready
                port: envoy-stats
              initialDelaySeconds: 10
              periodSeconds: 5
              failureThreshold: 3
          terminationGracePeriodSeconds: 5
      - name: hodometer
        replicas: 1
        podSpec:
          containers:
          - env:
            - name: PUBLISH_URL
              value: http://hodometer.seldon.io
            - name: SCHEDULER_HOST
              value: seldon-scheduler
            - name: SCHEDULER_PLAINTXT_PORT
              value: "9004"
            - name: SCHEDULER_TLS_PORT
              value: "9044"
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            image: seldonio/seldon-hodometer:latest
            imagePullPolicy: Always
            name: hodometer
            resources:
              limits:
                memory: 32Mi
              requests:
                cpu: 1m
                memory: 32Mi
          serviceAccountName: hodometer
          terminationGracePeriodSeconds: 5
      - name: seldon-modelgateway
        replicas: 1
        podSpec:
          containers:
          - args:
            - --scheduler-host=seldon-scheduler
            - --scheduler-plaintxt-port=$(SELDON_SCHEDULER_PLAINTXT_PORT)
            - --scheduler-tls-port=$(SELDON_SCHEDULER_TLS_PORT)
            - --envoy-host=seldon-mesh
            - --envoy-port=80
            - --kafka-config-path=/mnt/kafka/kafka.json
            - --tracing-config-path=/mnt/tracing/tracing.json
            - --log-level=$(LOG_LEVEL)
            - --health-probe-port=$(HEALTH_PROBE_PORT)
            command:
            - /bin/modelgateway
            env:
            - name: SELDON_SCHEDULER_PLAINTXT_PORT
              value: "9004"
            - name: SELDON_SCHEDULER_TLS_PORT
              value: "9044"
            - name: MODELGATEWAY_MAX_NUM_CONSUMERS
              value: "100"
            - name: LOG_LEVEL
              value: "warn"
            - name: HEALTH_PROBE_PORT
              value: "9999"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            image: seldonio/seldon-modelgateway:latest
            imagePullPolicy: Always
            name: modelgateway
            resources:
              limits:
                memory: 1G
              requests:
                cpu: 100m
                memory: 1G
            ports:
              - containerPort: 9999
                name: health
                protocol: TCP
            startupProbe:
              httpGet:
                path: /startup
                port: health
              initialDelaySeconds: 10
              periodSeconds: 10
              failureThreshold: 10
            readinessProbe:
              httpGet:
                path: /ready
                port: health
              periodSeconds: 5
              failureThreshold: 3
            livenessProbe:
              httpGet:
                path: /live
                port: health
              periodSeconds: 5
              failureThreshold: 3
            volumeMounts:
            - mountPath: /mnt/kafka
              name: kafka-config-volume
            - mountPath: /mnt/tracing
              name: tracing-config-volume
            - mountPath: /mnt/schema-registry
              name: kafka-schema-volume
              readOnly: true
          serviceAccountName: seldon-scheduler
          terminationGracePeriodSeconds: 5
          volumes:
          - configMap:
              name: seldon-kafka
            name: kafka-config-volume
          - configMap:
              name: seldon-tracing
            name: tracing-config-volume
          - secret:
              secretName: confluent-schema
              optional: true
            name: kafka-schema-volume
      - name: seldon-pipelinegateway
        replicas: 1
        podSpec:
          containers:
          - args:
            - --http-port=9010
            - --grpc-port=9011
            - --metrics-port=9006
            - --scheduler-host=seldon-scheduler
            - --scheduler-plaintxt-port=$(SELDON_SCHEDULER_PLAINTXT_PORT)
            - --scheduler-tls-port=$(SELDON_SCHEDULER_TLS_PORT)
            - --envoy-host=seldon-mesh
            - --envoy-port=80
            - --kafka-config-path=/mnt/kafka/kafka.json
            - --tracing-config-path=/mnt/tracing/tracing.json
            - --log-level=$(LOG_LEVEL)
            - --health-probe-port=$(HEALTH_PROBE_PORT)
            command:
            - /bin/pipelinegateway
            env:
            - name: SELDON_SCHEDULER_PLAINTXT_PORT
              value: "9004"
            - name: SELDON_SCHEDULER_TLS_PORT
              value: "9044"
            - name: LOG_LEVEL
              value: "warn"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: HEALTH_PROBE_PORT
              value: "9999"
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            image: seldonio/seldon-pipelinegateway
            imagePullPolicy: Always
            name: pipelinegateway
            ports:
            - containerPort: 9010
              name: http
              protocol: TCP
            - containerPort: 9011
              name: grpc
              protocol: TCP
            - containerPort: 9006
              name: metrics
              protocol: TCP
            - containerPort: 9999
              name: health
              protocol: TCP
            resources:
              limits:
                memory: 1G
              requests:
                cpu: 100m
                memory: 1G
            startupProbe:
              httpGet:
                path: /startup
                port: health
              initialDelaySeconds: 10
              periodSeconds: 10
              failureThreshold: 10
            readinessProbe:
              httpGet:
                path: /ready
                port: health
              periodSeconds: 5
              failureThreshold: 3
            livenessProbe:
              httpGet:
                path: /live
                port: health
              periodSeconds: 5
              failureThreshold: 3
            volumeMounts:
            - mountPath: /mnt/kafka
              name: kafka-config-volume
            - mountPath: /mnt/tracing
              name: tracing-config-volume
            - mountPath: /mnt/schema-registry
              name: kafka-schema-volume
              readOnly: true
          serviceAccountName: seldon-scheduler
          terminationGracePeriodSeconds: 5
          volumes:
          - configMap:
              name: seldon-kafka
            name: kafka-config-volume
          - configMap:
              name: seldon-tracing
            name: tracing-config-volume
          - secret:
              secretName: confluent-schema
              optional: true
            name: kafka-schema-volume
      - name: seldon-scheduler
        replicas: 1
        podSpec:
          containers:
          - args:
            - --pipeline-gateway-host=seldon-pipelinegateway
            - --tracing-config-path=/mnt/tracing/tracing.json
            - --db-path=/mnt/scheduler/db
            - --allow-plaintxt=$(ALLOW_PLAINTXT)
            - --kafka-config-path=/mnt/kafka/kafka.json
            - --scaling-config-path=/mnt/scaling/scaling.yaml
            - --scheduler-ready-timeout-seconds=$(SCHEDULER_READY_TIMEOUT_SECONDS)
            - --server-packing-enabled=$(SERVER_PACKING_ENABLED)
            - --server-packing-percentage=$(SERVER_PACKING_PERCENTAGE)
            - --envoy-accesslog-path=$(ENVOY_ACCESSLOG_PATH)
            - --enable-envoy-accesslog=$(ENABLE_ENVOY_ACCESSLOG)
            - --include-successful-requests-envoy-accesslog=$(INCLUDE_SUCCESSFUL_REQUESTS_ENVOY_ACCESSLOG)
            - --enable-model-autoscaling=$(ENABLE_MODEL_AUTOSCALING)
            - --enable-server-autoscaling=$(ENABLE_SERVER_AUTOSCALING)
            - --log-level=$(LOG_LEVEL)
            - --health-probe-port=$(HEALTH_PROBE_PORT)
            - --enable-pprof=$(ENABLE_PPROF)
            - --pprof-port=$(PPROF_PORT)
            - --pprof-block-rate=$(PPROF_BLOCK_RATE)
            - --pprof-mutex-rate=$(PPROF_MUTEX_RATE)
            - --retry-creating-failed-pipelines-tick=$(RETRY_CREATING_FAILED_PIPELINES_TICK)
            - --retry-deleting-failed-pipelines-tick=$(RETRY_DELETING_FAILED_PIPELINES_TICK)
            - --max-retry-failed-pipelines=$(MAX_RETRY_FAILED_PIPELINES)
            command:
            - /bin/scheduler
            env:
            - name: ALLOW_PLAINTXT
              value: "true"
            - name: SCHEDULER_READY_TIMEOUT_SECONDS
              value: "600"
            - name: SERVER_PACKING_ENABLED
              value: "false"
            - name: SERVER_PACKING_PERCENTAGE
              value: "0.0"
            - name: ENVOY_ACCESSLOG_PATH
              value: /tmp/envoy-accesslog.txt
            - name: ENABLE_ENVOY_ACCESSLOG
              value: "true"
            - name: INCLUDE_SUCCESSFUL_REQUESTS_ENVOY_ACCESSLOG
              value: "false"
            - name: ENABLE_MODEL_AUTOSCALING
              value: "false"
            - name: ENABLE_SERVER_AUTOSCALING
              value: "true"
            - name: LOG_LEVEL
              value: "warn"
            - name: MODELGATEWAY_MAX_NUM_CONSUMERS
              value: "100"
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: HEALTH_PROBE_PORT
              value: "9999"
            - name: ENABLE_PPROF
              value: "false"
            - name: PPROF_PORT
              value: "6060"
            - name: PPROF_BLOCK_RATE
              value: "0"
            - name: PPROF_MUTEX_RATE
              value: "0"
            - name: RETRY_CREATING_FAILED_PIPELINES_TICK
              value: "60s"
            - name: RETRY_DELETING_FAILED_PIPELINES_TICK
              value: "60s"
            - name: MAX_RETRY_FAILED_PIPELINES
              value: "10"
            image: seldonio/seldon-scheduler:latest
            imagePullPolicy: Always
            name: scheduler
            ports:
            - containerPort: 9002
              name: xds
            - containerPort: 9004
              name: scheduler
            - containerPort: 9044
              name: scheduler-mtls
            - containerPort: 9005
              name: agent
            - containerPort: 9055
              name: agent-mtls
            - containerPort: 9008
              name: dataflow
            - containerPort: 9999
              name: health
              protocol: TCP
            readinessProbe:
              httpGet:
                path: /ready
                port: health
              periodSeconds: 5
              failureThreshold: 3
              initialDelaySeconds: 10
            livenessProbe:
              httpGet:
                path: /live
                port: health
              periodSeconds: 5
              failureThreshold: 3
              initialDelaySeconds: 10
            resources:
              limits:
                memory: 1G
              requests:
                cpu: 100m
                memory: 1G
            volumeMounts:
            - mountPath: /mnt/kafka
              name: kafka-config-volume
            - mountPath: /mnt/scaling
              name: scaling-config-volume
            - mountPath: /mnt/tracing
              name: tracing-config-volume
            - mountPath: /mnt/scheduler
              name: scheduler-state
          serviceAccountName: seldon-scheduler
          terminationGracePeriodSeconds: 5
          volumes:
          - configMap:
              name: seldon-scaling
            name: scaling-config-volume
          - configMap:
              name: seldon-kafka
            name: kafka-config-volume
          - configMap:
              name: seldon-tracing
            name: tracing-config-volume
        volumeClaimTemplates:
        - name: scheduler-state
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 1G
    
    kubectl apply -f kafka-secret.yaml -n seldon
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: join
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
        - name: tfsimple3      
          inputs:
          - tfsimple1.outputs.OUTPUT0
          - tfsimple2.outputs.OUTPUT1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple2.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple3
    
    seldon_model_infer_total{
        code="200", 
        container="agent", 
        endpoint="metrics", 
        instance="10.244.0.39:9006", 
        job="seldon-mesh/agent", 
        method_type="rest", 
        model="irisa0", 
        model_internal="irisa0_1", 
        namespace="seldon-mesh", 
        pod="mlserver-0", 
        server="mlserver", 
        server_replica="0"
    }
      metrics:
      - type: Object
        object:
          describedObject:
            apiVersion: mlops.seldon.io/v1alpha1
            kind: Model
            name: irisa0
          metric:
            name: infer_rps
            selector:
              matchLabels:
                method_type: rest
          target:
            type: AverageValue
            averageValue: "3"
    irisa0.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: irisa0
      namespace: seldon-mesh
    spec:
      memory: 3M
      replicas: 1
      requirements:
      - sklearn
      storageUri: gs://seldon-models/testing/iris1
    irisa0-hpa.yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: irisa0-model-hpa
      namespace: seldon-mesh
    spec:
      scaleTargetRef:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Object
        object:
          metric:
            name: infer_rps
          describedObject:
            apiVersion: mlops.seldon.io/v1alpha1
            kind: Model
            name: irisa0
          target:
            type: AverageValue
            averageValue: 3
    sum by (namespace) (
      rate (
        seldon_model_infer_total{namespace="seldon-mesh"}[2m]
      )
    )
    sum by (pod) (
      rate (
        seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[2m]
      )
    )
    {
      "name": "text",
      "version": "text",
      "extensions": [
        "text"
      ]
    }
    GET /v2/models/{model_name}/versions/{model_version} HTTP/1.1
    Host: 
    Accept: */*
    
    {
      "name": "text",
      "versions": [
        "text"
      ],
      "platform": "text",
      "inputs": [
        {
          "name": "text",
          "datatype": "text",
          "shape": [
            1
          ],
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          }
        }
      ],
      "outputs": [
        {
          "name": "text",
          "datatype": "text",
          "shape": [
            1
          ],
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          }
        }
      ],
      "parameters": {
        "content_type": "text",
        "headers": {},
        "ANY_ADDITIONAL_PROPERTY": "anything"
      }
    }
    GET /v2/models/{model_name} HTTP/1.1
    Host: 
    Accept: */*
    
    {
      "name": "text",
      "versions": [
        "text"
      ],
      "platform": "text",
      "inputs": [
        {
          "name": "text",
          "datatype": "text",
          "shape": [
            1
          ],
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          }
        }
      ],
      "outputs": [
        {
          "name": "text",
          "datatype": "text",
          "shape": [
            1
          ],
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          }
        }
      ],
      "parameters": {
        "content_type": "text",
        "headers": {},
        "ANY_ADDITIONAL_PROPERTY": "anything"
      }
    }
    GET /v2 HTTP/1.1
    Host: 
    Accept: */*
    

How each component scales (maxShardCountMultiplier is at most #partitions):

• Dataflow engine: scales with #pipelines × maxShardCountMultiplier (capped by configured replicas); max replicas used: min(replicas, pipelines × partitions)

• Model gateway: scales with #models × maxShardCountMultiplier (capped by replicas and maxNumConsumers); max replicas used: min(replicas, min(models, maxNumConsumers) × partitions)

• Pipeline gateway: scales with #pipelines × maxShardCountMultiplier (capped by replicas and maxNumConsumers); max replicas used: min(replicas, min(pipelines, maxNumConsumers) × partitions)

    Each pipeline/model is loaded only on a subset of replicas, and automatically rebalanced when:

    1. You scale replicas up/down

    2. You deploy or delete pipelines / models

    The configuration parameter determining the maximum number of component replicas on which a pipeline/model can be loaded is maxShardCountMultiplier. It can be set in SeldonConfig under config.scalingConfig.pipelines.maxShardCountMultiplier. For installs via Helm, the value of this parameter defaults to {{ .Values.kafka.topics.numPartitions }}. The number of Kafka partitions per topic is the maximum value that maxShardCountMultiplier should be set to: increasing it beyond the number of Kafka partitions brings no additional performance benefit and may actually lead to dropped requests, because the extra replicas receive requests while managing no Kafka partitions within their consumer groups.
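    As a sketch, and assuming the usual SeldonConfig layout where these settings live under spec.config, the parameter could be set as follows (the resource name and surrounding skeleton are placeholders):

```yaml
# Illustrative only: set maxShardCountMultiplier in SeldonConfig.
# The config.scalingConfig.pipelines.maxShardCountMultiplier path follows the
# description above; the surrounding resource skeleton is an assumption.
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonConfig
metadata:
  name: default
spec:
  config:
    scalingConfig:
      pipelines:
        maxShardCountMultiplier: 4  # keep <= number of Kafka partitions per topic
```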

    This parameter may be changed during cluster operation, with the new value propagated to all components over an interval of roughly one minute. Changing it causes pipelines/models to be rebalanced across the dataflow-engine, model-gateway, and pipeline-gateway replicas, and may lead to downtime depending on the configured Kafka partition assignment strategy. If your Kafka version supports cooperative rebalancing of consumer groups, setting the partition assignment strategy to cooperative-sticky ensures that rebalancing happens with minimal disruption. dataflow-engine uses a cooperative rebalancing strategy by default.
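    As an illustrative sketch only, mirroring the Kafka producer compression override used elsewhere in these docs, the consumer partition assignment strategy could be overridden via runtime Helm values along these lines (the kafkaConfig.consumer key is assumed to accept standard Kafka consumer properties; verify against your installation):

```yaml
# Hypothetical Helm values override: prefer cooperative-sticky rebalancing for
# Kafka consumers, so that shard rebalances cause minimal disruption.
config:
  kafkaConfig:
    consumer:
      partition.assignment.strategy: cooperative-sticky
```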

    You do not need to manually assign work — it’s handled automatically.

    2. Scaling the dataflow engine

    Dataflow engine is responsible for executing pipeline logic and moving data between pipeline stages. Core 2 now supports running multiple pipelines in parallel across multiple dataflow engine replicas.

    2.1. What controls scaling?

    You control scaling using:

| Config | Location | Purpose |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of dataflow engine instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per pipeline (max possible: #Kafka partitions) |

    2.2. How many replicas will actually be used?

    The number of dataflow engine replicas in use is adjusted dynamically based on the number of pipelines deployed and the number of Kafka partitions. The final number of dataflow engine replicas is given by:

    $\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \text{pipelines} \times \text{partitions})$

    Example

| Pipelines deployed | maxShardCountMultiplier | spec.replicas | Final dataflow replicas used |
| --- | --- | --- | --- |
| 3 | 4 | 9 | min(9, 3 x 4 = 12) → 9 replicas |
| 2 | 4 | 9 | min(9, 2 x 4 = 8) → 8 replicas |
| 1 | 4 | 9 | min(9, 1 x 4 = 4) → 4 replicas |

    Note: Unused replicas are automatically scaled down. As more pipelines are added, the dataflow engine automatically scales up, capped by the maximum number of replicas.

    2.3. How are pipelines assigned to replicas?

    • Core 2 uses consistent hashing to distribute pipelines evenly across dataflow replicas. This ensures a balanced workload, but it does not guarantee a perfect one-to-one mapping.

      • Even if the number of replicas equals pipelines × partitions, small imbalances between the number of pipelines handled by each replica may exist. In practice, the distribution is statistically uniform.

    • Each pipeline is replicated across multiple dataflow engines (up to number of Kafka partitions).

    • When instances are added or removed, pipelines are automatically rebalanced.

    Note: This process is handled internally by Core 2, so no manual intervention is needed.

    2.4. Loading/unloading of the pipelines from dataflow engine

    • Loading/unloading of the pipeline from the dataflow engine is performed when the pipeline CR is loaded/unloaded.

    • The scheduler confirms whether the loading/unloading was performed successfully through the Pipeline status under the CR.

    Rebalancing happens in the background — you don’t need to intervene.

    Note: For pipelines, Pipeline ready status must be satisfied in order for the pipeline to be marked ready.
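    For example, you can inspect the scheduler-reported conditions on the Pipeline resource directly; the pipeline name below is a placeholder and the exact condition names may vary by version:

```bash
# Illustrative: view the status conditions (e.g. PipelineReady) on a Pipeline CR.
kubectl get pipeline mypipeline -n ${NAMESPACE} -o jsonpath='{.status.conditions}' | jq .
```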

    3. Scaling the model gateway

    The model gateway is responsible for routing inference requests to models when used inside pipelines. Like the dataflow engine, it scales dynamically based on how many models are deployed.

    3.1. What controls scaling?

| Config | Location | Purpose |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of model gateway instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per model (#Kafka partitions) |
| maxNumConsumers | SeldonConfig - model gateway env var (default: 100) | Caps how many distinct consumer groups can exist |

    3.2. How many replicas will actually be used?

    The number of model gateway replicas in use is adjusted dynamically based on the number of models deployed, the number of Kafka partitions, and maxNumConsumers. The final number of model gateway replicas is given by:

    $\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{models}, \text{maxNumConsumers}) \times \text{partitions})$

    Example

| Models deployed | maxShardCountMultiplier | spec.replicas | maxNumConsumers | Final model gateway replicas |
| --- | --- | --- | --- | --- |
| 5 | 4 | 20 | 100 | min(20, min(5, 100) x 4 = 20) = 20 → 20 replicas |
| 1 | 4 | 20 | 100 | min(20, min(1, 100) x 4 = 4) = 4 → 4 replicas |

    Note: If you remove models, the model gateway automatically scales down, and if you add models, it automatically scales up, capped by the maximum number of replicas.

    3.3. How are models assigned to replicas?

    Model gateway doesn’t load every model on every replica but only on a subset of replicas. The same principle as for dataflow engine applies for model gateway (sharding through consistent hashing).

    3.4. Loading/unloading of the models from model gateway

    • Loading/unloading of the model from the model gateway is performed when the model CR is loaded/unloaded.

    • The scheduler confirms whether the loading/unloading was performed successfully through the ModelGw status under the CR.

    Rebalancing happens in the background — you don’t need to intervene.

    Note: The ModelGw status is not a condition for the model to be available. If loading succeeded on the assigned inference servers, the model itself is ready for inference.

    • The ModelGw status becomes relevant for pipelines, or when the end user wants to perform inference via the async path (i.e., writing requests to the model input topic and reading responses from the model output topic in Kafka).

    • In the context of pipelines, the ModelReady status becomes the conjunction of the model being available on servers and the model having been loaded successfully on the model gateway.

    4. Scaling the pipeline gateway

    The pipeline gateway is responsible for writing requests to the pipeline's input topic and waiting for the response on its output topic. Like the dataflow engine and model gateway, the pipeline gateway can scale horizontally.

    4.1. What controls scaling?

| Config | Location | Purpose |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of pipeline gateway instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per pipeline (#Kafka partitions) |
| maxNumConsumers | SeldonConfig - pipeline gateway env var (default: 100) | Caps how many distinct consumer groups can exist |

    4.2. How many replicas will actually be used?

    The number of pipeline gateway replicas in use is adjusted dynamically based on the number of pipelines deployed, the number of Kafka partitions, and maxNumConsumers. The final number of pipeline gateway replicas is given by:

    $\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{pipelines}, \text{maxNumConsumers}) \times \text{partitions})$

    Example

| Pipelines deployed | maxShardCountMultiplier | spec.replicas | maxNumConsumers | Final pipeline gateway replicas |
| --- | --- | --- | --- | --- |
| 8 | 4 | 10 | 100 | min(10, min(8, 100) x 4 = 32) = 10 → 10 replicas |
| 2 | 4 | 10 | 100 | min(10, min(2, 100) x 4 = 8) = 8 → 8 replicas |
| 1 | 4 | 10 | 100 | min(10, min(1, 100) x 4 = 4) = 4 → 4 replicas |

    Note: Similarly to the dataflow engine, the pipeline gateway scales up and down as pipelines are added and removed.

    4.3. How are pipelines assigned to replicas?

    Pipeline gateway doesn’t load every pipeline on every replica but only on a subset of replicas. The same principle as for dataflow engine and model gateway applies for pipeline gateway (sharding through consistent hashing).

    4.4. Loading/unloading of the pipelines from Pipeline Gateway

    • Loading/unloading of the pipeline from the pipeline gateway is performed when the pipeline CR is loaded/unloaded.

    • The scheduler confirms whether the loading/unloading was performed successfully through the PipelineGw status under the CR.

    Analogous with the previous services, rebalancing happens in the background — you don’t need to intervene.

    Note: For pipelines, PipelineGw ready status must be satisfied in order for the pipeline to be marked ready.


    Synchronous Requests

    For making synchronous requests, the process will generally be:

    1. Find the appropriate service endpoint (IP address and port) for accessing the installation of Seldon Core 2.

    2. Determine the appropriate headers/metadata for the request.

    3. Make requests via REST or gRPC.

    Find the Seldon Service Endpoint

    In the default Docker Compose setup, container ports are accessible from the host machine. This means you can use localhost or 0.0.0.0 as the hostname.

    The default port for sending inference requests to the Seldon system is 9000. This is controlled by the ENVOY_DATA_PORT environment variable for Compose.

    Putting this together, you can send inference requests to 0.0.0.0:9000.

    In Kubernetes, Seldon creates a single Service called seldon-mesh in the namespace it is installed into. By default, this namespace is also called seldon-mesh.

    If this Service is exposed via a load balancer, the appropriate address and port can be found via:

    If you are not using a LoadBalancer for the seldon-mesh Service, you can still send inference requests.

    For development and testing purposes, you can port-forward the Service locally using the below. Inference requests can then be sent to localhost:8080.

    If you are using a service mesh like Istio or Ambassador, you will need to use the IP address of the service mesh ingress and determine the appropriate port.

    Make Inference Requests

    Let us imagine making inference requests to a model called iris.

    This iris model has the following schema, which can be set in a model-settings.json file for MLServer:

    Examples are given below for some common tools for making requests.

    An example seldon request might look like this:

    The default inference mode is REST, but you can also send gRPC requests like this:

    An example curl request might look like this:

    An example grpcurl request might look like this:

    The above request was run from the project root folder allowing reference to the Protobuf manifests defined in the apis/ folder.

    You can use the tritonclient Python package to send inference requests.

    A short, self-contained example is:

    Note: For pipelines, a synchronous request is possible if the pipeline has an outputs section defined in its spec.

    Request Routing

    Seldon Routes

    Seldon needs to determine where to route requests to, as models and pipelines might have the same name. There are two ways of doing this: header-based routing (preferred) and path-based routing.

    Seldon can route requests to the correct endpoint via headers in HTTP calls, both for REST (HTTP/1.1) and gRPC (HTTP/2).

    Use the Seldon-Model header as follows:

    • For models, use the model name as the value. For example, to send requests to a model named foo use the header Seldon-Model: foo.

    • For pipelines, use the pipeline name followed by .pipeline as the value. For example, to send requests to a pipeline named foo use the header Seldon-Model: foo.pipeline.

    The seldon CLI is aware of these rules and can be used to easily send requests to your deployed resources. See the examples and the Seldon CLI docs for more information.

    The inference v2 protocol is only aware of models, thus has no concept of pipelines. Seldon works around this limitation by introducing virtual endpoints for pipelines. Virtual means that Seldon understands them, but other v2 protocol-compatible components like inference servers do not.

    Use the following rules for paths to route to models and pipelines:

    • For models, use the path prefix /v2/models/{model name}. This is normal usage of the inference v2 protocol.

    • For pipelines, you can use the path prefix /v2/pipelines/{pipeline name}. Otherwise, calling pipelines looks just like the inference v2 protocol for models; do not use any suffix for the pipeline name as you would for routing headers. Alternatively, you can use the path prefix /v2/models/{pipeline name}.pipeline, which again looks just like the inference v2 protocol for models (see the example after this list).
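    For illustration, a path-based request to a hypothetical pipeline named mypipeline might look like the following; the host and payload follow the curl examples used throughout these docs and should be adjusted for your deployment:

```bash
# Illustrative: path-based routing to a pipeline. The /v2/pipelines/ prefix is a
# Seldon virtual endpoint; no ".pipeline" suffix is needed in the path itself.
curl http://0.0.0.0:9000/v2/pipelines/mypipeline/infer \
    -H "Content-Type: application/json" \
    -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
```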

    Extending our examples from above, the requests may look like the below when using header-based routing.

    No changes are required as the seldon CLI already understands how to set the appropriate gRPC and REST headers.

    Note the rpc-header flag in the penultimate line:

    Note the headers dictionary in the client.infer() call:

    Ingress Routes

    If you are using an ingress controller to make inference requests with Seldon, you will need to configure the routing rules correctly.

    There are many ways to do this, but custom path prefixes will not work with gRPC. This is because gRPC determines the path based on the Protobuf definition. Some gRPC implementations permit manipulating paths when sending requests, but this is by no means universal.

    If you want to expose your inference endpoints via gRPC and REST in a consistent way, you should use virtual hosts, subdomains, or headers.

    The downside of using only paths is that you cannot differentiate between different installations of Seldon Core 2 or between traffic to Seldon and any other inference endpoints you may have exposed via the same ingress.

    You might want to use a mixture of these methods; the choice is yours.

    Virtual hosts are a way of differentiating between logical services accessed via the same physical machine(s).

    Virtual hosts are defined by the Host header for HTTP/1 and the :authority pseudo-header for HTTP/2. These represent the same thing, and the HTTP/2 specification defines how to translate these when converting between protocol versions.

    Many tools and libraries treat these headers as special and have particular ways of handling them. Some common ones are given below:

    • The seldon CLI has an --authority flag which applies to both REST and gRPC inference calls.

    • curl accepts Host as a normal header.

    • grpcurl has an -authority flag.

    • In Go, the standard library's http.Request struct has a Host field and ignores attempts to set this value via headers.

    • In Python, the requests library accepts the host as a normal header.

    Be sure to check the documentation for how to set this with your preferred tools and languages.
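    For example, with curl you could reach a Seldon installation behind a shared ingress by setting the virtual host explicitly; the ingress address and host name below are assumptions for your environment:

```bash
# Illustrative: virtual-host routing. Replace the ingress address and Host value
# with those configured for your ingress.
curl http://<ingress-address>/v2/models/iris/infer \
    -H "Host: seldon.example.com" \
    -H "Content-Type: application/json" \
    -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
```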

    Subdomain names constitute a part of the overall host name. As such, specifying a subdomain name for requests will involve setting the appropriate host in the URI.

    For example, you may expose inference services in the namespaces seldon-1 and seldon-2 as in the following snippets:

    Many popular ingresses support subdomain-based routing, including Istio and Nginx. Please refer to the documentation for your ingress of choice for further information.

    Many ingress controllers and service meshes support routing on headers. You can use whatever headers you prefer, so long as they do not conflict with any headers Seldon relies upon.

    Many tools and libraries support adding custom headers to requests. Some common ones are given below:

    • The seldon CLI accepts headers using the --header flag, which can be specified multiple times.

    • curl accepts headers using the -H or --header flags.

    • grpcurl accepts headers using the -H flag, which can be specified multiple times.

    • tritonclient accepts a headers dictionary in calls such as client.infer().

    It is possible to route on paths by using well-known path prefixes defined by the inference v2 protocol. For gRPC, the full path (or "method") for an inference call is:

    This corresponds to the package (inference), service (GRPCInferenceService), and RPC name (ModelInfer) in the Protobuf definition of the inference v2 protocol.

    You could use an exact match or a regex like .*inference.* to match this path, for example.
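    As a sketch, if you are using Istio, a VirtualService that routes gRPC inference calls by matching this path with a regex might look like the following; the gateway, hosts, and destination values are assumptions for your environment:

```yaml
# Illustrative Istio VirtualService: route gRPC inference calls by path regex.
# Gateway, hosts, and destination are placeholders to adapt to your setup.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: seldon-grpc-inference
  namespace: seldon-mesh
spec:
  hosts:
  - "*"
  gateways:
  - seldon-gateway
  http:
  - match:
    - uri:
        regex: ".*inference.*"
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
        port:
          number: 80
```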

    Asynchronous Requests

    The Seldon architecture uses Kafka and therefore asynchronous requests can be sent by pushing inference v2 protocol payloads to the appropriate topic. Topics have the following form:

    Note: If writing to a pipeline topic, you will need to include a Kafka header with the key pipeline and the value being the name of the pipeline.

    Model Inference

    For a local install if you have a model iris, you would be able to send a prediction request by pushing to the topic: seldon.default.model.iris.inputs. The response will appear on seldon.default.model.iris.outputs.

    For a Kubernetes install in seldon-mesh if you have a model iris, you would be able to send a prediction request by pushing to the topic: seldon.seldon-mesh.model.iris.inputs. The response will appear on seldon.seldon-mesh.model.iris.outputs.

    Pipeline Inference

    For a local install if you have a pipeline mypipeline, you would be able to send a prediction request by pushing to the topic: seldon.default.pipeline.mypipeline.inputs. The response will appear on seldon.default.pipeline.mypipeline.outputs.

    For a Kubernetes install in seldon-mesh if you have a pipeline mypipeline, you would be able to send a prediction request by pushing to the topic: seldon.seldon-mesh.pipeline.mypipeline.inputs. The response will appear on seldon.seldon-mesh.pipeline.mypipeline.outputs.
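    A minimal sketch of producing such a payload with the confluent-kafka Python client is shown below. It assumes a broker reachable at localhost:9092, a pipeline named mypipeline in a local (default namespace) install, and that your installation accepts JSON-encoded v2 payloads on these topics; adjust the topic, namespace, and serialization for your setup:

```python
# Illustrative sketch: push an inference v2 payload to a pipeline input topic.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

payload = {
    "inputs": [
        {"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}
    ]
}

producer.produce(
    topic="seldon.default.pipeline.mypipeline.inputs",
    value=json.dumps(payload).encode("utf-8"),
    # When writing to a pipeline topic, a "pipeline" header naming the pipeline is required.
    headers=[("pipeline", b"mypipeline")],
)
producer.flush()

# Once processed, the response appears on seldon.default.pipeline.mypipeline.outputs.
```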

    Pipeline Metadata

    It may be useful to send metadata alongside your inference.

    If using Kafka directly as described above, you can attach Kafka metadata to your request, which will be passed around the graph. When making synchronous requests to your pipeline with REST or gRPC you can also do this.

    • For REST requests add HTTP headers prefixed with X-

    • For gRPC requests add metadata with keys starting with X-

    You can also do this with the Seldon CLI by setting headers with the --header argument (and also showing response headers with the --show-headers argument)

    For pipeline inference, the response also contains a x-pipeline-version header, indicating which version of pipeline it ran inference with.

    Request IDs

    For both model and pipeline requests the response will contain a x-request-id response header. For pipeline requests this can be used to inspect the pipeline steps via the CLI, e.g.:

    The --offset parameter specifies how many messages (from the latest) you want to search to find your request. If not specified the last request will be shown.

    x-request-id will also appear in tracing spans.

    If x-request-id is passed in by the caller then this will be used. It is the caller's responsibility to ensure it is unique.

    The IDs generated are XIDs.

    Responses
    200

    OK

    No content

    get
    /v2/health/live
    200

    OK

    No content

    200

    OK

    No content

    Path parameters
    model_namestringRequired
    Responses
    200

    OK

    No content

    get
    /v2/models/{model_name}/ready
    200

    OK

    No content

    Body
    idstringOptional
    Responses
    200

    OK

    application/json
    400

    Bad Request

    application/json
    404

    Not Found

    application/json
    500

    Internal Server Error

    application/json
    post
    /v2/models/{model_name}/versions/{model_version}/infer
    idstringOptional
    Responses
    200

    OK

    application/json
    400

    Bad Request

    application/json
    404

    Not Found

    application/json
    500

    Internal Server Error

    application/json
    post
    /v2/models/{model_name}/infer

    Model zoo

    Examples of various model artifact types from various frameworks running under Seldon Core 2.

    • SKlearn

    • Tensorflow

    • XGBoost

    • ONNX

    • Lightgbm

    • MLFlow

    • PyTorch

    Python requirements in model-zoo-requirements.txt

    SKLearn Iris Classification Model

    The training code for this model can be found at scripts/models/iris in the SCv2 repo.

    Tensorflow CIFAR10 Image Classification Model

    XGBoost Model

    The training code for this model can be found at ./scripts/models/income-xgb

    ONNX MNIST Model

    This model is a pretrained model as defined in ./scripts/models/Makefile target mnist-onnx

    LightGBM Model

    The training code for this model can be found at ./scripts/models/income-lgb

    MLFlow Wine Model

    The training code for this model can be found at ./scripts/models/wine-mlflow

    Pytorch MNIST Model

    This example model is downloaded and trained in ./scripts/models/Makefile target mnist-pytorch

    Local examples

    SKLearn Model

    We use a simple sklearn iris classification model

    Load the model

    kubectl apply -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/iris created
    seldon model load -f ./models/sklearn-iris-gs.yaml
    {}

    Wait for the model to be ready

    kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
    model.mlops.seldon.io/iris condition met
    seldon model status iris -w ModelAvailable | jq -M .
    {}

    Do a REST inference call

    Do a gRPC inference call

    Unload the model

    Tensorflow Model

    We run a simple tensorflow model. Note the requirements section specifying tensorflow.

    Load the model.

    Wait for the model to be ready.

    Get model metadata

    Do a REST inference call.

    Do a gRPC inference call

    Unload the model

    Experiment

    We will use two SKlearn Iris classification models to illustrate an experiment.

    Load both models.

    Wait for both models to be ready.

    Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.

    Start the experiment.

    Wait for the experiment to be ready.

    Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.

    Run one more request

    Use the sticky session key passed by the last inference request to ensure the same route is taken each time. We will test REST and gRPC.

    gRPC

    Stop the experiment

    Show that the requests all now go to the original model.

    Unload both models.

    https://github.com/SeldonIO/seldon-core/blob/v2/k8s/samples/values-runtime-kafka-compression.yaml
    seldon model infer iris \
            '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    seldon model infer iris \
            --inference-mode grpc \
            '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'
    curl -v http://0.0.0.0:9000/v2/models/iris/infer \
            -H "Content-Type: application/json" \
            -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    grpcurl \
    	-d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
    	-plaintext \
    	-import-path apis \
    	-proto apis/mlops/v2_dataplane/v2_dataplane.proto \
    	0.0.0.0:9000 inference.GRPCInferenceService/ModelInfer
    
    curl -v http://0.0.0.0:9000/v2/models/iris/infer \
            -H "Content-Type: application/json" \
            -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' \
            -H "Seldon-Model: iris"
    
    grpcurl \
    	-d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
    	-plaintext \
    	-import-path apis \
    	-proto apis/mlops/v2_dataplane/v2_dataplane.proto \
    	-rpc-header seldon-model:iris \
    	0.0.0.0:9000 inference.GRPCInferenceService/ModelInfer
    
    import tritonclient.http as httpclient
    import numpy as np
    
    client = httpclient.InferenceServerClient(
        url="localhost:8080",
        verbose=False,
    )
    
    inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
    inputs[0].set_data_from_numpy(
        np.array([[1, 2, 3, 4]]).astype("float64"),
        binary_data=False,
    )
    
    result = client.infer(
        "iris",
        inputs,
        headers={"Seldon-Model": "iris"},
    )
    print("result is:", result.as_numpy("predict"))
    {
        "name": "iris",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "inputs": [
            {
                "name": "predict",
                "datatype": "FP32",
                "shape": [-1, 4]
            }
        ],
        "outputs": [
            {
                "name": "predict",
                "datatype": "INT64",
                "shape": [-1, 1]
            }
        ],
        "parameters": {
            "version": "1"
        }
    }
    seldon.<namespace>.<model|pipeline>.<name>.<inputs|outputs>
    seldon pipeline infer --show-headers --header X-foo=bar tfsimples \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    seldon pipeline inspect tfsimples --request-id carjjolvqj3j2pfbut10 --offset 10
    GET /v2/health/live HTTP/1.1
    Host: 
    Accept: */*
    
    GET /v2/health/ready HTTP/1.1
    Host: 
    Accept: */*
    
    POST /v2/models/{model_name}/versions/{model_version}/infer HTTP/1.1
    Host: 
    Content-Type: application/json
    Accept: */*
    Content-Length: 371
    
    {
      "id": "text",
      "parameters": {
        "content_type": "text",
        "headers": {},
        "ANY_ADDITIONAL_PROPERTY": "anything"
      },
      "inputs": [
        {
          "name": "text",
          "shape": [
            1
          ],
          "datatype": "text",
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          },
          "data": null
        }
      ],
      "outputs": [
        {
          "name": "text",
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          }
        }
      ]
    }
    POST /v2/models/{model_name}/infer HTTP/1.1
    Host: 
    Content-Type: application/json
    Accept: */*
    Content-Length: 371
    
    {
      "id": "text",
      "parameters": {
        "content_type": "text",
        "headers": {},
        "ANY_ADDITIONAL_PROPERTY": "anything"
      },
      "inputs": [
        {
          "name": "text",
          "shape": [
            1
          ],
          "datatype": "text",
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          },
          "data": null
        }
      ],
      "outputs": [
        {
          "name": "text",
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          }
        }
      ]
    }
    cat ./models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    
    config:
      kafkaConfig:
        producer:
          compression.type: gzip
    
    
    


    kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    kubectl port-forward svc/seldon-mesh -n seldon-mesh 8080:80
    https://github.com/SeldonIO/seldon-core/blob/v2/samples/pipelines/tfsimples.yaml
    import tritonclient.http as httpclient
    import numpy as np
    
    client = httpclient.InferenceServerClient(
        url="localhost:8080",
        verbose=False,
    )
    
    inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
    inputs[0].set_data_from_numpy(
        np.array([[1, 2, 3, 4]]).astype("float64"),
        binary_data=False,
    )
    
    result = client.infer("iris", inputs)
    print("result is:", result.as_numpy("predict"))
    curl https://seldon-1.example.com/v2/models/iris/infer ...
    
    seldon model infer --inference-host https://seldon-2.example.com/v2/models/iris/infer ...
    /inference.GRPCInferenceService/ModelInfer
    GET /v2/models/{model_name}/ready HTTP/1.1
    Host: 
    Accept: */*
    
    {
      "model_name": "text",
      "model_version": "text",
      "id": "text",
      "parameters": {
        "content_type": "text",
        "headers": {},
        "ANY_ADDITIONAL_PROPERTY": "anything"
      },
      "outputs": [
        {
          "name": "text",
          "shape": [
            1
          ],
          "datatype": "text",
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          },
          "data": null
        }
      ]
    }
    {
      "model_name": "text",
      "model_version": "text",
      "id": "text",
      "parameters": {
        "content_type": "text",
        "headers": {},
        "ANY_ADDITIONAL_PROPERTY": "anything"
      },
      "outputs": [
        {
          "name": "text",
          "shape": [
            1
          ],
          "datatype": "text",
          "parameters": {
            "content_type": "text",
            "headers": {},
            "ANY_ADDITIONAL_PROPERTY": "anything"
          },
          "data": null
        }
      ]
    }
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
    	--header 'Content-Type: application/json'  \
        --data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "983bd95f-4b4d-4ff1-95b2-df9d6d089164",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    seldon model infer iris \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "983bd95f-4b4d-4ff1-95b2-df9d6d089164",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    seldon model infer iris --inference-mode grpc \
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
    {
      "modelName": "iris_1",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "predict",
          "datatype": "INT64",
          "shape": [
            "1",
            "1"
          ],
          "parameters": {
            "content_type": {
              "stringParam": "np"
            }
          },
          "contents": {
            "int64Contents": [
              "2"
            ]
          }
        }
      ]
    }
    
    kubectl delete -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "iris" deleted
    seldon model unload iris
    cat ./models/tfsimple1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl apply -f ./models/tfsimple1.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/tfsimple1 created
    seldon model load -f ./models/tfsimple1.yaml
    {}
    kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}
    model.mlops.seldon.io/tfsimple1 condition met
    seldon model status tfsimple1 -w ModelAvailable | jq -M .
    {}
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/tfsimple1'
    {
    	"name": "tfsimple1_1",
    	"versions": [
    		"1"
    	],
    	"platform": "tensorflow_graphdef",
    	"inputs": [
    		{
    			"name": "INPUT0",
    			"datatype": "INT32",
    			"shape": [
    				-1,
    				16
    			]
    		},
    		{
    			"name": "INPUT1",
    			"datatype": "INT32",
    			"shape": [
    				-1,
    				16
    			]
    		}
    	],
    	"outputs": [
    		{
    			"name": "OUTPUT0",
    			"datatype": "INT32",
    			"shape": [
    				-1,
    				16
    			]
    		},
    		{
    			"name": "OUTPUT1",
    			"datatype": "INT32",
    			"shape": [
    				-1,
    				16
    			]
    		}
    	]
    }
    
    seldon model metadata tfsimple1
    {
    	"name": "tfsimple1_1",
    	"versions": [
    		"1"
    	],
    	"platform": "tensorflow_graphdef",
    	"inputs": [
    		{
    			"name": "INPUT0",
    			"datatype": "INT32",
    			"shape": [
    				-1,
    				16
    			]
    		},
    		{
    			"name": "INPUT1",
    			"datatype": "INT32",
    			"shape": [
    				-1,
    				16
    			]
    		}
    	],
    	"outputs": [
    		{
    			"name": "OUTPUT0",
    			"datatype": "INT32",
    			"shape": [
    				-1,
    				16
    			]
    		},
    		{
    			"name": "OUTPUT1",
    			"datatype": "INT32",
    			"shape": [
    				-1,
    				16
    			]
    		}
    	]
    }
    
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/tfsimple1/infer' \
    	--header 'Content-Type: application/json'  \
        --data '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "tfsimple1_1",
      "model_version": "1",
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            1,
            16
          ],
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ]
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            1,
            16
          ],
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ]
        }
      ]
    }
    
    seldon model infer tfsimple1 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "tfsimple1_1",
      "model_version": "1",
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            1,
            16
          ],
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ]
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            1,
            16
          ],
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ]
        }
      ]
    }
    
    seldon model infer tfsimple1 --inference-mode grpc \
        '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "modelName": "tfsimple1_1",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0,
              0
            ]
          }
        }
      ]
    }
    
    kubectl delete -f ./models/tfsimple1.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "tfsimple1" deleted
    seldon model unload tfsimple1
    cat ./models/sklearn1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
    
    cat ./models/sklearn2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris2
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
    
    kubectl apply -f ./models/sklearn1.yaml -n ${NAMESPACE}
    kubectl apply -f ./models/sklearn2.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/iris created
    model.mlops.seldon.io/iris2 created
    seldon model load -f ./models/sklearn1.yaml
    seldon model load -f ./models/sklearn2.yaml
    {}
    {}
    kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model iris2 -n ${NAMESPACE}
    model.mlops.seldon.io/iris condition met
    model.mlops.seldon.io/iris2 condition met
    seldon model status iris | jq -M .
    seldon model status iris2 | jq -M .
    {
      "modelName": "iris",
      "versions": [
        {
          "version": 1,
          "serverName": "mlserver",
          "kubernetesMeta": {},
          "modelReplicaState": {
            "0": {
              "state": "Available",
              "lastChangeTimestamp": "2023-06-29T14:01:41.362720538Z"
            }
          },
          "state": {
            "state": "ModelAvailable",
            "availableReplicas": 1,
            "lastChangeTimestamp": "2023-06-29T14:01:41.362720538Z"
          },
          "modelDefn": {
            "meta": {
              "name": "iris",
              "kubernetesMeta": {}
            },
            "modelSpec": {
              "uri": "gs://seldon-models/mlserver/iris",
              "requirements": [
                "sklearn"
              ]
            },
            "deploymentSpec": {
              "replicas": 1
            }
          }
        }
      ]
    }
    {
      "modelName": "iris2",
      "versions": [
        {
          "version": 1,
          "serverName": "mlserver",
          "kubernetesMeta": {},
          "modelReplicaState": {
            "0": {
              "state": "Available",
              "lastChangeTimestamp": "2023-06-29T14:01:41.362845079Z"
            }
          },
          "state": {
            "state": "ModelAvailable",
            "availableReplicas": 1,
            "lastChangeTimestamp": "2023-06-29T14:01:41.362845079Z"
          },
          "modelDefn": {
            "meta": {
              "name": "iris2",
              "kubernetesMeta": {}
            },
            "modelSpec": {
              "uri": "gs://seldon-models/mlserver/iris",
              "requirements": [
                "sklearn"
              ]
            },
            "deploymentSpec": {
              "replicas": 1
            }
          }
        }
      ]
    }
    
    cat ./experiments/ab-default-model.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-sample
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 50
      - name: iris2
        weight: 50
    
    seldon experiment start -f ./experiments/ab-default-model.yaml
    seldon experiment status experiment-sample -w | jq -M .
    {
      "experimentName": "experiment-sample",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon model infer iris -i 100 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::57 :iris_1::43]
    
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
    	  --header 'Content-Type: application/json'  \
        --data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    seldon model infer iris \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "fa425bdf-737c-41fe-894d-58868f70fe5d",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
    	  --header 'Content-Type: application/json'  \
        --data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    seldon model infer iris -s -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris --inference-mode grpc -s -i 50\
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'
    Success: map[:iris_1::50]
    
    seldon experiment stop experiment-sample
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
    	  --header 'Content-Type: application/json'  \
        --data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    seldon model infer iris -i 100 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::100]
    
    kubectl delete -f ./models/sklearn1.yaml -n ${NAMESPACE}
    kubectl delete -f ./models/sklearn2.yaml -n ${NAMESPACE}
    seldon model unload iris
    seldon model unload iris2
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimples
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
          inputs:
          - tfsimple1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple1.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple2
    
    png
    png
    png

    Conditional pipeline with pandas query model

    The model is defined as an MLServer custom runtime and allows the user to pass in a custom pandas query as a parameter defined at model creation to be used to filter the data passed to the model.

    Conditional Pipeline using PandasQuery

    Production image classifier with drift and outlier monitoring

    Run these examples from the samples/examples/image_classifier folder.

    CIFAR10 Image Classification Production Deployment

    We show an image classifier (CIFAR10) with associated outlier and drift detectors using a Pipeline.

    https://github.com/SeldonIO/seldon-core/blob/v2/scheduler/config/kafka-internal.json
    cat ./models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    
    kubectl apply -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/iris created
    kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
    model.mlops.seldon.io/iris condition met
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
    	--header 'Content-Type: application/json'  \
        --data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "09263298-ca66-49c5-acb9-0ca75b06f825",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    kubectl delete -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
    seldon model load -f ./models/sklearn-iris-gs.yaml
    {}
    seldon model status iris -w ModelAvailable | jq -M .
    {}
    seldon model infer iris \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "09263298-ca66-49c5-acb9-0ca75b06f825",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    seldon model unload iris
    {}
    
    import requests
    import json
    from typing import Dict, List
    import numpy as np
    import os
    import tensorflow as tf
    from alibi_detect.utils.perturbation import apply_mask
    from alibi_detect.datasets import fetch_cifar10c
    import matplotlib.pyplot as plt
    tf.keras.backend.clear_session()
    2023-03-09 19:43:43.637892: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
    2023-03-09 19:43:43.637906: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    
    train, test = tf.keras.datasets.cifar10.load_data()
    X_train, y_train = train
    X_test, y_test = test
    
    X_train = X_train.astype('float32') / 255
    X_test = X_test.astype('float32') / 255
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
    classes = (
        "plane",
        "car",
        "bird",
        "cat",
        "deer",
        "dog",
        "frog",
        "horse",
        "ship",
        "truck",
    )
    
    (50000, 32, 32, 3) (50000, 1) (10000, 32, 32, 3) (10000, 1)
    
    reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
    url = "http://0.0.0.0:9000/v2/models/model/infer"
    
    def infer(resourceName: str, idx: int):
        rows = X_train[idx:idx+1]
        show(rows[0])
        reqJson["inputs"][0]["data"] = rows.flatten().tolist()
        reqJson["inputs"][0]["shape"] = [1, 32, 32, 3]
        headers = {"Content-Type": "application/json", "seldon-model":resourceName}
        response_raw = requests.post(url, json=reqJson, headers=headers)
        probs = np.array(response_raw.json()["outputs"][0]["data"])
        print(classes[probs.argmax(axis=0)])
    
    
    def show(X):
        plt.imshow(X.reshape(32, 32, 3))
        plt.axis("off")
        plt.show()
    
    cat ./models/cifar10-no-config.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: cifar10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/tensorflow/cifar10"
      requirements:
      - tensorflow
    
    kubectl apply -f ./models/cifar10-no-config.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/cifar10 created
    kubectl wait --for condition=ready --timeout=300s model cifar10 -n ${NAMESPACE}
    model.mlops.seldon.io/cifar10 condition met
    seldon model load -f ./models/cifar10-no-config.yaml
    {}
    seldon model status cifar10 -w ModelAvailable | jq -M .
    {}
    infer("cifar10",4)
    car
    
    kubectl delete -f ./models/cifar10-no-config.yaml -n ${NAMESPACE}
    seldon model unload cifar10
    {}
    cat ./models/income-xgb.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-xgb
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/income-xgb"
      requirements:
      - xgboost
    
    kubectl apply -f ./models/income-xgb.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/income-xgb created
    kubectl wait --for condition=ready --timeout=300s model income-xgb -n ${NAMESPACE}
    model.mlops.seldon.io/income-xgb condition met
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/income-xgb/infer' \
    	--header 'Content-Type: application/json'  \
        --data '{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'
    {
    	"model_name": "income-xgb_1",
    	"model_version": "1",
    	"id": "e30c3b44-fa14-4e5f-88f5-d6f4d287da20",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "FP32",
    			"data": [
    				-1.8380107879638672
    			]
    		}
    	]
    }
    kubectl delete -f ./models/income-xgb.yaml -n ${NAMESPACE}
    seldon model load -f ./models/income-xgb.yaml
    {}
    seldon model status income-xgb -w ModelAvailable | jq -M .
    {}
    seldon model infer income-xgb \
      '{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'
    {
    	"model_name": "income-xgb_1",
    	"model_version": "1",
    	"id": "e30c3b44-fa14-4e5f-88f5-d6f4d287da20",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "FP32",
    			"data": [
    				-1.8380107879638672
    			]
    		}
    	]
    }
    
    seldon model unload income-xgb
    {}
    import matplotlib.pyplot as plt
    import json
    import requests
    from torchvision.datasets import MNIST
    from torchvision.transforms import ToTensor
    from torchvision import transforms
    from torch.utils.data import DataLoader
    import numpy as np
    training_data = MNIST(
        root=".",
        download=True,
        train=False,
        transform = transforms.Compose([
                  transforms.ToTensor()
              ])
    )
    
    reqJson = json.loads('{"inputs":[{"name":"Input3","data":[],"datatype":"FP32","shape":[]}]}')
    url = "http://0.0.0.0:9000/v2/models/model/infer"
    dl = DataLoader(training_data, batch_size=1, shuffle=False)
    dlIter = iter(dl)
    
    def infer_mnist():
        x, y = next(dlIter)
        data = x.cpu().numpy()
        reqJson["inputs"][0]["data"] = data.flatten().tolist()
        reqJson["inputs"][0]["shape"] = [1, 1, 28, 28]
        headers = {"Content-Type": "application/json", "seldon-model":"mnist-onnx"}
        response_raw = requests.post(url, json=reqJson, headers=headers)
        show_mnist(x)
        probs = np.array(response_raw.json()["outputs"][0]["data"])
        print(probs.argmax(axis=0))
    
    
    def show_mnist(X):
        plt.imshow(X.reshape(28, 28))
        plt.axis("off")
        plt.show()
    cat ./models/mnist-onnx.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mnist-onnx
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mnist-onnx"
      requirements:
      - onnx
    
    kubectl apply -f ./models/mnist-onnx.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/mnist-onnx created
    kubectl wait --for condition=ready --timeout=300s model mnist-onnx -n ${NAMESPACE}
    model.mlops.seldon.io/mnist-onnx condition met
    seldon model load -f ./models/mnist-onnx.yaml
    {}
    seldon model status mnist-onnx -w ModelAvailable | jq -M .
    {}
    infer_mnist()
    7
    
    kubectl delete -f ./models/mnist-onnx.yaml -n ${NAMESPACE}
    seldon model unload mnist-onnx
    {}
    cat ./models/income-lgb.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-lgb
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/income-lgb"
      requirements:
      - lightgbm
    
    kubectl apply -f ./models/income-lgb.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/income-lgb created
    kubectl wait --for condition=ready --timeout=300s model income-lgb -n ${NAMESPACE}
    model.mlops.seldon.io/income-lgb condition met
    curl --location 'http://${SELDON_INFER_HOST}/v2/models/income-lgb/infer' \
    	--header 'Content-Type: application/json'  \
        --data '{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'
    {
    	"model_name": "income-lgb_1",
    	"model_version": "1",
    	"id": "4437a71e-9af1-4e3b-aa4b-cb95d2cd86b9",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "FP64",
    			"data": [
    				0.06279460120044741
    			]
    		}
    	]
    }
    kubectl delete -f ./models/income-lgb.yaml -n ${NAMESPACE}
    seldon model load -f ./models/income-lgb.yaml
    {}
    seldon model status income-lgb -w ModelAvailable | jq -M .
    {}
    seldon model infer income-lgb \
      '{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'
    {
    	"model_name": "income-lgb_1",
    	"model_version": "1",
    	"id": "4437a71e-9af1-4e3b-aa4b-cb95d2cd86b9",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "FP64",
    			"data": [
    				0.06279460120044741
    			]
    		}
    	]
    }
    seldon model unload income-lgb
    {}
    cat ./models/wine-mlflow.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: wine
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/wine-mlflow"
      requirements:
      - mlflow
    
    kubectl apply -f ./models/wine-mlflow.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/wine created
    kubectl wait --for condition=ready --timeout=300s model wine -n ${NAMESPACE}
    model.mlops.seldon.io/wine condition met
    seldon model load -f ./models/wine-mlflow.yaml
    {}
    seldon model status wine -w ModelAvailable | jq -M .
    {}
    import requests
    url = "http://0.0.0.0:9000/v2/models/model/infer"
    inference_request = {
        "inputs": [
            {
              "name": "fixed acidity",
              "shape": [1],
              "datatype": "FP32",
              "data": [7.4],
            },
            {
              "name": "volatile acidity",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.7000],
            },
            {
              "name": "citric acid",
              "shape": [1],
              "datatype": "FP32",
              "data": [0],
            },
            {
              "name": "residual sugar",
              "shape": [1],
              "datatype": "FP32",
              "data": [1.9],
            },
            {
              "name": "chlorides",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.076],
            },
            {
              "name": "free sulfur dioxide",
              "shape": [1],
              "datatype": "FP32",
              "data": [11],
            },
            {
              "name": "total sulfur dioxide",
              "shape": [1],
              "datatype": "FP32",
              "data": [34],
            },
            {
              "name": "density",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.9978],
            },
            {
              "name": "pH",
              "shape": [1],
              "datatype": "FP32",
              "data": [3.51],
            },
            {
              "name": "sulphates",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.56],
            },
            {
              "name": "alcohol",
              "shape": [1],
              "datatype": "FP32",
              "data": [9.4],
            },
        ]
    }
    headers = {"Content-Type": "application/json", "seldon-model":"wine"}
    response_raw = requests.post(url, json=inference_request, headers=headers)
    print(response_raw.json())
    {'model_name': 'wine_1', 'model_version': '1', 'id': '0d7e44f8-b46c-4438-b8af-a749e6aa6039', 'parameters': {}, 'outputs': [{'name': 'output-1', 'shape': [1, 1], 'datatype': 'FP64', 'data': [5.576883936610762]}]}
    
    kubectl delete model wine -n ${NAMESPACE}
    seldon model unload wine
    {}
    
    import numpy as np
    import matplotlib.pyplot as plt
    import json
    import requests
    from torchvision.datasets import MNIST
    from torchvision.transforms import ToTensor
    from torchvision import transforms
    from torch.utils.data import DataLoader
    training_data = MNIST(
        root=".",
        download=True,
        train=False,
        transform = transforms.Compose([
                  transforms.ToTensor()
              ])
    )
    
    reqJson = json.loads('{"inputs":[{"name":"x__0","data":[],"datatype":"FP32","shape":[]}]}')
    url = "http://0.0.0.0:9000/v2/models/model/infer"
    dl = DataLoader(training_data, batch_size=1, shuffle=False)
    dlIter = iter(dl)
    
    def infer_mnist():
        x, y = next(dlIter)
        data = x.cpu().numpy()
        reqJson["inputs"][0]["data"] = data.flatten().tolist()
        reqJson["inputs"][0]["shape"] = [1, 1, 28, 28]
        headers = {"Content-Type": "application/json", "seldon-model":"mnist-pytorch"}
        response_raw = requests.post(url, json=reqJson, headers=headers)
        show_mnist(x)
        probs = np.array(response_raw.json()["outputs"][0]["data"])
        print(probs.argmax(axis=0))
    
    
    def show_mnist(X):
        plt.imshow(X.reshape(28, 28))
        plt.axis("off")
        plt.show()
    cat ./models/mnist-pytorch.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mnist-pytorch
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mnist-pytorch"
      requirements:
      - pytorch
    
    kubectl apply -f ./models/mnist-pytorch.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/mnist-pytorch created
    kubectl wait --for condition=ready --timeout=300s model mnist-pytorch -n ${NAMESPACE}
    model.mlops.seldon.io/mnist-pytorch condition met
    seldon model load -f ./models/mnist-pytorch.yaml
    {}
    
    seldon model status mnist-pytorch -w ModelAvailable | jq -M .
    {}
    
    infer_mnist()
    7
    
    kubectl delete -f ./models/mnist-pytorch.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "mnist-pytorch" deleted
    seldon model unload mnist-pytorch
    {}
    
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    from mlserver.codecs import PandasCodec
    from mlserver.errors import MLServerError
    import pandas as pd
    from fastapi import status
    from mlserver.logging import logger
    
    QUERY_KEY = "query"
    
    
    class ModelParametersMissing(MLServerError):
      def __init__(self, model_name: str, reason: str):
        super().__init__(
          f"Parameters missing for model {model_name} {reason}", status.HTTP_400_BAD_REQUEST
        )
    
    class PandasQueryRuntime(MLModel):
    
      async def load(self) -> bool:
        logger.info("Loading with settings %s", self.settings)
        if self.settings.parameters is None or \
          self.settings.parameters.extra is None:
          raise ModelParametersMissing(self.name, "no settings.parameters.extra found")
        self.query = self.settings.parameters.extra[QUERY_KEY]
        if self.query is None:
          raise ModelParametersMissing(self.name, "no settings.parameters.extra.query found")
        self.ready = True
    
        return self.ready
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        input_df: pd.DataFrame = PandasCodec.decode_request(payload)
        # run query on input_df and save in output_df
        output_df = input_df.query(self.query)
        if output_df.empty:
          output_df = pd.DataFrame({'status':["no rows satisfied " + self.query]})
        else:
          output_df["status"] = "row satisfied " + self.query
        return PandasCodec.encode_response(self.name, output_df, self.version)
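    As a quick sanity check of this runtime, you can send it a single-row, dataframe-encoded request directly (a minimal sketch; the choice column matches the query configured in the manifests below). The response encodes the matching rows back as a dataframe, with an added status column.

    seldon model infer choice-is-one \
      '{"parameters": {"content_type": "pd"}, "inputs": [{"name": "choice", "shape": [1, 1], "datatype": "INT64", "data": [1]}]}'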
    cat ../../models/choice1.yaml
    echo "---"
    cat ../../models/choice2.yaml
    echo "---"
    cat ../../models/add10.yaml
    echo "---"
    cat ../../models/mul10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: choice-is-one
    spec:
      storageUri: "gs://seldon-models/scv2/examples/pandasquery"
      requirements:
      - mlserver
      - python
      parameters:
      - name: query
        value: "choice == 1"
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: choice-is-two
    spec:
      storageUri: "gs://seldon-models/scv2/examples/pandasquery"
      requirements:
      - mlserver
      - python
      parameters:
      - name: query
        value: "choice == 2"
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: add10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
      requirements:
      - triton
      - python
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mul10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
      requirements:
      - triton
      - python
    
    kubectl apply -f ../../models/choice1.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/choice2.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/add10.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/mul10.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/choice-is-one created
    model.mlops.seldon.io/choice-is-two created
    model.mlops.seldon.io/add10 created
    model.mlops.seldon.io/mul10 created
    seldon model load -f ../../models/choice1.yaml
    seldon model load -f ../../models/choice2.yaml
    seldon model load -f ../../models/add10.yaml
    seldon model load -f ../../models/mul10.yaml
    {}
    {}
    {}
    {}
    kubectl wait --for condition=ready --timeout=300s model choice-is-one -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model choice-is-two -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model add10 -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model mul10 -n ${NAMESPACE}
    model.mlops.seldon.io/choice-is-one condition met
    model.mlops.seldon.io/choice-is-two condition met
    model.mlops.seldon.io/add10 condition met
    model.mlops.seldon.io/mul10 condition met
    seldon model status choice-is-one -w ModelAvailable
    seldon model status choice-is-two -w ModelAvailable
    seldon model status add10 -w ModelAvailable
    seldon model status mul10 -w ModelAvailable
    {}
    {}
    {}
    {}
    
    cat ../../pipelines/choice.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: choice
    spec:
      steps:
      - name: choice-is-one
      - name: mul10
        inputs:
        - choice.inputs.INPUT
        triggers:
        - choice-is-one.outputs.choice
      - name: choice-is-two
      - name: add10
        inputs:
        - choice.inputs.INPUT
        triggers:
        - choice-is-two.outputs.choice
      output:
        steps:
        - mul10
        - add10
        stepsJoin: any
    
    kubectl apply -f ../../pipelines/choice.yaml -n ${NAMESPACE}
    pipeline.mlops.seldon.io/choice created
    kubectl wait --for condition=ready --timeout=300s pipelines choice -n ${NAMESPACE}
    pipeline.mlops.seldon.io/choice condition met
    seldon pipeline load -f ../../pipelines/choice.yaml
    seldon pipeline status choice -w PipelineReady | jq -M .
    {
      "pipelineName": "choice",
      "versions": [
        {
          "pipeline": {
            "name": "choice",
            "uid": "cifel9aufmbc73e5intg",
            "version": 1,
            "steps": [
              {
                "name": "add10",
                "inputs": [
                  "choice.inputs.INPUT"
                ],
                "triggers": [
                  "choice-is-two.outputs.choice"
                ]
              },
              {
                "name": "choice-is-one"
              },
              {
                "name": "choice-is-two"
              },
              {
                "name": "mul10",
                "inputs": [
                  "choice.inputs.INPUT"
                ],
                "triggers": [
                  "choice-is-one.outputs.choice"
                ]
              }
            ],
            "output": {
              "steps": [
                "mul10.outputs",
                "add10.outputs"
              ],
              "stepsJoin": "ANY"
            },
            "kubernetesMeta": {}
          },
          "state": {
            "pipelineVersion": 1,
            "status": "PipelineReady",
            "reason": "created pipeline",
            "lastChangeTimestamp": "2023-06-30T14:45:57.284684328Z",
            "modelsReady": true
          }
        }
      ]
    }
    seldon pipeline infer choice --inference-mode grpc \
     '{"model_name":"choice","inputs":[{"name":"choice","contents":{"int_contents":[1]},"datatype":"INT32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[5,6,7,8]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              50,
              60,
              70,
              80
            ]
          }
        }
      ]
    }
    
    seldon pipeline infer choice --inference-mode grpc \
     '{"model_name":"choice","inputs":[{"name":"choice","contents":{"int_contents":[2]},"datatype":"INT32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[5,6,7,8]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              15,
              16,
              17,
              18
            ]
          }
        }
      ]
    }
    
    kubectl delete -f ../../models/choice1.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/choice2.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/add10.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/mul10.yaml -n ${NAMESPACE}
    kubectl delete -f ../../pipelines/choice.yaml -n ${NAMESPACE}
    seldon model unload choice-is-one
    seldon model unload choice-is-two
    seldon model unload add10
    seldon model unload mul10
    seldon pipeline unload choice
    {
        "topicPrefix": "seldon",
        "bootstrap.servers":"kafka:9093",
        "consumer":{
    	"session.timeout.ms":6000,
    	"auto.offset.reset":"earliest",
    	"topic.metadata.propagation.max.ms": 300000,
    	"message.max.bytes":1000000000
        },
        "producer":{
    	"linger.ms":0,
    	"message.max.bytes":1000000000
        },
        "streams":{
        }
    }
    
  • The model is a TensorFlow CIFAR10 image classifier.

  • The outlier detector is created from the CIFAR10 VAE Outlier example.

  • The drift detector is created from the CIFAR10 KS Drift example.

  • Model Training (optional for notebook)

    To run local training run the training notebook.

    Pipeline


    Use the Seldon CLI to look at the outputs from the CIFAR10 model. It will decode the Triton binary outputs for us.
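    For example:

    seldon pipeline inspect cifar10-production.cifar10.outputs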


    Local experiments

    Model Experiment

    We will use two SKlearn Iris classification models to illustrate experiments.

    Load both models.

    Wait for both models to be ready.

    Create an experiment that modifies the iris model to add a second model, splitting traffic 50/50 between the two.

    Start the experiment.

    Wait for the experiment to be ready.

    Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
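    For example, using the Seldon CLI (the same command appears in the transcript later in this section):

    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'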

    Show the sticky session header x-seldon-route that is returned.

    Use sticky session key passed by last infer request to ensure same route is taken each time.
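    The -s flag makes the CLI reuse the x-seldon-route value returned by the previous call, so repeated requests stick to the same candidate, for example:

    seldon model infer iris -s -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'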

    Stop the experiment

    Unload both models.

    Pipeline Experiment

    Use sticky session key passed by last infer request to ensure same route is taken each time.

    Model Mirror Experiment

    We will use two SKlearn Iris classification models to illustrate a model with a mirror.

    Load both models.

    Wait for both models to be ready.

    Create an experiment in which traffic sent to iris is also mirrored to iris2.

    Start the experiment.

    Wait for the experiment to be ready.

    We get responses from iris, but all requests would also have been mirrored to iris2.

    We can check the local Prometheus port from the agent to validate that requests went to iris2.

    Stop the experiment

    Unload both models.

    Pipeline Mirror Experiment

    Let's check that the mul10 model was called.

    Let's do an HTTP call and check the two models again.

    Huggingface speech to sentiment with explanations pipeline

    In this example we create a Pipeline to chain two Huggingface models to allow speech-to-sentiment functionality, and add an explainer to understand the result.

    This example also illustrates how explainers can target pipelines to allow complex explanations flows.

    architecture

    This example requires the ffmpeg package to be installed locally. Run make install-requirements for the Python dependencies.

    Create a method to load speech from the recorder, transform it into mp3, and send it as base64 data. On return of the result, extract and show the text and sentiment.

    Load Huggingface Models

    We will load two Huggingface models for speech to text and text to sentiment.

    Create Explain Pipeline

    To allow Alibi-Explain to more easily explain the sentiment we will need:

    • Input and output transforms that take the Dict values consumed and produced by the Huggingface sentiment model and turn them into values that Alibi-Explain can easily understand: the core values we want to explain and the outputs from the sentiment model.

    • A separate Pipeline to allow us to join the sentiment model with the output transform

    These transform models are MLServer custom runtimes as shown below:
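    The exact transform code is not reproduced here; the following is a minimal, hypothetical sketch of such an output transform, assuming the sentiment model returns JSON-encoded dicts such as {"label": "POSITIVE", "score": 0.99}:

    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    from mlserver.codecs import StringCodec
    import json


    class SentimentOutputTransform(MLModel):
        async def load(self) -> bool:
            self.ready = True
            return self.ready

        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            # Decode the JSON strings produced by the sentiment model
            raw = StringCodec.decode_input(payload.inputs[0])
            # Keep only the label so Alibi-Explain sees a simple categorical output
            labels = [json.loads(r)["label"] for r in raw]
            return InferenceResponse(
                model_name=self.name,
                outputs=[StringCodec.encode_output("sentiment", labels)],
            )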

    Speech to Sentiment Pipeline with Explanation

    We can now create the final pipeline that will take speech and generate sentiment along with an explanation of why that sentiment was predicted.

    Test

    We will wait for the explanation which is run asynchronously to the functional output from the Pipeline above.

    Cleanup

    Example: Serving models on dedicated GPU nodes

    This example illustrates how to use taints, tolerations with nodeAffinity or nodeSelector to assign GPU nodes to specific models.

    Note: Configuration options depend on your cluster setup and the desired outcome. The Seldon CRDs for Seldon Core 2 Pods offer complete customization of Pod specifications, allowing you to apply additional Kubernetes customizations as needed.

    To serve a model on a dedicated GPU node, you should follow these steps:

    1. Configuring the node

    Configuring the GPU node

    Note: To dedicate a set of nodes to run only a specific group of inference servers, you must first provision an additional set of nodes within the Kubernetes cluster for the remaining Seldon Core 2 components. For more information about adding labels and taints to the GPU nodes in your Kubernetes cluster, refer to the respective cloud provider documentation.

    You can add the taint when you are creating the node or after the node has been provisioned. You can apply the same taint to multiple nodes, not just a single node. A common approach is to define the taint at the node pool level.

    When you apply a NoSchedule taint to a node after it is created, existing Pods that do not have a matching toleration may remain on the node without being evicted. To ensure that such Pods are removed, you can use the NoExecute taint effect instead.

    In this example, the node includes several labels that are used later for node affinity settings. You may choose to specify some labels, while others are usually added by the cloud provider or a GPU operator installed in the cluster.
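    For example, assuming a node named gpu-node-1 and illustrative taint and label keys:

    kubectl taint nodes gpu-node-1 seldon-gpu-srv=true:NoSchedule
    kubectl label nodes gpu-node-1 pool=infer-gpu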

    Configure inference servers

    To ensure a specific inference server Pod runs only on the nodes you've configured, you can use nodeSelector or nodeAffinity together with a toleration by modifying one of the following:

    • Seldon Server custom resource: Apply changes to each individual inference server.

    • ServerConfig custom resource: Apply settings across multiple inference servers at once.

    Configuring Seldon Server custom resource

    While nodeSelector requires an exact match of node labels for server Pods to select a node, nodeAffinity offers more fine-grained control. It enables a conditional approach by using logical operators in the node selection process. For more information, see .

    In this example, a nodeSelector and a toleration are set for the Seldon Server custom resource.
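    A minimal sketch of the relevant pod-level fields, reusing the hypothetical label and taint keys from above (place them wherever your Server custom resource exposes the pod template):

    nodeSelector:
      pool: infer-gpu
    tolerations:
    - key: seldon-gpu-srv
      operator: Equal
      value: "true"
      effect: NoSchedule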

    In this example, a nodeAffinity and a toleration are set for the Seldon Server custom resource.

    You can configure more advanced Pod selection using nodeAffinity, as in this example:
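    For instance, a sketch that requires the hypothetical pool label and prefers nodes that also expose a GPU product label:

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: pool
              operator: In
              values:
              - infer-gpu
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: nvidia.com/gpu.product
              operator: Exists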

    Configuring ServerConfig custom resource

    This configuration automatically affects all servers using that ServerConfig, unless you specify server-specific overrides, which take precedence.

    Configuring models

    When you have a set of inference servers running exclusively on GPU nodes, you can assign a model to one of those servers in two ways:

    • Custom model requirements (recommended)

    • Explicit server pinning

    Here's the distinction between the two methods of assigning models to servers:

    When you specify a requirement matching a server capability in the Model custom resource, the model is loaded on any inference server with a capability matching that requirement.

    Ensure that the additional capability that matches the requirement label is added to the Server custom resource.

    Instead of adding a capability using extraCapabilities on a Server custom resource, you may also add to the list of capabilities in the associated ServerConfig custom resource. This applies to all servers referencing the configuration.
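    As a sketch of the requirements-based approach, where the capability name gpu and the resource names below are illustrative:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-gpu
    spec:
      serverConfig: mlserver
      extraCapabilities:
      - gpu
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: my-gpu-model
    spec:
      storageUri: "gs://my-bucket/my-gpu-model"  # placeholder location
      requirements:
      - gpu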

    With these specifications, the model is loaded on replicas of inference servers created by the referenced Server custom resource.

    Kubernetes examples

    Note: The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended. For more details, see .

    Before you begin

    Batch

    Requires mlserver to be installed.

    Deprecated: The MLServer CLI infer feature is experimental and will be removed in future work.

    Note: The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended. For more details, see

    kubectl apply -f ../../models/cifar10.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/cifar10-outlier-detect.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/cifar10-drift-detect.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/cifar10 created
    model.mlops.seldon.io/cifar10-outlier created
    model.mlops.seldon.io/cifar10-drift created
    kubectl wait --for condition=ready --timeout=300s model cifar10 -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model cifar10-outlier -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model cifar10-drift -n ${NAMESPACE}
    model.mlops.seldon.io/cifar10 condition met
    model.mlops.seldon.io/cifar10-outlier condition met
    model.mlops.seldon.io/cifar10-drift condition met
    seldon model load -f ../../models/cifar10.yaml
    seldon model load -f ../../models/cifar10-outlier-detect.yaml
    seldon model load -f ../../models/cifar10-drift-detect.yaml
    {}
    {}
    {}
    
    seldon model status cifar10 -w ModelAvailable | jq .
    seldon model status cifar10-outlier -w ModelAvailable | jq .
    seldon model status cifar10-drift -w ModelAvailable | jq .
    {}
    {}
    {}
    
    kubectl apply -f ../../pipelines/cifar10.yaml  -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s pipelines cifar10-production -n ${NAMESPACE}
    pipeline.mlops.seldon.io/cifar10-production condition met
    seldon pipeline load -f ../../pipelines/cifar10.yaml
    seldon pipeline status cifar10-production -w PipelineReady | jq -M .
    {
      "pipelineName": "cifar10-production",
      "versions": [
        {
          "pipeline": {
            "name": "cifar10-production",
            "uid": "cifeii2ufmbc73e5insg",
            "version": 1,
            "steps": [
              {
                "name": "cifar10"
              },
              {
                "name": "cifar10-drift",
                "batch": {
                  "size": 20
                }
              },
              {
                "name": "cifar10-outlier"
              }
            ],
            "output": {
              "steps": [
                "cifar10.outputs",
                "cifar10-outlier.outputs.is_outlier"
              ]
            },
            "kubernetesMeta": {}
          },
          "state": {
            "pipelineVersion": 1,
            "status": "PipelineReady",
            "reason": "created pipeline",
            "lastChangeTimestamp": "2023-06-30T14:40:09.047429817Z",
            "modelsReady": true
          }
        }
      ]
    }
    
    kubectl delete -f ../../pipelines/cifar10.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/cifar10.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/cifar10-outlier-detect.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/cifar10-drift-detect.yaml -n ${NAMESPACE}
    seldon pipeline unload cifar10-production
    seldon model unload cifar10
    seldon model unload cifar10-outlier
    seldon model unload cifar10-drift
    import requests
    import json
    from typing import Dict, List
    import numpy as np
    import os
    import tensorflow as tf
    from alibi_detect.utils.perturbation import apply_mask
    from alibi_detect.datasets import fetch_cifar10c
    import matplotlib.pyplot as plt
    tf.keras.backend.clear_session()
    2023-06-30 15:39:28.732453: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
    2023-06-30 15:39:28.732465: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    
    train, test = tf.keras.datasets.cifar10.load_data()
    X_train, y_train = train
    X_test, y_test = test
    
    X_train = X_train.astype('float32') / 255
    X_test = X_test.astype('float32') / 255
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
    classes = (
        "plane",
        "car",
        "bird",
        "cat",
        "deer",
        "dog",
        "frog",
        "horse",
        "ship",
        "truck",
    )
    
    (50000, 32, 32, 3) (50000, 1) (10000, 32, 32, 3) (10000, 1)
    
    outliers = []
    for idx in range(0,X_train.shape[0]):
        X_mask, mask = apply_mask(X_train[idx].reshape(1, 32, 32, 3),
                                      mask_size=(14,14),
                                      n_masks=1,
                                      channels=[0,1,2],
                                      mask_type='normal',
                                      noise_distr=(0,1),
                                      clip_rng=(0,1))
        outliers.append(X_mask)
    X_outliers = np.vstack(outliers)
    X_outliers.shape
    (50000, 32, 32, 3)
    
    corruption = ['brightness']
    X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
    X_corr = X_corr.astype('float32') / 255
    reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
    url = "http://0.0.0.0:9000/v2/models/model/infer"
    def infer(resourceName: str, batchSz: int, requestType: str):
        if requestType == "outlier":
            rows = X_outliers[0:0+batchSz]
        elif requestType == "drift":
            rows = X_corr[0:0+batchSz]
        else:
            rows = X_train[0:0+batchSz]
        for i in range(batchSz):
            show(rows[i])
        reqJson["inputs"][0]["data"] = rows.flatten().tolist()
        reqJson["inputs"][0]["shape"] = [batchSz, 32, 32, 3]
        headers = {"Content-Type": "application/json", "seldon-model":resourceName}
        response_raw = requests.post(url, json=reqJson, headers=headers)
        print(response_raw)
        print(response_raw.json())
    
    
    def show(X):
        plt.imshow(X.reshape(32, 32, 3))
        plt.axis("off")
        plt.show()
    
    cat ../../models/cifar10.yaml
    echo "---"
    cat ../../models/cifar10-outlier-detect.yaml
    echo "---"
    cat ../../models/cifar10-drift-detect.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: cifar10
    spec:
      storageUri: "gs://seldon-models/triton/tf_cifar10"
      requirements:
      - tensorflow
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: cifar10-outlier
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/outlier-detector"
      requirements:
        - mlserver
        - alibi-detect
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: cifar10-drift
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/drift-detector"
      requirements:
        - mlserver
        - alibi-detect
    
    cat ../../pipelines/cifar10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: cifar10-production
    spec:
      steps:
        - name: cifar10
        - name: cifar10-outlier
        - name: cifar10-drift
          batch:
            size: 20
      output:
        steps:
        - cifar10
        - cifar10-outlier.outputs.is_outlier
    
    infer("cifar10-production.pipeline",20, "normal")
    <Response [200]>
    {'model_name': '', 'outputs': [{'data': [1.45001495e-08, 1.2525752e-09, 1.6298458e-07, 0.11529388, 1.7431412e-07, 6.1856604e-06, 0.8846994, 6.0739285e-09, 7.437921e-08, 4.7317337e-09, 1.26449e-06, 4.8814868e-09, 1.5153439e-09, 8.490656e-09, 5.5131194e-10, 1.1617216e-09, 5.7729294e-10, 2.8839776e-07, 0.0006149016, 0.99938357, 0.888746, 2.5331951e-06, 0.00012967695, 0.10531583, 2.4284174e-05, 6.3332986e-06, 0.0016261435, 1.13079e-05, 0.0013286703, 0.0028091935, 2.0993439e-06, 3.680449e-08, 0.0013269952, 2.1766558e-05, 0.99841356, 0.00015300694, 6.9472035e-06, 1.3277059e-05, 6.1860555e-05, 3.4072806e-07, 1.1205097e-05, 0.99997175, 1.9948227e-07, 6.9880834e-08, 3.3387135e-08, 5.2603138e-08, 3.0352305e-07, 4.3738982e-08, 5.3243946e-07, 1.5870584e-05, 0.0006525102, 0.013322109, 1.480307e-06, 0.9766325, 4.9847167e-05, 0.00058075984, 0.008405659, 5.2234273e-06, 0.00023390084, 0.000116047224, 1.6682397e-06, 5.7737526e-10, 0.9975605, 6.45564e-05, 0.002371972, 1.0392675e-07, 9.747962e-08, 1.4484569e-07, 8.762438e-07, 2.4758325e-08, 5.028761e-09, 6.856381e-11, 5.9932094e-12, 4.921233e-10, 1.471166e-07, 2.7940719e-06, 3.4563383e-09, 0.99999714, 5.9420524e-10, 9.445026e-11, 4.1854888e-05, 5.041549e-08, 8.0302314e-08, 1.2119854e-07, 6.781646e-09, 1.2616152e-08, 1.1878505e-08, 1.628573e-09, 0.9999578, 3.281738e-08, 0.08930307, 1.4065135e-07, 4.1117343e-07, 0.90898305, 8.933351e-07, 0.0015637449, 0.00013868928, 9.092981e-06, 4.8759745e-07, 4.3976044e-07, 0.00016094849, 3.5653954e-07, 0.0760521, 0.8927447, 0.0011777573, 0.00265573, 0.027189083, 4.1892267e-06, 1.329405e-05, 1.8564688e-06, 1.3373891e-06, 1.0251247e-07, 8.651912e-09, 4.458202e-06, 1.4646349e-05, 1.260957e-06, 1.046087e-08, 0.9998946, 8.332438e-05, 3.900894e-07, 6.53852e-05, 3.012202e-08, 1.0247197e-07, 1.8824371e-06, 0.0004958526, 3.533475e-05, 2.739997e-07, 0.99939275, 4.840305e-06, 3.5346695e-06, 0.0005518078, 3.1597017e-07, 0.99902296, 0.00031509742, 8.07886e-07, 1.6366084e-06, 2.795575e-06, 6.112367e-06, 9.817249e-05, 2.602709e-07, 0.0004561966, 5.360607e-06, 2.8656412e-05, 0.000116040654, 6.881144e-05, 8.844774e-06, 4.4655946e-05, 3.5564542e-05, 0.006564381, 0.9926715, 0.007300911, 1.766928e-06, 3.0520596e-07, 0.026906287, 1.3769699e-06, 0.00027539674, 5.583593e-06, 3.792553e-06, 0.0003876767, 0.9651169, 0.18114138, 2.8360228e-05, 0.00019927241, 0.007685872, 0.00014663498, 3.9361137e-05, 5.941682e-05, 7.36174e-05, 0.79936546, 0.01126067, 2.3992783e-11, 7.6336457e-16, 1.4644799e-15, 1, 2.4652159e-14, 1.1786078e-10, 1.9402116e-13, 4.2408636e-15, 1.209294e-15, 2.9042784e-15, 1.5366902e-08, 1.2476195e-09, 1.3560152e-07, 0.999997, 4.3113017e-11, 2.8163534e-08, 2.4494727e-06, 1.3122828e-10, 3.8081083e-07, 2.1628158e-11, 0.0004926238, 6.9424555e-06, 2.827196e-05, 0.92534137, 9.500486e-06, 0.00036133997, 0.072713904, 1.2831057e-07, 0.0010457055, 2.8514464e-07], 'name': 'fc10', 'shape': [20, 10], 'datatype': 'FP32'}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
    
    seldon pipeline inspect cifar10-production.cifar10-drift.outputs.is_drift
    seldon.default.model.cifar10-drift.outputs	cifeij8fh5ss738i5bp0	{"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
    
    infer("cifar10-production.pipeline",20, "drift")
    <Response [200]>
    {'model_name': '', 'outputs': [{'data': [8.080701e-09, 2.3025173e-12, 2.2681688e-09, 1, 4.1828953e-11, 4.48467e-09, 3.216822e-08, 2.8404365e-13, 5.217064e-09, 3.3497323e-13, 0.96965235, 4.7030144e-06, 1.6964266e-07, 1.7355454e-05, 2.6667e-06, 1.9505828e-06, 1.1363079e-07, 3.3352034e-08, 0.030320557, 1.7086056e-07, 0.03725602, 6.8623276e-06, 7.5557014e-05, 0.00018132397, 2.2838503e-05, 0.000110639296, 2.3732607e-06, 2.1210687e-06, 0.9623351, 7.131072e-06, 0.999079, 4.207448e-09, 1.5788535e-08, 2.723756e-08, 2.6555508e-11, 2.1526697e-10, 2.7599315e-10, 2.0737433e-10, 0.0009210062, 3.0885383e-09, 6.665241e-07, 1.7765576e-09, 1.4911559e-07, 0.9765331, 1.9476123e-07, 2.8244015e-06, 0.023463126, 5.8030287e-09, 3.243206e-09, 1.12179785e-08, 4.4123663e-06, 4.7628927e-09, 1.1727273e-08, 0.9761534, 1.1409252e-08, 8.922882e-05, 0.023752932, 3.1563903e-08, 2.7916305e-09, 8.7746266e-10, 1.0166265e-05, 0.999703, 4.5408615e-05, 0.00022673907, 1.7365853e-07, 1.0147362e-06, 6.253448e-06, 2.9711526e-07, 7.811687e-07, 6.183683e-06, 0.86618125, 5.47548e-07, 0.00038408802, 0.013155022, 3.6916779e-06, 0.0006137024, 0.11965008, 3.6425424e-06, 6.7638084e-06, 1.2372367e-06, 1.9545263e-05, 1.1281859e-13, 1.6811868e-14, 0.9999777, 1.9805435e-11, 2.7563674e-06, 2.9651657e-09, 1.1363432e-12, 2.9902746e-13, 1.220973e-12, 2.9895918e-05, 3.4964305e-07, 1.1331837e-08, 1.7012125e-06, 3.6088227e-07, 3.035954e-08, 2.2102333e-06, 1.7414077e-08, 0.9999455, 1.9921794e-05, 0.9999999, 5.3446598e-11, 6.3188843e-10, 1.0956511e-07, 1.1538642e-10, 8.113561e-10, 4.7179572e-08, 1.4544753e-11, 5.490219e-08, 1.3347151e-10, 1.5363307e-07, 6.604881e-09, 2.424105e-10, 9.963063e-09, 3.9349533e-09, 1.5709017e-09, 7.705774e-10, 4.8085802e-08, 1.8885139e-05, 0.9999809, 7.147243e-08, 3.143131e-13, 2.1447092e-13, 0.00042652222, 6.945973e-12, 0.9995734, 6.174434e-09, 4.1128205e-11, 3.4031404e-13, 8.573159e-15, 1.2226405e-09, 2.3768018e-10, 2.822187e-07, 8.016278e-08, 4.0692296e-08, 6.8023346e-06, 2.3926754e-07, 0.9999925, 6.652648e-09, 7.743497e-09, 7.6360675e-06, 5.9386625e-09, 1.5675019e-09, 2.136716e-07, 1.3074002e-06, 3.700079e-10, 1.0984521e-09, 6.2138824e-08, 0.9609078, 0.03908287, 0.0008332255, 7.696685e-08, 2.4428939e-09, 7.186676e-05, 1.4520063e-09, 1.4521317e-08, 1.09093e-06, 1.2531165e-10, 0.9990938, 5.798501e-09, 5.785368e-05, 3.82365e-09, 7.404351e-08, 0.008338481, 8.048078e-10, 0.99157715, 1.1663455e-05, 1.4583546e-05, 8.3543476e-08, 3.274394e-08, 2.4682688e-05, 1.3951502e-09, 1.0260489e-08, 0.9998845, 1.9418138e-08, 8.667954e-07, 2.1851054e-07, 8.917964e-05, 4.4437223e-07, 1.1292918e-07, 4.5302792e-07, 5.631744e-08, 2.9086214e-08, 3.1013877e-07, 7.695681e-09, 2.1452344e-09, 1.1493902e-08, 6.1980093e-10, 0.99999917, 1.1436694e-08, 2.42685e-05, 8.557389e-08, 0.024081504, 0.0073837163, 4.8152968e-05, 5.128531e-07, 0.9684405, 9.630179e-08, 2.1060101e-05, 1.901065e-07], 'name': 'fc10', 'shape': [20, 10], 'datatype': 'FP32'}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
    
    seldon pipeline inspect cifar10-production.cifar10-drift.outputs.is_drift
    seldon.default.model.cifar10-drift.outputs	cifeimgfh5ss738i5bpg	{"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["1"]}}
    
    infer("cifar10-production.pipeline",1, "outlier")
    <Response [200]>
    {'model_name': '', 'outputs': [{'data': [6.3606867e-06, 0.0006106364, 0.0054279356, 0.6536454, 1.4738829e-05, 2.6104701e-06, 0.3397848, 1.3538776e-05, 0.0004458526, 4.807229e-05], 'name': 'fc10', 'shape': [1, 10], 'datatype': 'FP32'}, {'data': [1], 'name': 'is_outlier', 'shape': [1, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
    
    infer("cifar10-production.pipeline",1, "ok")
    <Response [200]>
    {'model_name': '', 'outputs': [{'data': [1.45001495e-08, 1.2525752e-09, 1.6298458e-07, 0.11529388, 1.7431412e-07, 6.1856604e-06, 0.8846994, 6.0739285e-09, 7.43792e-08, 4.7317337e-09], 'name': 'fc10', 'shape': [1, 10], 'datatype': 'FP32'}, {'data': [0], 'name': 'is_outlier', 'shape': [1, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
    
    seldon pipeline inspect cifar10-production.cifar10.outputs
    seldon.default.model.cifar10.outputs	cifeiq8fh5ss738i5bqg	{"modelName":"cifar10_1", "modelVersion":"1", "outputs":[{"name":"fc10", "datatype":"FP32", "shape":["1", "10"], "contents":{"fp32Contents":[1.45001495e-8, 1.2525752e-9, 1.6298458e-7, 0.11529388, 1.7431412e-7, 0.0000061856604, 0.8846994, 6.0739285e-9, 7.43792e-8, 4.7317337e-9]}}]}
    
    cat ./models/sklearn1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
    
    cat ./models/sklearn2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris2
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
    
    seldon model load -f ./models/sklearn1.yaml
    seldon model load -f ./models/sklearn2.yaml
    {}
    {}
    
    seldon model status iris -w ModelAvailable
    seldon model status iris2 -w ModelAvailable
    {}
    {}
    
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris2 -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::50]
    
    cat ./experiments/ab-default-model.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-sample
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 50
      - name: iris2
        weight: 50
    
    seldon experiment start -f ./experiments/ab-default-model.yaml
    seldon experiment status experiment-sample -w | jq -M .
    {
      "experimentName": "experiment-sample",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    from ipywebrtc import AudioRecorder, CameraStream
    import torchaudio
    from IPython.display import Audio
    import base64
    import json
    import requests
    import os
    import time
    reqJson = json.loads('{"inputs":[{"name":"args", "parameters": {"content_type": "base64"}, "data":[],"datatype":"BYTES","shape":[1]}]}')
    url = "http://0.0.0.0:9000/v2/models/model/infer"
    def infer(resource: str):
        with open('recording.webm', 'wb') as f:
            f.write(recorder.audio.value)
        !ffmpeg -i recording.webm -vn -ab 128k -ar 44100 file.mp3 -y -hide_banner -loglevel panic
        with open("file.mp3", mode='rb') as file:
            fileContent = file.read()
            encoded = base64.b64encode(fileContent)
            base64_message = encoded.decode('utf-8')
        reqJson["inputs"][0]["data"] = [str(base64_message)]
        headers = {"Content-Type": "application/json", "seldon-model": resource}
        response_raw = requests.post(url, json=reqJson, headers=headers)
        j = response_raw.json()
        sentiment = j["outputs"][0]["data"][0]
        text = j["outputs"][1]["data"][0]
        reqId = response_raw.headers["x-request-id"]
        print(reqId)
        os.environ["REQUEST_ID"]=reqId
        print(base64.b64decode(text))
        print(base64.b64decode(sentiment))

    Custom model requirements

    If the assigned server cannot load the model due to insufficient resources, another similarly-capable server can be selected to load the model.

    Explicit pinning

    If the specified server lacks sufficient memory or resources, the model load fails without trying another server.

    Configuring inference servers
    Configuring models
    Affinity and anti-affinity

  • Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.

  • Ensure that you are performing these steps in the directory where you have downloaded the samples.

  • Get the IP address of the Seldon Core 2 instance running with Istio:

  • Make a note of the IP address that is displayed in the output. Replace <INGRESS_IP> with your service mesh's ingress IP address in the following commands.

    Create a Model

    Output is similar to:

    Make a gRPC inference call

    Delete the model

    Experiment

    Pipeline - model chain

    Pipeline - model join

    Explainer

    Seldon CLI

    Seldon V2 Batch Examples

    Deploy Models and Pipelines

    Test Predictions

    MLServer Iris Batch Job

    Triton TFSimple Batch Job

    Cleanup

    Seldon CLI
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::27 :iris_1::23]
    
    seldon model infer iris --show-headers \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    > POST /v2/models/iris/infer HTTP/1.1
    > Host: localhost:9000
    > Content-Type:[application/json]
    > Seldon-Model:[iris]
    
    < X-Seldon-Route:[:iris_1:]
    < Ce-Id:[463e96ad-645f-4442-8890-4c340b58820b]
    < Traceparent:[00-fe9e87fcbe4be98ed82fb76166e15ceb-d35e7ac96bd8b718-01]
    < X-Envoy-Upstream-Service-Time:[3]
    < Ce-Specversion:[0.3]
    < Date:[Thu, 29 Jun 2023 14:03:03 GMT]
    < Ce-Source:[io.seldon.serving.deployment.mlserver]
    < Content-Type:[application/json]
    < Server:[envoy]
    < X-Request-Id:[cieou5ofh5ss73fbjdu0]
    < Ce-Endpoint:[iris_1]
    < Ce-Modelid:[iris_1]
    < Ce-Type:[io.seldon.serving.inference.response]
    < Content-Length:[213]
    < Ce-Inferenceservicename:[mlserver]
    < Ce-Requestid:[463e96ad-645f-4442-8890-4c340b58820b]
    
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "463e96ad-645f-4442-8890-4c340b58820b",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    seldon model infer iris -s -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
    seldon model infer iris --inference-mode grpc -s -i 50\
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'
    Success: map[:iris_1::50]
    
    seldon experiment stop experiment-sample
    seldon model unload iris
    seldon model unload iris2
    cat ./models/add10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: add10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
      requirements:
      - triton
      - python
    
    cat ./models/mul10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mul10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
      requirements:
      - triton
      - python
    
    seldon model load -f ./models/add10.yaml
    seldon model load -f ./models/mul10.yaml
    {}
    {}
    
    seldon model status add10 -w ModelAvailable
    seldon model status mul10 -w ModelAvailable
    {}
    {}
    
    cat ./pipelines/mul10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: pipeline-mul10
    spec:
      steps:
        - name: mul10
      output:
        steps:
        - mul10
    
    cat ./pipelines/add10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: pipeline-add10
    spec:
      steps:
        - name: add10
      output:
        steps:
        - add10
    
    seldon pipeline load -f ./pipelines/add10.yaml
    seldon pipeline load -f ./pipelines/mul10.yaml
    seldon pipeline status pipeline-add10 -w PipelineReady
    seldon pipeline status pipeline-mul10 -w PipelineReady
    {"pipelineName":"pipeline-add10", "versions":[{"pipeline":{"name":"pipeline-add10", "uid":"cieov47l80lc739juklg", "version":1, "steps":[{"name":"add10"}], "output":{"steps":["add10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:05:04.460868091Z", "modelsReady":true}}]}
    {"pipelineName":"pipeline-mul10", "versions":[{"pipeline":{"name":"pipeline-mul10", "uid":"cieov47l80lc739jukm0", "version":1, "steps":[{"name":"mul10"}], "output":{"steps":["mul10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:05:04.631980330Z", "modelsReady":true}}]}
    
    seldon pipeline infer pipeline-add10 --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              11,
              12,
              13,
              14
            ]
          }
        }
      ]
    }
    
    seldon pipeline infer pipeline-mul10 --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              10,
              20,
              30,
              40
            ]
          }
        }
      ]
    }
    
    cat ./experiments/addmul10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: addmul10
    spec:
      default: pipeline-add10
      resourceType: pipeline
      candidates:
      - name: pipeline-add10
        weight: 50
      - name: pipeline-mul10
        weight: 50
    
    seldon experiment start -f ./experiments/addmul10.yaml
    seldon experiment status addmul10 -w | jq -M .
    {
      "experimentName": "addmul10",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon pipeline infer pipeline-add10 -i 50 --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    Success: map[:add10_1::28 :mul10_1::22 :pipeline-add10.pipeline::28 :pipeline-mul10.pipeline::22]
    
    seldon pipeline infer pipeline-add10 --show-headers --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    > /inference.GRPCInferenceService/ModelInfer HTTP/2
    > Host: localhost:9000
    > seldon-model:[pipeline-add10.pipeline]
    
    < x-envoy-expected-rq-timeout-ms:[60000]
    < x-request-id:[cieov8ofh5ss739277i0]
    < date:[Thu, 29 Jun 2023 14:05:23 GMT]
    < server:[envoy]
    < content-type:[application/grpc]
    < x-envoy-upstream-service-time:[6]
    < x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
    < x-forwarded-proto:[http]
    
    {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
    
    seldon pipeline infer pipeline-add10 -s --show-headers --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    > /inference.GRPCInferenceService/ModelInfer HTTP/2
    > Host: localhost:9000
    > x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
    > seldon-model:[pipeline-add10.pipeline]
    
    < content-type:[application/grpc]
    < x-forwarded-proto:[http]
    < x-envoy-expected-rq-timeout-ms:[60000]
    < x-seldon-route:[:add10_1: :pipeline-add10.pipeline: :pipeline-add10.pipeline:]
    < x-request-id:[cieov90fh5ss739277ig]
    < x-envoy-upstream-service-time:[7]
    < date:[Thu, 29 Jun 2023 14:05:24 GMT]
    < server:[envoy]
    
    {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
    
    seldon pipeline infer pipeline-add10 -s -i 50 --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    Success: map[:add10_1::50 :pipeline-add10.pipeline::150]
    
    cat ./models/add20.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: add20
    spec:
      storageUri: "gs://seldon-models/triton/add20"
      requirements:
      - triton
      - python
    
    seldon model load -f ./models/add20.yaml
    {}
    
    seldon model status add20 -w ModelAvailable
    {}
    
    cat ./experiments/add1020.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: add1020
    spec:
      default: add10
      candidates:
      - name: add10
        weight: 50
      - name: add20
        weight: 50
    
    seldon experiment start -f ./experiments/add1020.yaml
    seldon experiment status add1020 -w | jq -M .
    {
      "experimentName": "add1020",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon model infer add10 -i 50  --inference-mode grpc \
      '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    Success: map[:add10_1::22 :add20_1::28]
    
    seldon pipeline infer pipeline-add10 -i 100 --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    Success: map[:add10_1::24 :add20_1::32 :mul10_1::44 :pipeline-add10.pipeline::56 :pipeline-mul10.pipeline::44]
    
    seldon pipeline infer pipeline-add10 --show-headers --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    > /inference.GRPCInferenceService/ModelInfer HTTP/2
    > Host: localhost:9000
    > seldon-model:[pipeline-add10.pipeline]
    
    < x-request-id:[cieovf0fh5ss739279u0]
    < x-envoy-upstream-service-time:[5]
    < x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
    < date:[Thu, 29 Jun 2023 14:05:48 GMT]
    < server:[envoy]
    < content-type:[application/grpc]
    < x-forwarded-proto:[http]
    < x-envoy-expected-rq-timeout-ms:[60000]
    
    {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
    
    seldon pipeline infer pipeline-add10 -s --show-headers --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    > /inference.GRPCInferenceService/ModelInfer HTTP/2
    > Host: localhost:9000
    > x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
    > seldon-model:[pipeline-add10.pipeline]
    
    < x-forwarded-proto:[http]
    < x-envoy-expected-rq-timeout-ms:[60000]
    < x-request-id:[cieovf8fh5ss739279ug]
    < x-envoy-upstream-service-time:[6]
    < date:[Thu, 29 Jun 2023 14:05:49 GMT]
    < server:[envoy]
    < content-type:[application/grpc]
    < x-seldon-route:[:add10_1: :pipeline-add10.pipeline: :add20_1: :pipeline-add10.pipeline:]
    
    {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[21, 22, 23, 24]}}]}
    
    seldon experiment stop addmul10
    seldon experiment stop add1020
    seldon pipeline unload pipeline-add10
    seldon pipeline unload pipeline-mul10
    seldon model unload add10
    seldon model unload add20
    seldon model unload mul10
    cat ./models/sklearn1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
    
    cat ./models/sklearn2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris2
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
    
    seldon model load -f ./models/sklearn1.yaml
    seldon model load -f ./models/sklearn2.yaml
    {}
    {}
    
    seldon model status iris -w ModelAvailable
    seldon model status iris2 -w ModelAvailable
    {}
    {}
    
    cat ./experiments/sklearn-mirror.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: sklearn-mirror
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 100
      mirror:
        name: iris2
        percent: 100
    
    seldon experiment start -f ./experiments/sklearn-mirror.yaml
    seldon experiment status sklearn-mirror -w | jq -M .
    {
      "experimentName": "sklearn-mirror",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon model infer iris -i 50 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris_1::50]
    
curl -s 0.0.0.0:9006/metrics | grep seldon_model_infer_total | grep iris2_1
    seldon_model_infer_total{code="200",method_type="rest",model="iris",model_internal="iris2_1",server="mlserver",server_replica="0"} 50
    
    seldon experiment stop sklearn-mirror
    seldon model unload iris
    seldon model unload iris2
    cat ./models/add10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: add10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
      requirements:
      - triton
      - python
    
    cat ./models/mul10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mul10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
      requirements:
      - triton
      - python
    
    seldon model load -f ./models/add10.yaml
    seldon model load -f ./models/mul10.yaml
    {}
    {}
    
    seldon model status add10 -w ModelAvailable
    seldon model status mul10 -w ModelAvailable
    {}
    {}
    
    cat ./pipelines/mul10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: pipeline-mul10
    spec:
      steps:
        - name: mul10
      output:
        steps:
        - mul10
    
    cat ./pipelines/add10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: pipeline-add10
    spec:
      steps:
        - name: add10
      output:
        steps:
        - add10
    
    seldon pipeline load -f ./pipelines/add10.yaml
    seldon pipeline load -f ./pipelines/mul10.yaml
    seldon pipeline status pipeline-add10 -w PipelineReady
    seldon pipeline status pipeline-mul10 -w PipelineReady
    {"pipelineName":"pipeline-add10", "versions":[{"pipeline":{"name":"pipeline-add10", "uid":"ciep072i8ufs73flaipg", "version":1, "steps":[{"name":"add10"}], "output":{"steps":["add10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:07:24.903503109Z", "modelsReady":true}}]}
    {"pipelineName":"pipeline-mul10", "versions":[{"pipeline":{"name":"pipeline-mul10", "uid":"ciep072i8ufs73flaiq0", "version":1, "steps":[{"name":"mul10"}], "output":{"steps":["mul10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:07:25.082642153Z", "modelsReady":true}}]}
    
    seldon pipeline infer pipeline-add10 --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
    
    seldon pipeline infer pipeline-mul10 --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
    
    cat ./experiments/addmul10-mirror.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: addmul10-mirror
    spec:
      default: pipeline-add10
      resourceType: pipeline
      candidates:
      - name: pipeline-add10
        weight: 100
      mirror:
        name: pipeline-mul10
        percent: 100
    
    seldon experiment start -f ./experiments/addmul10-mirror.yaml
    seldon experiment status addmul10-mirror -w | jq -M .
    {
      "experimentName": "addmul10-mirror",
      "active": true,
      "candidatesReady": true,
      "mirrorReady": true,
      "statusDescription": "experiment active",
      "kubernetesMeta": {}
    }
    
    seldon pipeline infer pipeline-add10 -i 1 --inference-mode grpc \
     '{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
    {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
    
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep mul10_1
    seldon_model_infer_total{code="OK",method_type="grpc",model="mul10",model_internal="mul10_1",server="triton",server_replica="0"} 2
    
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep add10_1
    seldon_model_infer_total{code="OK",method_type="grpc",model="add10",model_internal="add10_1",server="triton",server_replica="0"} 2
    
    seldon pipeline infer pipeline-add10 -i 1 \
     '{"model_name":"add10","inputs":[{"name":"INPUT","data":[1,2,3,4],"datatype":"FP32","shape":[4]}]}'
    {
    	"model_name": "",
    	"outputs": [
    		{
    			"data": [
    				11,
    				12,
    				13,
    				14
    			],
    			"name": "OUTPUT",
    			"shape": [
    				4
    			],
    			"datatype": "FP32"
    		}
    	]
    }
    
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep mul10_1
    seldon_model_infer_total{code="OK",method_type="grpc",model="mul10",model_internal="mul10_1",server="triton",server_replica="0"} 3
    
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep add10_1
    seldon_model_infer_total{code="OK",method_type="grpc",model="add10",model_internal="add10_1",server="triton",server_replica="0"} 3
    
    seldon pipeline inspect pipeline-mul10
    seldon.default.model.mul10.inputs	ciep0bofh5ss73dpdiq0	{"inputs":[{"name":"INPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[1, 2, 3, 4]}}]}
    seldon.default.model.mul10.outputs	ciep0bofh5ss73dpdiq0	{"modelName":"mul10_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
    seldon.default.pipeline.pipeline-mul10.inputs	ciep0bofh5ss73dpdiq0	{"inputs":[{"name":"INPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[1, 2, 3, 4]}}]}
    seldon.default.pipeline.pipeline-mul10.outputs	ciep0bofh5ss73dpdiq0	{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
    
    seldon experiment stop addmul10-mirror
    seldon pipeline unload pipeline-add10
    seldon pipeline unload pipeline-mul10
    seldon model unload add10
    seldon model unload mul10
    cat ../../models/hf-whisper.yaml
    echo "---"
    cat ../../models/hf-sentiment.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: whisper
    spec:
      storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
      requirements:
      - huggingface
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: sentiment
    spec:
      storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
      requirements:
      - huggingface
    
    kubectl apply -f ../../models/hf-whisper.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/hf-sentiment.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/whisper created
    model.mlops.seldon.io/sentiment created
    kubectl wait --for condition=ready --timeout=300s model whisper -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model sentiment -n ${NAMESPACE}
    model.mlops.seldon.io/whisper condition met
    model.mlops.seldon.io/sentiment condition met
    seldon model load -f ../../models/hf-whisper.yaml
    seldon model load -f ../../models/hf-sentiment.yaml
    {}
    {}
    
    seldon model status whisper -w ModelAvailable | jq -M .
    seldon model status sentiment -w ModelAvailable | jq -M .
    {}
    {}
    
    cat ./sentiment-input-transform/model.py | pygmentize
    # Copyright (c) 2024 Seldon Technologies Ltd.
    
    # Use of this software is governed BY
    # (1) the license included in the LICENSE file or
    # (2) if the license included in the LICENSE file is the Business Source License 1.1,
    # the Change License after the Change Date as each is defined in accordance with the LICENSE file.
    
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
    from mlserver.codecs.string import StringRequestCodec
    from mlserver.logging import logger
    import json
    
    
    class SentimentInputTransformRuntime(MLModel):
    
      async def load(self) -> bool:
        return self.ready
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        logger.info("payload (input-transform): %s",payload)
        res_list = self.decode_request(payload, default_codec=StringRequestCodec)
        logger.info("res list (input-transform): %s",res_list)
        texts = []
        for res in res_list:
          logger.info("decoded data (input-transform): %s", res)
          #text = json.loads(res)
          text = res
          texts.append(text["text"])
    
        logger.info("transformed data (input-transform): %s", texts)
        response =  StringRequestCodec.encode_response(
          model_name="sentiment",
          payload=texts
        )
        logger.info("response (input-transform): %s", response)
        return response
    
    cat ./sentiment-output-transform/model.py | pygmentize
    # Copyright (c) 2024 Seldon Technologies Ltd.
    
    # Use of this software is governed BY
    # (1) the license included in the LICENSE file or
    # (2) if the license included in the LICENSE file is the Business Source License 1.1,
    # the Change License after the Change Date as each is defined in accordance with the LICENSE file.
    
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
    from mlserver.codecs import StringCodec, Base64Codec, NumpyRequestCodec
    from mlserver.codecs.string import StringRequestCodec
    from mlserver.codecs.numpy import NumpyRequestCodec
    import base64
    from mlserver.logging import logger
    import numpy as np
    import json
    
    class SentimentOutputTransformRuntime(MLModel):
    
      async def load(self) -> bool:
        return self.ready
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        logger.info("payload (output-transform): %s",payload)
        res_list = self.decode_request(payload, default_codec=StringRequestCodec)
        logger.info("res list (output-transform): %s",res_list)
        scores = []
        for res in res_list:
          logger.debug("decoded data (output transform): %s",res)
          #sentiment = json.loads(res)
          sentiment = res
          if sentiment["label"] == "POSITIVE":
            scores.append(1)
          else:
            scores.append(0)
        response =  NumpyRequestCodec.encode_response(
          model_name="sentiments",
          payload=np.array(scores)
        )
        logger.info("response (output-transform): %s", response)
        return response
    
    cat ../../models/hf-sentiment-input-transform.yaml
    echo "---"
    cat ../../models/hf-sentiment-output-transform.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: sentiment-input-transform
    spec:
      storageUri: "gs://seldon-models/scv2/examples/huggingface/mlserver_1.3.5/sentiment-input-transform"
      requirements:
      - mlserver
      - python
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: sentiment-output-transform
    spec:
      storageUri: "gs://seldon-models/scv2/examples/huggingface/mlserver_1.3.5/sentiment-output-transform"
      requirements:
      - mlserver
      - python
    
    kubectl apply -f ../../models/hf-sentiment-input-transform.yaml -n ${NAMESPACE}
    kubectl apply -f ../../models/hf-sentiment-output-transform.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/sentiment-input-transform created
    model.mlops.seldon.io/sentiment-output-transform created
    kubectl wait --for condition=ready --timeout=300s model sentiment-input-transform -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s model sentiment-output-transform -n ${NAMESPACE}
    model.mlops.seldon.io/sentiment-input-transform condition met
    model.mlops.seldon.io/sentiment-output-transform condition met
    seldon model load -f ../../models/hf-sentiment-input-transform.yaml
    seldon model load -f ../../models/hf-sentiment-output-transform.yaml
    {}
    {}
    
    seldon model status sentiment-input-transform -w ModelAvailable | jq -M .
    seldon model status sentiment-output-transform -w ModelAvailable | jq -M .
    {}
    {}
    cat ../../pipelines/sentiment-explain.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: sentiment-explain
    spec:
      steps:
        - name: sentiment
          tensorMap:
            sentiment-explain.inputs.predict: array_inputs
        - name: sentiment-output-transform
          inputs:
          - sentiment
      output:
        steps:
        - sentiment-output-transform
    
kubectl apply -f ../../pipelines/sentiment-explain.yaml -n ${NAMESPACE}
pipeline.mlops.seldon.io/sentiment-explain created
kubectl wait --for condition=ready --timeout=300s pipeline sentiment-explain -n ${NAMESPACE}
pipeline.mlops.seldon.io/sentiment-explain condition met
    seldon pipeline load -f ../../pipelines/sentiment-explain.yaml
    seldon pipeline status sentiment-explain -w PipelineReady | jq -M .
    {
      "pipelineName": "sentiment-explain",
      "versions": [
        {
          "pipeline": {
            "name": "sentiment-explain",
            "uid": "cihuo3svgtec73bj6ncg",
            "version": 2,
            "steps": [
              {
                "name": "sentiment",
                "tensorMap": {
                  "sentiment-explain.inputs.predict": "array_inputs"
                }
              },
              {
                "name": "sentiment-output-transform",
                "inputs": [
                  "sentiment.outputs"
                ]
              }
            ],
            "output": {
              "steps": [
                "sentiment-output-transform.outputs"
              ]
            },
            "kubernetesMeta": {}
          },
          "state": {
            "pipelineVersion": 2,
            "status": "PipelineReady",
            "reason": "created pipeline",
            "lastChangeTimestamp": "2023-07-04T09:53:19.250753906Z",
            "modelsReady": true
          }
        }
      ]
    }
    
    cat ../../models/hf-sentiment-explainer.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: sentiment-explainer
    spec:
      storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
      explainer:
        type: anchor_text
        pipelineRef: sentiment-explain
    
kubectl apply -f ../../models/hf-sentiment-explainer.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explainer created
kubectl wait --for condition=ready --timeout=300s model sentiment-explainer -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explainer condition met
    seldon model load -f ../../models/hf-sentiment-explainer.yaml
    {}
    
    seldon model status sentiment-explainer -w ModelAvailable | jq -M .
    Error: Model wait status timeout
    cat ../../pipelines/speech-to-sentiment.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: speech-to-sentiment
    spec:
      steps:
        - name: whisper
        - name: sentiment
          inputs:
          - whisper
          tensorMap:
            whisper.outputs.output: args
        - name: sentiment-input-transform
          inputs:
          - whisper
        - name: sentiment-explainer
          inputs:
          - sentiment-input-transform
      output:
        steps:
        - sentiment
        - whisper
    
kubectl apply -f ../../pipelines/speech-to-sentiment.yaml -n ${NAMESPACE}
pipeline.mlops.seldon.io/speech-to-sentiment created
kubectl wait --for condition=ready --timeout=300s pipeline speech-to-sentiment -n ${NAMESPACE}
pipeline.mlops.seldon.io/speech-to-sentiment condition met
    seldon pipeline load -f ../../pipelines/speech-to-sentiment.yaml
    seldon pipeline status speech-to-sentiment -w PipelineReady | jq -M .
    {
      "pipelineName": "speech-to-sentiment",
      "versions": [
        {
          "pipeline": {
            "name": "speech-to-sentiment",
            "uid": "cihuqb4vgtec73bj6nd0",
            "version": 2,
            "steps": [
              {
                "name": "sentiment",
                "inputs": [
                  "whisper.outputs"
                ],
                "tensorMap": {
                  "whisper.outputs.output": "args"
                }
              },
              {
                "name": "sentiment-explainer",
                "inputs": [
                  "sentiment-input-transform.outputs"
                ]
              },
              {
                "name": "sentiment-input-transform",
                "inputs": [
                  "whisper.outputs"
                ]
              },
              {
                "name": "whisper"
              }
            ],
            "output": {
              "steps": [
                "sentiment.outputs",
                "whisper.outputs"
              ]
            },
            "kubernetesMeta": {}
          },
          "state": {
            "pipelineVersion": 2,
            "status": "PipelineReady",
            "reason": "created pipeline",
            "lastChangeTimestamp": "2023-07-04T09:58:04.277171896Z",
            "modelsReady": true
          }
        }
      ]
    }
    
from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True,'video':False})
    recorder = AudioRecorder(stream=camera)
    recorder
    AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …
    
    infer("speech-to-sentiment.pipeline")
    cihuqm8fh5ss73der5gg
    b'{"text": " Cambridge is a great place."}'
    b'{"label": "POSITIVE", "score": 0.9998548030853271}'
    
    while True:
        base64Res = !seldon pipeline inspect speech-to-sentiment.sentiment-explainer.outputs --format json \
              --request-id ${REQUEST_ID}
        j = json.loads(base64Res[0])
        if j["topics"][0]["msgs"] is not None:
            expBase64 = j["topics"][0]["msgs"][0]["value"]["outputs"][0]["contents"]["bytesContents"][0]
            expRaw = base64.b64decode(expBase64)
            exp = json.loads(expRaw)
            print("")
            print("Explanation anchors:",exp["data"]["anchor"])
            break
        else:
            print(".",end='')
            time.sleep(1)
    
    ......
    Explanation anchors: ['great']
    
    kubectl delete -f ../../pipelines/speech-to-sentiment.yaml -n ${NAMESPACE}
    kubectl delete -f ../../pipelines/sentiment-explain.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/hf-whisper.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/hf-sentiment.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/hf-sentiment-input-transform.yaml -n ${NAMESPACE}
    kubectl delete -f ../../models/hf-sentiment-output-transform.yaml -n ${NAMESPACE}
    seldon pipeline unload speech-to-sentiment
    seldon pipeline unload sentiment-explain
    seldon model unload whisper
    seldon model unload sentiment
    seldon model unload sentiment-explainer
    seldon model unload sentiment-output-transform
    seldon model unload sentiment-input-transform
    
    apiVersion: v1
    kind: Node
    metadata:
      name: example-node         # Replace with the actual node name
      labels:
        pool: infer-srv          # Custom label
        nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb-SHARED  # Sample label from GPU discovery
        cloud.google.com/gke-accelerator: nvidia-a100-80gb      # GKE without NVIDIA GPU operator
        cloud.google.com/gke-accelerator-count: "2"              # Accelerator count
    spec:
      taints:
        - effect: NoSchedule
          key: seldon-gpu-srv
          value: "true"
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-llm-local-gpu     # <server name>
      namespace: seldon-mesh            # <seldon runtime namespace>
    spec:
      replicas: 1
      serverConfig: mlserver            # <reference Serverconfig CR>
      extraCapabilities:
        - model-on-gpu                  # Custom capability for matching Model to this server
      podSpec:
        nodeSelector:                   # Schedule pods only on nodes with these labels
          pool: infer-srv
          cloud.google.com/gke-accelerator: nvidia-a100-80gb  # Example requesting specific GPU on GKE
          # cloud.google.com/gke-accelerator-count: 2          # Optional GPU count
        tolerations:                    # Allow scheduling on nodes with the matching taint
          - effect: NoSchedule
            key: seldon-gpu-srv
            operator: Equal
            value: "true"
        containers:                     # Override settings from Serverconfig if needed
          - name: mlserver
            resources:
              requests:
                nvidia.com/gpu: 1       # Request a GPU for the mlserver container
                cpu: 40
                memory: 360Gi
                ephemeral-storage: 290Gi
              limits:
                nvidia.com/gpu: 2       # Limit to 2 GPUs
                cpu: 40
                memory: 360Gi
    
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-llm-local-gpu     # <server name>
      namespace: seldon-mesh            # <seldon runtime namespace>
    spec:
      podSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: "pool"
                  operator: In
                  values:
                  - infer-srv
                - key: "cloud.google.com/gke-accelerator"
                  operator: In
                  values:
                  - nvidia-a100-80gb
        tolerations:                     # Allow mlserver-llm-local-gpu pods to be scheduled on nodes with the matching taint
        - effect: NoSchedule
          key: seldon-gpu-srv
          operator: Equal
          value: "true"
        containers:                      # If needed, override settings from ServerConfig for this specific Server
          - name: mlserver
            resources:
              requests:
                nvidia.com/gpu: 1        # Request a GPU for the mlserver container
                cpu: 40
                memory: 360Gi
                ephemeral-storage: 290Gi
              limits:
                nvidia.com/gpu: 2        # Limit to 2 GPUs
                cpu: 40
                memory: 360Gi
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-llm-local-gpu     # <server name>
      namespace: seldon-mesh            # <seldon runtime namespace>
    spec:
      podSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                  - key: "cloud.google.com/gke-accelerator-count"
                    operator: Gt       # (greater than)
                    values: ["1"]
                  - key: "gpu.gpu-vendor.example/installed-memory"
                    operator: Gt
                    values: ["75000"]
                  - key: "feature.node.kubernetes.io/pci-10.present" # NFD Feature label
                    operator: In
                    values: ["true"] # (optional) only schedule on nodes with PCI device 10
    
        tolerations:                     # Allow mlserver-llm-local-gpu pods to be scheduled on nodes with the matching taint
        - effect: NoSchedule
          key: seldon-gpu-srv
          operator: Equal
          value: "true"
    
        containers:                      # If needed, override settings from ServerConfig for this specific Server
          - name: mlserver
            env:
              ...                        # Add your environment variables here
            image: ...                   # Specify your container image here
            resources:
              requests:
                nvidia.com/gpu: 1        # Request a GPU for the mlserver container
                cpu: 40
                memory: 360Gi
                ephemeral-storage: 290Gi
              limits:
                nvidia.com/gpu: 2        # Limit to 2 GPUs
                cpu: 40
                memory: 360Gi
            ...                           # Other configurations can go here
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: ServerConfig
    metadata:
      name: mlserver-llm              # <ServerConfig name>
      namespace: seldon-mesh           # <seldon runtime namespace>
    spec:
      podSpec:
        nodeSelector:                  # Schedule pods only on nodes with these labels
          pool: infer-srv
          cloud.google.com/gke-accelerator: nvidia-a100-80gb  # Example requesting specific GPU on GKE
          # cloud.google.com/gke-accelerator-count: 2          # Optional GPU count
        tolerations:                   # Allow scheduling on nodes with the matching taint
          - effect: NoSchedule
            key: seldon-gpu-srv
            operator: Equal
            value: "true"
        containers:                    # Define the container specifications
          - name: mlserver
            env:                       # Environment variables (fill in as needed)
              ...
            image: ...                 # Specify the container image
            resources:
              requests:
                nvidia.com/gpu: 1      # Request a GPU for the mlserver container
                cpu: 40
                memory: 360Gi
                ephemeral-storage: 290Gi
              limits:
                nvidia.com/gpu: 2      # Limit to 2 GPUs
                cpu: 40
                memory: 360Gi
            ...                        # Additional container configurations
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: llama3           # <model name>
      namespace: seldon-mesh # <seldon runtime namespace>
    spec:
      requirements:
      - model-on-gpu         # requirement matching a Server capability
    
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    metadata:
      name: mlserver-llm-local-gpu     # <server name>
      namespace: seldon-mesh           # <seldon runtime namespace>
    spec:
      serverConfig: mlserver           # <reference ServerConfig CR>
      extraCapabilities:
        - model-on-gpu                 # custom capability that can be used for matching Model to this server
      # Other fields would go here
    apiVersion: mlops.seldon.io/v1alpha1
    kind: ServerConfig
    metadata:
      name: mlserver-llm               # <ServerConfig name>
      namespace: seldon-mesh           # <seldon runtime namespace>
    spec:
      podSpec:
        containers:
          - name: agent                # note the setting is applied to the agent container
            env:
              - name: SELDON_SERVER_CAPABILITIES
                value: mlserver,alibi-detect,...,xgboost,model-on-gpu  # add capability to the list
            image: ...
        # Other configurations go here
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: llama3           # <model name>
      namespace: seldon-mesh # <seldon runtime namespace>
    spec:
      server: mlserver-llm-local-gpu   # <reference Server CR>
      requirements:
        - model-on-gpu                # requirement matching a Server capability
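
After the Model and Server above are applied, you can check that the model was scheduled onto the intended server; a minimal verification sketch, assuming both resources live in the seldon-mesh namespace:

kubectl get model llama3 -n seldon-mesh -o jsonpath='{.status}' | jq -M .
kubectl get server mlserver-llm-local-gpu -n seldon-mesh -o jsonpath='{.status.loadedModels}'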
    
    curl -k http://<INGRESS_IP>:80/v2/models/iris/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: iris" \
      -d '{
        "inputs": [
          {
            "name": "predict",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[1, 2, 3, 4]]
          }
        ]
      }' | jq
    
    seldon model infer iris --inference-host <INGRESS_IP>:80 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    for i in {1..10}; do 
      curl -s -k <INGRESS_IP>:80/v2/models/experiment-sample/infer \
        -H "Host: seldon-mesh.inference.seldon" \
        -H "Content-Type: application/json" \
        -H "Seldon-Model: experiment-sample.experiment" \
        -d '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP32","data":[[1,2,3,4]]}]}' \
        | jq -r .model_name
    done | sort | uniq -c
    
     4 iris2_1
     6 iris_1
    
    seldon model infer --inference-host <INGRESS_IP>:80 -i 10 iris \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    Success: map[:iris2_1::4 :iris_1::6]
    
    curl -k <INGRESS_IP>:80/v2/models/tfsimples/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimples.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1, 16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1, 16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' |jq 
    
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimples --inference-mode grpc --inference-host <INGRESS_IP>:80 \
        '{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        }
      ]
    }
    
    curl -k <INGRESS_IP>:80/v2/models/join/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: join.pipeline" \
      -d '{
        "model_name": "simple",
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1, 16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1, 16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' |jq
    
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer join --inference-mode grpc --inference-host <INGRESS_IP>:80 \
        '{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    
    {
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        }
      ]
    }
    ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
    echo "Seldon Core 2: http://$ISTIO_INGRESS"
    cat ./models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    
    kubectl create -f ./models/sklearn-iris-gs.yaml -n seldon-mesh
    model.mlops.seldon.io/iris created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/iris condition met
    
    kubectl get model iris -n seldon-mesh -o jsonpath='{.status}' | jq -M .
    {
      "conditions": [
        {
          "lastTransitionTime": "2023-06-30T10:01:52Z",
          "message": "ModelAvailable",
          "status": "True",
          "type": "ModelReady"
        },
        {
          "lastTransitionTime": "2023-06-30T10:01:52Z",
          "status": "True",
          "type": "Ready"
        }
      ],
      "replicas": 1
    }
Make a REST inference call to the model via the ingress:

seldon model infer iris --inference-host <INGRESS_IP>:80 \
   '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "7fd401e1-3dce-46f5-9668-902aea652b89",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    seldon model infer iris --inference-mode grpc --inference-host <INGRESS_IP>:80 \
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
    {
      "modelName": "iris_1",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "predict",
          "datatype": "INT64",
          "shape": [
            "1",
            "1"
          ],
          "parameters": {
            "content_type": {
              "stringParam": "np"
            }
          },
          "contents": {
            "int64Contents": [
              "2"
            ]
          }
        }
      ]
    }
    
    kubectl get server mlserver -n seldon-mesh -o jsonpath='{.status}' | jq -M .
    {
      "conditions": [
        {
          "lastTransitionTime": "2023-06-30T09:59:12Z",
          "status": "True",
          "type": "Ready"
        },
        {
          "lastTransitionTime": "2023-06-30T09:59:12Z",
          "reason": "StatefulSet replicas matches desired replicas",
          "status": "True",
          "type": "StatefulSetReady"
        }
      ],
      "loadedModels": 1,
      "replicas": 1,
      "selector": "seldon-server-name=mlserver"
    }
    
    kubectl delete -f ./models/sklearn-iris-gs.yaml -n seldon-mesh
    model.mlops.seldon.io "iris" deleted
    
    cat ./models/sklearn1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
    
    cat ./models/sklearn2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris2
    spec:
      storageUri: "gs://seldon-models/mlserver/iris"
      requirements:
      - sklearn
    
    kubectl create -f ./models/sklearn1.yaml -n seldon-mesh
    kubectl create -f ./models/sklearn2.yaml -n seldon-mesh
    model.mlops.seldon.io/iris created
    model.mlops.seldon.io/iris2 created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/iris condition met
    model.mlops.seldon.io/iris2 condition met
    
    cat ./experiments/ab-default-model.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Experiment
    metadata:
      name: experiment-sample
    spec:
      default: iris
      candidates:
      - name: iris
        weight: 50
      - name: iris2
        weight: 50
    
    kubectl create -f ./experiments/ab-default-model.yaml -n seldon-mesh
    experiment.mlops.seldon.io/experiment-sample created
    
    kubectl wait --for condition=ready --timeout=300s experiment --all -n seldon-mesh
    experiment.mlops.seldon.io/experiment-sample condition met
    
    kubectl delete -f ./experiments/ab-default-model.yaml -n seldon-mesh
    kubectl delete -f ./models/sklearn1.yaml -n seldon-mesh
    kubectl delete -f ./models/sklearn2.yaml -n seldon-mesh
    experiment.mlops.seldon.io "experiment-sample" deleted
    model.mlops.seldon.io "iris" deleted
    model.mlops.seldon.io "iris2" deleted
    
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    
    cat ./pipelines/tfsimples.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimples
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
          inputs:
          - tfsimple1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple1.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple2
    
    kubectl create -f ./pipelines/tfsimples.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimples created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimples condition met
    
    kubectl delete -f ./pipelines/tfsimples.yaml -n seldon-mesh
    pipeline.mlops.seldon.io "tfsimples" deleted
    
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io "tfsimple1" deleted
    model.mlops.seldon.io "tfsimple2" deleted
    
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    cat ./models/tfsimple3.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple3
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple3.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    model.mlops.seldon.io/tfsimple3 created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    model.mlops.seldon.io/tfsimple3 condition met
    
    cat ./pipelines/tfsimples-join.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: join
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
        - name: tfsimple3
          inputs:
          - tfsimple1.outputs.OUTPUT0
          - tfsimple2.outputs.OUTPUT1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple2.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple3
    
    kubectl create -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/join created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/join condition met
    
    kubectl delete -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
    pipeline.mlops.seldon.io "join" deleted
    
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple3.yaml -n seldon-mesh
    model.mlops.seldon.io "tfsimple1" deleted
    model.mlops.seldon.io "tfsimple2" deleted
    model.mlops.seldon.io "tfsimple3" deleted
    
    cat ./models/income.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
      requirements:
      - sklearn
    
    kubectl create -f ./models/income.yaml -n seldon-mesh
    model.mlops.seldon.io/income created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/income condition met
    
    kubectl get model income -n seldon-mesh -o jsonpath='{.status}' | jq -M .
    {
      "availableReplicas": 1,
      "conditions": [
        {
          "lastTransitionTime": "2025-10-30T08:37:24Z",
          "message": "ModelAvailable",
          "status": "True",
          "type": "ModelReady"
        },
        {
          "lastTransitionTime": "2025-10-30T08:37:24Z",
          "status": "True",
          "type": "Ready"
        }
      ],
      "modelgwReady": "ModelAvailable(1/1 ready ) ",
      "replicas": 1,
      "selector": "server=mlserver"
    }
    seldon model infer income --inference-host <INGRESS_IP>:80 \
         '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
    {
    	"model_name": "income_1",
    	"model_version": "1",
    	"id": "cdf32df2-eb42-42d8-9f66-404bcab95540",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				0
    			]
    		}
    	]
    }
    
    cat ./models/income-explainer.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: income-explainer
    spec:
      storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
      explainer:
        type: anchor_tabular
        modelRef: income
    
    kubectl create -f ./models/income-explainer.yaml -n seldon-mesh
    model.mlops.seldon.io/income-explainer created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/income condition met
    model.mlops.seldon.io/income-explainer condition met
    
    kubectl get model income-explainer -n seldon-mesh -o jsonpath='{.status}' | jq -M .
    {
      "conditions": [
        {
          "lastTransitionTime": "2023-06-30T10:03:07Z",
          "message": "ModelAvailable",
          "status": "True",
          "type": "ModelReady"
        },
        {
          "lastTransitionTime": "2023-06-30T10:03:07Z",
          "status": "True",
          "type": "Ready"
        }
      ],
      "replicas": 1
    }
    
    seldon model infer income-explainer --inference-host <INGRESS_IP>:80 \
         '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
    {
    	"model_name": "income-explainer_1",
    	"model_version": "1",
    	"id": "3028a904-9bb3-42d7-bdb7-6e6993323ed7",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "explanation",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "BYTES",
    			"parameters": {
    				"content_type": "str"
    			},
    			"data": [
    				"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.1\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9705882352941176, \"coverage\": 0.0699, \"raw\": {\"feature\": [3, 5], \"mean\": [0.8094218415417559, 0.9705882352941176], \"precision\": [0.8094218415417559, 0.9705882352941176], \"coverage\": [0.3036, 0.0699], \"examples\": [{\"covered_true\": [[23, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [44, 4, 1, 1, 8, 0, 4, 1, 0, 0, 40, 9], [60, 2, 5, 1, 5, 1, 4, 0, 0, 0, 25, 9], [52, 4, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [66, 6, 1, 1, 8, 0, 4, 1, 0, 0, 8, 9], [52, 4, 1, 1, 8, 0, 4, 1, 0, 0, 40, 9], [27, 4, 1, 1, 1, 1, 4, 1, 0, 0, 35, 9], [48, 4, 1, 1, 6, 0, 4, 1, 0, 0, 45, 9], [45, 6, 1, 1, 5, 0, 4, 1, 0, 0, 40, 9], [40, 2, 1, 1, 5, 4, 4, 0, 0, 0, 45, 9]], \"covered_false\": [[42, 6, 5, 1, 6, 0, 4, 1, 99999, 0, 80, 9], [29, 4, 1, 1, 8, 1, 4, 1, 0, 0, 50, 9], [49, 4, 1, 1, 8, 0, 4, 1, 0, 0, 50, 9], [34, 4, 5, 1, 8, 0, 4, 1, 0, 0, 40, 9], [38, 2, 1, 1, 5, 5, 4, 0, 7688, 0, 40, 9], [45, 7, 5, 1, 5, 0, 4, 1, 0, 0, 45, 9], [43, 4, 2, 1, 5, 0, 4, 1, 99999, 0, 55, 9], [47, 4, 5, 1, 6, 1, 4, 1, 27828, 0, 60, 9], [42, 6, 1, 1, 2, 0, 4, 1, 15024, 0, 60, 9], [56, 4, 1, 1, 6, 0, 2, 1, 7688, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[23, 4, 1, 1, 4, 3, 4, 1, 0, 0, 40, 9], [50, 2, 5, 1, 8, 3, 2, 1, 0, 0, 45, 9], [24, 4, 1, 1, 7, 3, 4, 0, 0, 0, 40, 3], [62, 4, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [22, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [44, 4, 1, 1, 1, 3, 4, 0, 0, 0, 40, 9], [46, 4, 1, 1, 4, 3, 4, 1, 0, 0, 40, 9], [44, 4, 1, 1, 2, 3, 4, 1, 0, 0, 40, 9], [25, 4, 5, 1, 5, 3, 4, 1, 0, 0, 35, 9], [32, 2, 5, 1, 5, 3, 4, 1, 0, 0, 50, 9]], \"covered_false\": [[57, 5, 5, 1, 6, 3, 4, 1, 99999, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7688, 0, 60, 9], [43, 2, 5, 1, 4, 3, 2, 0, 8614, 0, 47, 9], [56, 5, 2, 1, 5, 3, 4, 1, 99999, 0, 70, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
    			]
    		}
    	]
    }
    
    kubectl delete -f ./models/income.yaml -n seldon-mesh
    kubectl delete -f ./models/income-explainer.yaml -n seldon-mesh
    model.mlops.seldon.io "income" deleted
    model.mlops.seldon.io "income-explainer" deleted
    
    pip install mlserver
    import os
    os.environ["NAMESPACE"] = "seldon-mesh"
    MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    MESH_IP=MESH_IP[0]
    import os
    os.environ['MESH_IP'] = MESH_IP
    MESH_IP
    '172.18.255.2'
    
    cat models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    
    cat pipelines/iris.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: iris-pipeline
    spec:
      steps:
        - name: iris
      output:
        steps:
        - iris
    
    cat models/tfsimple1.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    cat pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    kubectl apply -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
    kubectl apply -f pipelines/iris.yaml -n ${NAMESPACE}
    
    kubectl apply -f models/tfsimple1.yaml -n ${NAMESPACE}
    kubectl apply -f pipelines/tfsimple.yaml -n ${NAMESPACE}
    model.mlops.seldon.io/iris created
    pipeline.mlops.seldon.io/iris-pipeline created
    model.mlops.seldon.io/tfsimple1 created
    pipeline.mlops.seldon.io/tfsimple created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
    kubectl wait --for condition=ready --timeout=300s pipelines --all -n ${NAMESPACE}
    model.mlops.seldon.io/iris condition met
    model.mlops.seldon.io/tfsimple1 condition met
    pipeline.mlops.seldon.io/iris-pipeline condition met
    pipeline.mlops.seldon.io/tfsimple condition met
    
    seldon model infer iris --inference-host ${MESH_IP}:80 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq -M .
    {
      "model_name": "iris_1",
      "model_version": "1",
      "id": "25e1c1b9-a20f-456d-bdff-c75d5ba83b1f",
      "parameters": {},
      "outputs": [
        {
          "name": "predict",
          "shape": [
            1,
            1
          ],
          "datatype": "INT64",
          "parameters": {
            "content_type": "np"
          },
          "data": [
            2
          ]
        }
      ]
    }
    
    seldon pipeline infer iris-pipeline --inference-host ${MESH_IP}:80 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' |  jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2
          ],
          "name": "predict",
          "shape": [
            1,
            1
          ],
          "datatype": "INT64",
          "parameters": {
            "content_type": "np"
          }
        }
      ]
    }
    
    seldon model infer tfsimple1 --inference-host ${MESH_IP}:80 \
      '{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "tfsimple1_1",
      "model_version": "1",
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            1,
            16
          ],
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ]
        }
      ]
    }
    
    seldon pipeline infer tfsimple --inference-host ${MESH_IP}:80 \
      '{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    cat batch-inputs/iris-input.txt | head -n 1 | jq -M .
    {
      "inputs": [
        {
          "name": "predict",
          "data": [
            0.38606369295833043,
            0.006894049558299753,
            0.6104082981607108,
            0.3958954239450676
          ],
          "datatype": "FP64",
          "shape": [
            1,
            4
          ]
        }
      ]
    }
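
Each line of the batch input file is a complete Open Inference Protocol request. If you need to generate a similar file yourself, a minimal sketch (the random feature values and the 100-row size are illustrative assumptions):

import json
import random

# Write 100 single-row inference requests, one JSON document per line.
with open("batch-inputs/iris-input.txt", "w") as f:
    for _ in range(100):
        request = {
            "inputs": [
                {
                    "name": "predict",
                    "data": [random.random() for _ in range(4)],
                    "datatype": "FP64",
                    "shape": [1, 4],
                }
            ]
        }
        f.write(json.dumps(request) + "\n")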
    
    %%bash
    mlserver infer -u ${MESH_IP} -m iris -i batch-inputs/iris-input.txt -o /tmp/iris-output.txt --workers 5
    
    2023-06-30 11:05:32,389 [mlserver] INFO - server url: 172.18.255.2
    2023-06-30 11:05:32,389 [mlserver] INFO - model name: iris
    2023-06-30 11:05:32,389 [mlserver] INFO - request headers: {}
    2023-06-30 11:05:32,389 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
    2023-06-30 11:05:32,389 [mlserver] INFO - output file path: /tmp/iris-output.txt
    2023-06-30 11:05:32,389 [mlserver] INFO - workers: 5
    2023-06-30 11:05:32,389 [mlserver] INFO - retries: 3
    2023-06-30 11:05:32,389 [mlserver] INFO - batch interval: 0.0
    2023-06-30 11:05:32,389 [mlserver] INFO - batch jitter: 0.0
    2023-06-30 11:05:32,389 [mlserver] INFO - connection timeout: 60
    2023-06-30 11:05:32,389 [mlserver] INFO - micro-batch size: 1
    2023-06-30 11:05:32,503 [mlserver] INFO - Finalizer: processed instances: 100
    2023-06-30 11:05:32,503 [mlserver] INFO - Total processed instances: 100
    2023-06-30 11:05:32,503 [mlserver] INFO - Time taken: 0.11 seconds
    
    %%bash
    mlserver infer -u ${MESH_IP} -m iris-pipeline.pipeline -i batch-inputs/iris-input.txt -o /tmp/iris-pipeline-output.txt --workers 5
    
    2023-06-30 11:05:35,857 [mlserver] INFO - server url: 172.18.255.2
    2023-06-30 11:05:35,858 [mlserver] INFO - model name: iris-pipeline.pipeline
    2023-06-30 11:05:35,858 [mlserver] INFO - request headers: {}
    2023-06-30 11:05:35,858 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
    2023-06-30 11:05:35,858 [mlserver] INFO - output file path: /tmp/iris-pipeline-output.txt
    2023-06-30 11:05:35,858 [mlserver] INFO - workers: 5
    2023-06-30 11:05:35,858 [mlserver] INFO - retries: 3
    2023-06-30 11:05:35,858 [mlserver] INFO - batch interval: 0.0
    2023-06-30 11:05:35,858 [mlserver] INFO - batch jitter: 0.0
    2023-06-30 11:05:35,858 [mlserver] INFO - connection timeout: 60
    2023-06-30 11:05:35,858 [mlserver] INFO - micro-batch size: 1
    2023-06-30 11:05:36,145 [mlserver] INFO - Finalizer: processed instances: 100
    2023-06-30 11:05:36,146 [mlserver] INFO - Total processed instances: 100
    2023-06-30 11:05:36,146 [mlserver] INFO - Time taken: 0.29 seconds
    
    cat /tmp/iris-output.txt | head -n 1 | jq -M .
    {
      "model_name": "iris_1",
      "model_version": "1",
      "id": "46bdfca2-8805-4a72-b1ce-95e4f38c1a19",
      "parameters": {
        "inference_id": "46bdfca2-8805-4a72-b1ce-95e4f38c1a19",
        "batch_index": 0
      },
      "outputs": [
        {
          "name": "predict",
          "shape": [
            1,
            1
          ],
          "datatype": "INT64",
          "parameters": {
            "content_type": "np"
          },
          "data": [
            1
          ]
        }
      ]
    }
    
    cat /tmp/iris-pipeline-output.txt | head -n 1 | jq .
    {
      "model_name": "",
      "id": "37e8c013-b348-41e8-89b9-fea86a4f9632",
      "parameters": {
        "batch_index": 1
      },
      "outputs": [
        {
          "name": "predict",
          "shape": [
            1,
            1
          ],
          "datatype": "INT64",
          "parameters": {
            "content_type": "np"
          },
          "data": [
            1
          ]
        }
      ]
    }
    
    cat batch-inputs/tfsimple-input.txt | head -n 1 | jq -M .
    {
      "inputs": [
        {
          "name": "INPUT0",
          "data": [
            75,
            39,
            9,
            44,
            32,
            97,
            99,
            40,
            13,
            27,
            25,
            36,
            18,
            77,
            62,
            60
          ],
          "datatype": "INT32",
          "shape": [
            1,
            16
          ]
        },
        {
          "name": "INPUT1",
          "data": [
            39,
            7,
            14,
            58,
            13,
            88,
            98,
            66,
            97,
            57,
            49,
            3,
            49,
            63,
            37,
            12
          ],
          "datatype": "INT32",
          "shape": [
            1,
            16
          ]
        }
      ]
    }
    
    %%bash
    mlserver infer -u ${MESH_IP} -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-output.txt --workers 5 -b
    
    2023-06-30 11:22:52,662 [mlserver] INFO - server url: 172.18.255.2
    2023-06-30 11:22:52,662 [mlserver] INFO - model name: tfsimple1
    2023-06-30 11:22:52,662 [mlserver] INFO - request headers: {}
    2023-06-30 11:22:52,662 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
    2023-06-30 11:22:52,662 [mlserver] INFO - output file path: /tmp/tfsimple-output.txt
    2023-06-30 11:22:52,662 [mlserver] INFO - workers: 5
    2023-06-30 11:22:52,662 [mlserver] INFO - retries: 3
    2023-06-30 11:22:52,662 [mlserver] INFO - batch interval: 0.0
    2023-06-30 11:22:52,662 [mlserver] INFO - batch jitter: 0.0
    2023-06-30 11:22:52,662 [mlserver] INFO - connection timeout: 60
    2023-06-30 11:22:52,662 [mlserver] INFO - micro-batch size: 1
    2023-06-30 11:22:52,755 [mlserver] INFO - Finalizer: processed instances: 100
    2023-06-30 11:22:52,755 [mlserver] INFO - Total processed instances: 100
    2023-06-30 11:22:52,756 [mlserver] INFO - Time taken: 0.09 seconds
    
    %%bash
    mlserver infer -u ${MESH_IP} -m tfsimple.pipeline -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-pipeline-output.txt --workers 5
    
    2023-06-30 11:22:54,065 [mlserver] INFO - server url: 172.18.255.2
    2023-06-30 11:22:54,065 [mlserver] INFO - model name: tfsimple.pipeline
    2023-06-30 11:22:54,065 [mlserver] INFO - request headers: {}
    2023-06-30 11:22:54,065 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
    2023-06-30 11:22:54,065 [mlserver] INFO - output file path: /tmp/tfsimple-pipeline-output.txt
    2023-06-30 11:22:54,065 [mlserver] INFO - workers: 5
    2023-06-30 11:22:54,065 [mlserver] INFO - retries: 3
    2023-06-30 11:22:54,065 [mlserver] INFO - batch interval: 0.0
    2023-06-30 11:22:54,065 [mlserver] INFO - batch jitter: 0.0
    2023-06-30 11:22:54,065 [mlserver] INFO - connection timeout: 60
    2023-06-30 11:22:54,065 [mlserver] INFO - micro-batch size: 1
    2023-06-30 11:22:54,302 [mlserver] INFO - Finalizer: processed instances: 100
    2023-06-30 11:22:54,302 [mlserver] INFO - Total processed instances: 100
    2023-06-30 11:22:54,303 [mlserver] INFO - Time taken: 0.24 seconds
    
    cat /tmp/tfsimple-output.txt | head -n 1 | jq -M .
    {
      "model_name": "tfsimple1_1",
      "model_version": "1",
      "id": "19952272-b023-4079-aa08-f1880ded05e5",
      "parameters": {
        "inference_id": "19952272-b023-4079-aa08-f1880ded05e5",
        "batch_index": 1
      },
      "outputs": [
        {
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32",
          "parameters": {},
          "data": [
            115,
            69,
            97,
            112,
            73,
            106,
            58,
            182,
            114,
            66,
            64,
            110,
            100,
            24,
            22,
            77
          ]
        },
        {
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32",
          "parameters": {},
          "data": [
            -77,
            33,
            25,
            -52,
            -49,
            -88,
            -48,
            0,
            -50,
            26,
            -44,
            46,
            -2,
            18,
            -6,
            -47
          ]
        }
      ]
    }
    
    cat /tmp/tfsimple-pipeline-output.txt | head -n 1 | jq -M .
    {
      "model_name": "",
      "id": "46b05aab-07d9-414d-be96-c03d1863552a",
      "parameters": {
        "batch_index": 3
      },
      "outputs": [
        {
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32",
          "data": [
            140,
            164,
            85,
            58,
            152,
            76,
            70,
            56,
            100,
            141,
            98,
            181,
            115,
            177,
            106,
            193
          ]
        },
        {
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32",
          "data": [
            -10,
            0,
            -11,
            -38,
            2,
            -36,
            -52,
            -8,
            -18,
            57,
            94,
            -5,
            -27,
            17,
            58,
            -1
          ]
        }
      ]
    }
    
    kubectl delete -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
    kubectl delete -f pipelines/iris.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "iris" deleted
    pipeline.mlops.seldon.io "iris-pipeline" deleted
    
    kubectl delete -f models/tfsimple1.yaml -n ${NAMESPACE}
    kubectl delete -f pipelines/tfsimple.yaml -n ${NAMESPACE}
    model.mlops.seldon.io "tfsimple1" deleted
    pipeline.mlops.seldon.io "tfsimple" deleted
    

    Open Inference Protocol

    This page describes a predict/inference API independent of any specific ML/DL framework and model server. These APIs are able to support both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by being able to operate seamlessly on platforms that have standardized around this API. This protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and ONNX Runtime Server. It is sometimes referred to by its old name "V2 Inference Protocol".

    For an inference server to be compliant with this protocol the server must implement all APIs described below, except where an optional feature is explicitly noted. A compliant inference server may choose to implement either or both of the HTTP/REST API and the GRPC API.

    The protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.

    HTTP/REST

    A compliant server must implement the health, metadata, and inference APIs described in this section.

    The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.

    All strings in all contexts are case-sensitive.

    For Seldon a server must recognize the following URLs. The versions portion of the URL is shown as optional to allow implementations that don’t support versioning or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies).

    Health:

    Server Metadata:

    Model Metadata:

    Inference:

    Health

    A health request is made with an HTTP GET to a health endpoint. The HTTP response status code indicates a boolean result for the health request. A 200 status code indicates true and a 4xx status code indicates false. The HTTP response body should be empty. There are three health APIs.

    Server Live

    The “server live” API indicates if the inference server is able to receive and respond to metadata and inference requests. The “server live” API can be used directly to implement the Kubernetes livenessProbe.

    Server Ready

    The “server ready” health API indicates if all the models are ready for inferencing. The “server ready” health API can be used directly to implement the Kubernetes readinessProbe.

    Model Ready

    The “model ready” health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies.
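    As a quick illustration, the three health endpoints can be exercised with plain HTTP GETs. This is a minimal sketch assuming a compliant server (for example, an MLServer instance) listening on localhost:8080 and a deployed model named iris; adjust the host, port, and model name to your setup.

    # Liveness: a 200 status means the server can receive requests
    curl -i http://localhost:8080/v2/health/live

    # Readiness: a 200 status means all models are ready for inferencing
    curl -i http://localhost:8080/v2/health/ready

    # Per-model readiness; the version segment is optional
    curl -i http://localhost:8080/v2/models/iris/ready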

    Server Metadata

    The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object.

    Server Metadata Response JSON Object

    A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response, is returned in the HTTP body.

    • “name” : A descriptive name for the server.

    • "version" : The server version.

    • “extensions” : The extensions supported by the server. Currently no standard extensions are defined. Individual inference servers may define and document their own extensions.
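    For example, against a compliant server assumed to be listening on localhost:8080, the server metadata can be retrieved as follows; the response body follows the $metadata_server_response schema shown later on this page.

    curl -s http://localhost:8080/v2 | jq -M .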

    Server Metadata Response JSON Error Object

    A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object.

    • “error” : The descriptive message for the error.

    Model Metadata

    The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

    Model Metadata Response JSON Object

    A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response, is returned in the HTTP body for every successful model metadata request.

    • “name” : The name of the model.

    • "versions" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don’t support versions. Optional for models that don’t allow a version to be explicitly requested.

    • “platform” : The framework/backend for the model. See Platforms.

    • “inputs” : The inputs required by the model.

    • “outputs” : The outputs produced by the model.

    The metadata for each model input and output tensor is described with a $metadata_tensor object.

    • “name” : The name of the tensor.

    • "datatype" : The data-type of the tensor elements as defined inTensor Data Types.

    • "shape" : The shape of the tensor. Variable-size dimensions are specified as -1.

    Model Metadata Response JSON Error Object

    A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object.

    • “error” : The descriptive message for the error.

    Inference

    An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object. In the corresponding response the HTTP body contains the Inference Response JSON Object or the Inference Response JSON Error Object. See Inference Request Examples for some example HTTP/REST requests and responses.

    Inference Request JSON Object

    The inference request object, identified as $inference_request, is required in the HTTP body of the POST request. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

    • id : An identifier for this request. Optional, but if specified this identifier must be returned in the response.

    • parameters : An object containing zero or more parameters for this inference request expressed as key/value pairs. See Parameters for more information.

    • inputs : The input tensors. Each input is described using the $request_input schema defined in Request Input.

    • outputs : The output tensors requested for this inference. Each requested output is described using the $request_output schema defined in Request Output. Optional, if not specified all outputs produced by the model will be returned using default $request_output settings.

    Request Input

    The $request_input JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch.

    • "name": The name of the input tensor.

    • "shape": The shape of the input tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value.

    • "datatype": The data-type of the input tensor elements as defined in Tensor Data Types.

    • "parameters": An object containing zero or more parameters for this input expressed as key/value pairs. See for more information.

    • “data”: The contents of the tensor. See Tensor Data for more information.
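    To illustrate the batching rule above, a request carrying two 4-feature rows uses a single input tensor whose shape and flattened data cover the whole batch. This is a sketch assuming an iris-style model served on localhost:8080; the tensor name "predict" mirrors the earlier examples in this document.

    curl -s -X POST http://localhost:8080/v2/models/iris/infer \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "predict",
            "shape": [2, 4],
            "datatype": "FP32",
            "data": [5.1, 3.5, 1.4, 0.2, 6.2, 2.8, 4.8, 1.8]
          }
        ]
      }' | jq -M .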

    Request Output

    The $request_output JSON is used to request which output tensors should be returned from the model.

    • "name": The name of the output tensor.

    • "parameters": An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information.

    Inference Response JSON Object

    A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response, is returned in the HTTP body.

    • "model_name": The name of the model used for inference.

    • "model_version": The specific model version used for inference. Inference servers that do not implement versioning should not provide this field in the response.

    • "id": The "id" identifier given in the request, if any.

    • "parameters": An object containing zero or more parameters for this response expressed as key/value pairs. See for more information.

    • "outputs": The output tensors. Each output is described using the$response_output schema defined in.

    Response Output

    The $response_output JSON describes an output from the model. If the output is batched, the shape and data represent the full shape of the entire batch.

    • "name": The name of the output tensor.

    • "shape": The shape of the output tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value.

    • "datatype": The data-type of the output tensor elements as defined in Tensor Data Types.

    • "parameters": An object containing zero or more parameters for this input expressed as key/value pairs. See for more information.

    • “data”: The contents of the tensor. See Tensor Data for more information.

    Inference Response JSON Error Object

    A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $inference_error_response object.

    • “error”: The descriptive message for the error.

    Inference Request Examples

    The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object.

    For the above request the inference server must return the “output0” output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the following response would be returned.

    Parameters

    The $parameters JSON describes zero or more “name”/”value” pairs, where the “name” is the name of the parameter and the “value” is a $string, $number, or $boolean.

    Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.

    Tensor Data

    Tensor data must be presented in row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation.

    Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 is problematic to communicate explicitly since there is not a standard fp16 representation across backends nor typically the programmatic support to create the fp16 representation for a JSON number.

    For example, the 2-dimensional matrix:

    Can be represented in its natural format as:

    Or in a flattened one-dimensional representation:

    GRPC

    The GRPC API closely follows the concepts defined in the HTTP/REST API. A compliant server must implement the health, metadata, and inference APIs described in this section.

    All strings in all contexts are case-sensitive.

    The GRPC definition of the service is:

    Health

    A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure.

    Server Live

    The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. The request and response messages for ServerLive are:

    Server Ready

    The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are:

    Model Ready

    The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are:

    Server Metadata

    The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are:

    Model Metadata

    The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are:

    Inference

    The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are:
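    The CLI examples earlier in this document map directly onto ModelInferRequest: model_name, the InferInputTensor fields, and the typed contents (here fp32_contents) are populated from the JSON passed on the command line. For instance, assuming the iris model and the ${MESH_IP} variable from the Kubernetes examples:

    seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP}:80 \
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .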

    Parameters

    The Parameters message describes a “name”/”value” pair, where the “name” is the name of the parameter and the “value” is a boolean, integer, or string corresponding to the parameter.

    Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.

    Tensor Data

    In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements.

    Using a "raw" representation of tensors withModelInferRequest::raw_input_contents andModelInferResponse::raw_output_contents will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see issue here.

    An alternative to the "raw" representation is to use InferTensorContents to represent the tensor data in a format that matches the tensor's data type.

    Platforms

    A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a Model Metadata request but is information only. The proposed inference APIs are generic relative to the DL/ML framework used by a model and so a client does not need to know the platform of a given model to use the API. Platform names use the format “<project>_<format>”. The following platform names are allowed:

    • tensorrt_plan: A TensorRT model encoded as a serialized engine or “plan”.

    • tensorflow_graphdef: A TensorFlow model encoded as a GraphDef.

    • tensorflow_savedmodel: A TensorFlow model encoded as a SavedModel.

    • onnx_onnxv1: An ONNX model encoded for ONNX Runtime.

    • pytorch_torchscript: A PyTorch model encoded as TorchScript.

    • mxnet_mxnet: An MXNet model.

    • caffe2_netdef: A Caffe2 model encoded as a NetDef.

    Tensor Data Types

    Tensor data types are shown in the following table along with the size of each type, in bytes.

    Data Type      Size (bytes)

    BOOL           1
    UINT8          1
    UINT16         2
    UINT32         4
    UINT64         8
    INT8           1
    INT16          2
    INT32          4
    INT64          8
    FP16           2
    FP32           4
    FP64           8
    BYTES          Variable (max 2^32)

    References

    This document is based on the original KServe protocol specification, created during the lifetime of the KFServing project in Kubeflow by its various contributors, including Seldon, NVIDIA, IBM, Bloomberg, and others.

    Multi-Namespace Kubernetes

    The setup below also illustrates using Kafka-specific prefixes for topics and consumer IDs, which provides isolation when the Kafka cluster is shared with other applications and you want to enforce constraints. This is not strictly needed in this example, as Kafka is installed here just for Seldon.

    Run Models in Different Namespaces

    Note: The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended. For more details, see Seldon CLI.

    GET v2/health/live
    GET v2/health/ready
    GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
    GET v2
    GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
    POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
    $metadata_server_response =
    {
      "name" : $string,
      "version" : $string,
      "extensions" : [ $string, ... ]
    }
    $metadata_server_error_response =
    {
      "error": $string
    }
    $metadata_model_response =
    {
      "name" : $string,
      "versions" : [ $string, ... ] #optional,
      "platform" : $string,
      "inputs" : [ $metadata_tensor, ... ],
      "outputs" : [ $metadata_tensor, ... ]
    }
    $metadata_tensor =
    {
      "name" : $string,
      "datatype" : $string,
      "shape" : [ $number, ... ]
    }
    $metadata_model_error_response =
    {
      "error": $string
    }
    $inference_request =
    {
      "id" : $string #optional,
      "parameters" : $parameters #optional,
      "inputs" : [ $request_input, ... ],
      "outputs" : [ $request_output, ... ] #optional
    }
    $request_input =
    {
      "name" : $string,
      "shape" : [ $number, ... ],
      "datatype"  : $string,
      "parameters" : $parameters #optional,
      "data" : $tensor_data
    }
    $request_output =
    {
      "name" : $string,
      "parameters" : $parameters #optional,
    }
    $inference_response =
    {
      "model_name" : $string,
      "model_version" : $string #optional,
      "id" : $string,
      "parameters" : $parameters #optional,
      "outputs" : [ $response_output, ... ]
    }
    $response_output =
    {
      "name" : $string,
      "shape" : [ $number, ... ],
      "datatype"  : $string,
      "parameters" : $parameters #optional,
      "data" : $tensor_data
    }
    $inference_error_response =
    {
      "error": <error message string>
    }
    POST /v2/models/mymodel/infer HTTP/1.1
    Host: localhost:8000
    Content-Type: application/json
    Content-Length: <xx>
    {
      "id" : "42",
      "inputs" : [
        {
          "name" : "input0",
          "shape" : [ 2, 2 ],
          "datatype" : "UINT32",
          "data" : [ 1, 2, 3, 4 ]
        },
        {
          "name" : "input1",
          "shape" : [ 3 ],
          "datatype" : "BOOL",
          "data" : [ true ]
        }
      ],
      "outputs" : [
        {
          "name" : "output0"
        }
      ]
    }
    HTTP/1.1 200 OK
    Content-Type: application/json
    Content-Length: <yy>
    {
      "id" : "42"
      "outputs" : [
        {
          "name" : "output0",
          "shape" : [ 3, 2 ],
          "datatype"  : "FP32",
          "data" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ]
        }
      ]
    }
    $parameters =
    {
      $parameter, ...
    }
    
    $parameter = $string : $string | $number | $boolean
    [ 1 2
      4 5 ]
    "data" : [ [ 1, 2 ], [ 4, 5 ] ]
    "data" : [ 1, 2, 4, 5 ]
    //
    // Inference Server GRPC endpoints.
    //
    service GRPCInferenceService
    {
      // Check liveness of the inference server.
      rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {}
    
      // Check readiness of the inference server.
      rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {}
    
      // Check readiness of a model in the inference server.
      rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {}
    
      // Get server metadata.
      rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {}
    
      // Get model metadata.
      rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {}
    
      // Perform inference using a specific model.
      rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {}
    }
    message ServerLiveRequest {}
    
    message ServerLiveResponse
    {
      // True if the inference server is live, false if not live.
      bool live = 1;
    }
    message ServerReadyRequest {}
    
    message ServerReadyResponse
    {
      // True if the inference server is ready, false if not ready.
      bool ready = 1;
    }
    message ModelReadyRequest
    {
      // The name of the model to check for readiness.
      string name = 1;
    
      // The version of the model to check for readiness. If not given the
      // server will choose a version based on the model and internal policy.
      string version = 2;
    }
    
    message ModelReadyResponse
    {
      // True if the model is ready, false if not ready.
      bool ready = 1;
    }
    message ServerMetadataRequest {}
    
    message ServerMetadataResponse
    {
      // The server name.
      string name = 1;
    
      // The server version.
      string version = 2;
    
      // The extensions supported by the server.
      repeated string extensions = 3;
    }
    message ModelMetadataRequest
    {
      // The name of the model.
      string name = 1;
    
      // The version of the model to check for readiness. If not given the
      // server will choose a version based on the model and internal policy.
      string version = 2;
    }
    
    message ModelMetadataResponse
    {
      // Metadata for a tensor.
      message TensorMetadata
      {
        // The tensor name.
        string name = 1;
    
        // The tensor data type.
        string datatype = 2;
    
        // The tensor shape. A variable-size dimension is represented
        // by a -1 value.
        repeated int64 shape = 3;
      }
    
      // The model name.
      string name = 1;
    
      // The versions of the model available on the server.
      repeated string versions = 2;
    
      // The model's platform. See Platforms.
      string platform = 3;
    
      // The model's inputs.
      repeated TensorMetadata inputs = 4;
    
      // The model's outputs.
      repeated TensorMetadata outputs = 5;
    }
    message ModelInferRequest
    {
      // An input tensor for an inference request.
      message InferInputTensor
      {
        // The tensor name.
        string name = 1;
    
        // The tensor data type.
        string datatype = 2;
    
        // The tensor shape.
        repeated int64 shape = 3;
    
        // Optional inference input tensor parameters.
        map<string, InferParameter> parameters = 4;
    
        // The tensor contents using a data-type format. This field must
        // not be specified if "raw" tensor contents are being used for
        // the inference request.
        InferTensorContents contents = 5;
      }
    
      // An output tensor requested for an inference request.
      message InferRequestedOutputTensor
      {
        // The tensor name.
        string name = 1;
    
        // Optional requested output tensor parameters.
        map<string, InferParameter> parameters = 2;
      }
    
      // The name of the model to use for inferencing.
      string model_name = 1;
    
      // The version of the model to use for inference. If not given the
      // server will choose a version based on the model and internal policy.
      string model_version = 2;
    
      // Optional identifier for the request. If specified will be
      // returned in the response.
      string id = 3;
    
      // Optional inference parameters.
      map<string, InferParameter> parameters = 4;
    
      // The input tensors for the inference.
      repeated InferInputTensor inputs = 5;
    
      // The requested output tensors for the inference. Optional, if not
      // specified all outputs produced by the model will be returned.
      repeated InferRequestedOutputTensor outputs = 6;
    
      // The data contained in an input tensor can be represented in "raw"
      // bytes form or in the repeated type that matches the tensor's data
      // type. To use the raw representation 'raw_input_contents' must be
      // initialized with data for each tensor in the same order as
      // 'inputs'. For each tensor, the size of this content must match
      // what is expected by the tensor's shape and data type. The raw
      // data must be the flattened, one-dimensional, row-major order of
      // the tensor elements without any stride or padding between the
      // elements. Note that the FP16 data type must be represented as raw
      // content as there is no specific data type for a 16-bit float
      // type.
      //
      // If this field is specified then InferInputTensor::contents must
      // not be specified for any input tensor.
      repeated bytes raw_input_contents = 7;
    }
    
    message ModelInferResponse
    {
      // An output tensor returned for an inference request.
      message InferOutputTensor
      {
        // The tensor name.
        string name = 1;
    
        // The tensor data type.
        string datatype = 2;
    
        // The tensor shape.
        repeated int64 shape = 3;
    
        // Optional output tensor parameters.
        map<string, InferParameter> parameters = 4;
    
        // The tensor contents using a data-type format. This field must
        // not be specified if "raw" tensor contents are being used for
        // the inference response.
        InferTensorContents contents = 5;
      }
    
      // The name of the model used for inference.
      string model_name = 1;
    
      // The version of the model used for inference.
      string model_version = 2;
    
      // The id of the inference request if one was specified.
      string id = 3;
    
      // Optional inference response parameters.
      map<string, InferParameter> parameters = 4;
    
      // The output tensors holding inference results.
      repeated InferOutputTensor outputs = 5;
    
      // The data contained in an output tensor can be represented in
      // "raw" bytes form or in the repeated type that matches the
      // tensor's data type. To use the raw representation 'raw_output_contents'
      // must be initialized with data for each tensor in the same order as
      // 'outputs'. For each tensor, the size of this content must match
      // what is expected by the tensor's shape and data type. The raw
      // data must be the flattened, one-dimensional, row-major order of
      // the tensor elements without any stride or padding between the
      // elements. Note that the FP16 data type must be represented as raw
      // content as there is no specific data type for a 16-bit float
      // type.
      //
      // If this field is specified then InferOutputTensor::contents must
      // not be specified for any output tensor.
      repeated bytes raw_output_contents = 6;
    }
    //
    // An inference parameter value.
    //
    message InferParameter
    {
      // The parameter value can be a string, an int64, a boolean
      // or a message specific to a predefined parameter.
      oneof parameter_choice
      {
        // A boolean parameter value.
        bool bool_param = 1;
    
        // An int64 parameter value.
        int64 int64_param = 2;
    
        // A string parameter value.
        string string_param = 3;
      }
    }
    //
    // The data contained in a tensor represented by the repeated type
    // that matches the tensor's data type. Protobuf oneof is not used
    // because oneofs cannot contain repeated fields.
    //
    message InferTensorContents
    {
      // Representation for BOOL data type. The size must match what is
      // expected by the tensor's shape. The contents must be the flattened,
      // one-dimensional, row-major order of the tensor elements.
      repeated bool bool_contents = 1;
    
      // Representation for INT8, INT16, and INT32 data types. The size
      // must match what is expected by the tensor's shape. The contents
      // must be the flattened, one-dimensional, row-major order of the
      // tensor elements.
      repeated int32 int_contents = 2;
    
      // Representation for INT64 data types. The size must match what
      // is expected by the tensor's shape. The contents must be the
      // flattened, one-dimensional, row-major order of the tensor elements.
      repeated int64 int64_contents = 3;
    
      // Representation for UINT8, UINT16, and UINT32 data types. The size
      // must match what is expected by the tensor's shape. The contents
      // must be the flattened, one-dimensional, row-major order of the
      // tensor elements.
      repeated uint32 uint_contents = 4;
    
      // Representation for UINT64 data types. The size must match what
      // is expected by the tensor's shape. The contents must be the
      // flattened, one-dimensional, row-major order of the tensor elements.
      repeated uint64 uint64_contents = 5;
    
      // Representation for FP32 data type. The size must match what is
      // expected by the tensor's shape. The contents must be the flattened,
      // one-dimensional, row-major order of the tensor elements.
      repeated float fp32_contents = 6;
    
      // Representation for FP64 data type. The size must match what is
      // expected by the tensor's shape. The contents must be the flattened,
      // one-dimensional, row-major order of the tensor elements.
      repeated double fp64_contents = 7;
    
      // Representation for BYTES data type. The size must match what is
      // expected by the tensor's shape. The contents must be the flattened,
      // one-dimensional, row-major order of the tensor elements.
      repeated bytes bytes_contents = 8;
    }


    Pipelines

    If you have installed Kafka via the ansible playbook setup-ecosystem, you can use the following command to see the consumer group ids, which reflect the settings we created.

    We can similarly show the topics that have been created.

    TearDown

    curl -k http://${MESH_IP_NS1}:80/v2/models/iris/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: iris" \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "predict",
            "datatype": "FP32",
            "shape": [1,4],
            "data": [[1,2,3,4]]
          }
        ]
      }' | jq -M .
    seldon model infer iris --inference-host ${MESH_IP_NS1}:80 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    curl -k http://${MESH_IP_NS1}:80/v2/models/iris/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: iris" \
      -H "Content-Type: application/json" \
      -d '{
        "model_name": "iris",
        "inputs": [
          {
            "name": "input",
            "datatype": "FP32",
            "shape": [1,4],
            "data": [1,2,3,4]
          }
        ]
      }' | jq -M .
    seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP_NS1}:80 \
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
    curl -k http://${MESH_IP_NS2}:80/v2/models/iris/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: iris" \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "predict",
            "datatype": "FP32",
            "shape": [1,4],
            "data": [[1,2,3,4]]
          }
        ]
      }' | jq -M .
    seldon model infer iris --inference-host ${MESH_IP_NS2}:80 \
      '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
    curl -k http://${MESH_IP_NS2}:80/v2/models/iris/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: iris" \
      -H "Content-Type: application/json" \
      -d '{
        "model_name": "iris",
        "inputs": [
          {
            "name": "input",
            "datatype": "FP32",
            "shape": [1,4],
            "data": [1,2,3,4]
          }
        ]
      }' | jq -M .
    
    seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP_NS2}:80 \
       '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
    curl -k http://${MESH_IP_NS1}:80/v2/models/tfsimples/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: tfsimples.pipeline" \
      -H "Content-Type: application/json" \
      -d '{
        "model_name": "simple",
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    
    seldon pipeline infer tfsimples --inference-mode grpc --inference-host ${MESH_IP_NS1}:80 \
        '{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    curl -k http://${MESH_IP_NS2}:80/v2/models/tfsimples/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: tfsimples.pipeline" \
      -H "Content-Type: application/json" \
      -d '{
        "model_name": "simple",
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    seldon pipeline infer tfsimples --inference-mode grpc --inference-host ${MESH_IP_NS2}:80 \
        '{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    helm upgrade --install seldon-core-v2-crds  ../k8s/helm-charts/seldon-core-v2-crds -n seldon-mesh
    Release "seldon-core-v2-crds" does not exist. Installing it now.
    NAME: seldon-core-v2-crds
    LAST DEPLOYED: Tue Aug 15 11:01:03 2023
    NAMESPACE: seldon-mesh
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    
    helm upgrade --install seldon-v2 ../k8s/helm-charts/seldon-core-v2-setup/ -n seldon-mesh \
        --set controller.clusterwide=true \
        --set kafka.topicPrefix=myorg \
        --set kafka.consumerGroupIdPrefix=myorg
    Release "seldon-v2" does not exist. Installing it now.
    NAME: seldon-v2
    LAST DEPLOYED: Tue Aug 15 11:01:07 2023
    NAMESPACE: seldon-mesh
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    
    kubectl create namespace ns1
    kubectl create namespace ns2
    namespace/ns1 created
    namespace/ns2 created
    
    helm install seldon-v2-runtime ../k8s/helm-charts/seldon-core-v2-runtime  -n ns1 --wait
    NAME: seldon-v2-runtime
    LAST DEPLOYED: Tue Aug 15 11:01:11 2023
    NAMESPACE: ns1
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    
    helm install seldon-v2-servers ../k8s/helm-charts/seldon-core-v2-servers  -n ns1 --wait
    NAME: seldon-v2-servers
    LAST DEPLOYED: Tue Aug 15 10:47:31 2023
    NAMESPACE: ns1
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    
    helm install seldon-v2-runtime ../k8s/helm-charts/seldon-core-v2-runtime  -n ns2 --wait
    NAME: seldon-v2-runtime
    LAST DEPLOYED: Tue Aug 15 10:53:12 2023
    NAMESPACE: ns2
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    
    helm install seldon-v2-servers ../k8s/helm-charts/seldon-core-v2-servers  -n ns2 --wait
    NAME: seldon-v2-servers
    LAST DEPLOYED: Tue Aug 15 10:53:28 2023
    NAMESPACE: ns2
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    
    kubectl wait --for condition=ready --timeout=300s server --all -n ns1
    server.mlops.seldon.io/mlserver condition met
    server.mlops.seldon.io/triton condition met
    
    kubectl wait --for condition=ready --timeout=300s server --all -n ns2
    server.mlops.seldon.io/mlserver condition met
    server.mlops.seldon.io/triton condition met
    
    MESH_IP=!kubectl get svc seldon-mesh -n ns1 -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    MESH_IP_NS1=MESH_IP[0]
    import os
    os.environ['MESH_IP_NS1'] = MESH_IP_NS1
    MESH_IP_NS1
    '172.18.255.2'
    
    MESH_IP=!kubectl get svc seldon-mesh -n ns2 -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
    MESH_IP_NS2=MESH_IP[0]
    import os
    os.environ['MESH_IP_NS2'] = MESH_IP_NS2
    MESH_IP_NS2
    '172.18.255.4'
    
    cat ./models/sklearn-iris-gs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: iris
    spec:
      storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
      requirements:
      - sklearn
      memory: 100Ki
    
    kubectl create -f ./models/sklearn-iris-gs.yaml -n ns1
    model.mlops.seldon.io/iris created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n ns1
    model.mlops.seldon.io/iris condition met
    
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "3ca1757c-df02-4e57-87c1-38311bcc5943",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    {
      "modelName": "iris_1",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "predict",
          "datatype": "INT64",
          "shape": [
            "1",
            "1"
          ],
          "parameters": {
            "content_type": {
              "stringParam": "np"
            }
          },
          "contents": {
            "int64Contents": [
              "2"
            ]
          }
        }
      ]
    }
    
    kubectl create -f ./models/sklearn-iris-gs.yaml -n ns2
    model.mlops.seldon.io/iris created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n ns2
    model.mlops.seldon.io/iris condition met
    {
    	"model_name": "iris_1",
    	"model_version": "1",
    	"id": "f706a23e-775f-4765-bd18-2e98d83bf7d5",
    	"parameters": {},
    	"outputs": [
    		{
    			"name": "predict",
    			"shape": [
    				1,
    				1
    			],
    			"datatype": "INT64",
    			"parameters": {
    				"content_type": "np"
    			},
    			"data": [
    				2
    			]
    		}
    	]
    }
    
    {
      "modelName": "iris_1",
      "modelVersion": "1",
      "outputs": [
        {
          "name": "predict",
          "datatype": "INT64",
          "shape": [
            "1",
            "1"
          ],
          "parameters": {
            "content_type": {
              "stringParam": "np"
            }
          },
          "contents": {
            "int64Contents": [
              "2"
            ]
          }
        }
      ]
    }
    
    kubectl delete -f ./models/sklearn-iris-gs.yaml -n ns1
    kubectl delete -f ./models/sklearn-iris-gs.yaml -n ns2
    model.mlops.seldon.io "iris" deleted
    model.mlops.seldon.io "iris" deleted
    
    kubectl create -f ./models/tfsimple1.yaml -n ns1
    kubectl create -f ./models/tfsimple2.yaml -n ns1
    kubectl create -f ./models/tfsimple1.yaml -n ns2
    kubectl create -f ./models/tfsimple2.yaml -n ns2
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n ns1
    kubectl wait --for condition=ready --timeout=300s model --all -n ns2
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    
    kubectl create -f ./pipelines/tfsimples.yaml -n ns1
    kubectl create -f ./pipelines/tfsimples.yaml -n ns2
    pipeline.mlops.seldon.io/tfsimples created
    pipeline.mlops.seldon.io/tfsimples created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n ns1
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n ns2
    pipeline.mlops.seldon.io/tfsimples condition met
    pipeline.mlops.seldon.io/tfsimples condition met
    
    {
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        }
      ]
    }
    
    {
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        }
      ]
    }
    
    kubectl exec seldon-kafka-0 -n seldon-mesh -- bin/kafka-consumer-groups.sh --list --bootstrap-server localhost:9092
    myorg-ns2-seldon-pipelinegateway-dfd61b49-4bb9-4684-adce-0b7cc215d3af
    myorg-ns2-seldon-modelgateway-17
    myorg-ns1-seldon-pipelinegateway-d4fc83e6-29cb-442e-90cd-92a389961cfe
    myorg-ns2-seldon-modelgateway-60
    myorg-ns2-seldon-dataflow-73d465744b7b1b5be20e88d6245e50bd
    myorg-ns1-seldon-modelgateway-60
    myorg-ns1-seldon-modelgateway-17
    myorg-ns1-seldon-dataflow-f563e04e093caa20c03e6eced084331b
    
    kubectl exec seldon-kafka-0 -n seldon-mesh -- bin/kafka-topics.sh --bootstrap-server=localhost:9092 --list
    __consumer_offsets
    myorg.ns1.errors.errors
    myorg.ns1.model.iris.inputs
    myorg.ns1.model.iris.outputs
    myorg.ns1.model.tfsimple1.inputs
    myorg.ns1.model.tfsimple1.outputs
    myorg.ns1.model.tfsimple2.inputs
    myorg.ns1.model.tfsimple2.outputs
    myorg.ns1.pipeline.tfsimples.inputs
    myorg.ns1.pipeline.tfsimples.outputs
    myorg.ns2.errors.errors
    myorg.ns2.model.iris.inputs
    myorg.ns2.model.iris.outputs
    myorg.ns2.model.tfsimple1.inputs
    myorg.ns2.model.tfsimple1.outputs
    myorg.ns2.model.tfsimple2.inputs
    myorg.ns2.model.tfsimple2.outputs
    myorg.ns2.pipeline.tfsimples.inputs
    myorg.ns2.pipeline.tfsimples.outputs
    
    kubectl delete -f ./pipelines/tfsimples.yaml -n ns1
    kubectl delete -f ./pipelines/tfsimples.yaml -n ns2
    pipeline.mlops.seldon.io "tfsimples" deleted
    pipeline.mlops.seldon.io "tfsimples" deleted
    
    kubectl delete -f ./models/tfsimple1.yaml -n ns1
    kubectl delete -f ./models/tfsimple2.yaml -n ns1
    kubectl delete -f ./models/tfsimple1.yaml -n ns2
    kubectl delete -f ./models/tfsimple2.yaml -n ns2
    model.mlops.seldon.io "tfsimple1" deleted
    model.mlops.seldon.io "tfsimple2" deleted
    model.mlops.seldon.io "tfsimple1" deleted
    model.mlops.seldon.io "tfsimple2" deleted
    
    helm delete seldon-v2-servers -n ns1 --wait
    helm delete seldon-v2-servers -n ns2 --wait
    release "seldon-v2-servers" uninstalled
    release "seldon-v2-servers" uninstalled
    
    helm delete seldon-v2-runtime -n ns1 --wait
    helm delete seldon-v2-runtime -n ns2 --wait
    release "seldon-v2-runtime" uninstalled
    release "seldon-v2-runtime" uninstalled
    
    helm delete seldon-v2 -n seldon-mesh --wait
    release "seldon-v2" uninstalled
    
    helm delete seldon-core-v2-crds -n seldon-mesh
    release "seldon-core-v2-crds" uninstalled
    
    kubectl delete namespace ns1
    kubectl delete namespace ns2
    namespace "ns1" deleted
    namespace "ns2" deleted
    

    Pipeline examples

    These examples illustrate a series of Pipelines showing different ways of combining flows of data and conditional logic. We assume you have Seldon Core 2 running locally.

    Before you begin

    1. Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.

    2. Ensure that you are performing these steps in the directory where you have downloaded the samples.

    3. Get the IP address of the Seldon Core 2 instance running with Istio:

    Make a note of the IP address that is displayed in the output. Replace <INGRESS_IP> with your service mesh's ingress IP address in the following commands.

    Models Used

    • gs://seldon-models/triton/simple: an example Triton TensorFlow model that takes two inputs, INPUT0 and INPUT1, adds them to produce OUTPUT0, and subtracts INPUT1 from INPUT0 to produce OUTPUT1. See here for the original source code and license.

    • Other models can be found at https://github.com/SeldonIO/triton-python-examples

    Model Chaining

    Chain the output of one model into the next. This also shows changing the tensor names via tensorMap to conform to the expected input tensor names of the second model.

    This pipeline chains the output of tfsimple1 into tfsimple2. This is possible because the two models have compatible shapes and data types. However, the output tensor names from tfsimple1 need to be renamed to match the input tensor names for tfsimple2. You can do this with the tensorMap feature.

    The output of the Pipeline is the output from tfsimple2.
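
    The renaming happens in the tfsimple2 step of the pipeline spec; the full definition is shown at the end of this example:
      steps:
        - name: tfsimple1
        - name: tfsimple2
          inputs:
          - tfsimple1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple1.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple2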

    You can use the Seldon CLI pipeline inspect feature to look at the data for all steps of the pipeline for the last data item passed through the pipeline (the default). This can be useful for debugging.

    Next, get the output as JSON and use the jq tool to extract just one value.
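
    For example, to view the last data item for each step and then pull out a single message value as JSON:
    seldon pipeline inspect tfsimples
    seldon pipeline inspect tfsimples --format json | jq -M .topics[0].msgs[0].value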

    Model Chaining from inputs

    Chain the output of one model into the next. Shows how a step can combine the inputs and outputs of a previous step.
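
    In the tfsimples-input pipeline, the tfsimple2 step takes one of tfsimple1's input tensors together with one of its output tensors:
      steps:
        - name: tfsimple1
        - name: tfsimple2
          inputs:
          - tfsimple1.inputs.INPUT0
          - tfsimple1.outputs.OUTPUT1
          tensorMap:
            tfsimple1.outputs.OUTPUT1: INPUT1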

    Model Join

    Join two flows of data from two models as input to a third model. This shows how individual flows of data can be combined.

    In this pipeline, the input to tfsimple3 is a join of one output tensor from each of the two previous models, tfsimple1 and tfsimple2. You need to use the tensorMap feature to rename each output tensor to one of the expected input tensor names of the tfsimple3 model.
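
    The tfsimple3 step in the join pipeline selects one output tensor from each upstream model and renames them:
      steps:
        - name: tfsimple1
        - name: tfsimple2
        - name: tfsimple3
          inputs:
          - tfsimple1.outputs.OUTPUT0
          - tfsimple2.outputs.OUTPUT1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple2.outputs.OUTPUT1: INPUT1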

    The outputs are the sequence 2, 4, 6, ..., which conforms to the logic of this model (addition and subtraction) when fed the outputs of the first two models.
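
    To see why, trace the request used below, where both pipeline inputs INPUT0 and INPUT1 are [1, 2, ..., 16] and are passed to both tfsimple1 and tfsimple2:
    tfsimple1.OUTPUT0 = INPUT0 + INPUT1 = [2, 4, ..., 32]
    tfsimple2.OUTPUT1 = INPUT0 - INPUT1 = [0, 0, ..., 0]
    tfsimple3.OUTPUT0 = [2, 4, ..., 32] + [0, 0, ..., 0] = [2, 4, ..., 32]
    tfsimple3.OUTPUT1 = [2, 4, ..., 32] - [0, 0, ..., 0] = [2, 4, ..., 32]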

    Conditional

    Shows conditional data flows - one of two models is run based on the output tensors from the first.

    Here we assume the conditional model can output two tensors, OUTPUT0 and OUTPUT1, but it only outputs the former if the CHOICE input tensor is set to 0; otherwise it outputs OUTPUT1. This means only one of the two downstream models receives data and runs. The output step does an "any" join from both models, and whichever data appears first is sent as the pipeline output. As only one of the two models add10 and mul10 runs in this case, we receive that model's output.
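
    The branching comes from each downstream step consuming a different conditional output tensor, while the pipeline output joins the two branches with stepsJoin: any:
      output:
        steps:
        - mul10
        - add10
        stepsJoin: any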

    The mul10 model runs as the CHOICE tensor is set to 0.

    The add10 model runs as the CHOICE tensor is not set to 0.

    Pipeline Input Tensors

    Access to individual tensors in pipeline inputs

    This pipeline shows how we can access pipeline inputs INPUT0 and INPUT1 from different steps.
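
    Each step picks out the tensor it needs by referencing it as <pipeline-name>.inputs.<tensor-name> and renaming it with tensorMap:
      steps:
      - name: mul10
        inputs:
        - pipeline-inputs.inputs.INPUT0
        tensorMap:
          pipeline-inputs.inputs.INPUT0: INPUT
      - name: add10
        inputs:
        - pipeline-inputs.inputs.INPUT1
        tensorMap:
          pipeline-inputs.inputs.INPUT1: INPUT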

    Trigger Joins

    Shows how joins can be used for triggers as well.

    Here we require a tensor named ok1 or ok2 to exist on the pipeline inputs to run the mul10 model, but require a tensor named ok3 to exist on the pipeline inputs to run the add10 model. The logic on mul10 is handled by a trigger join of any, meaning either of these input tensors can exist to satisfy the trigger join.
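
    The mul10 step declares both tensors as triggers with triggersJoinType: any, while add10 is triggered only by ok3:
      - name: mul10
        inputs:
        - trigger-joins.inputs.INPUT
        triggers:
        - trigger-joins.inputs.ok1
        - trigger-joins.inputs.ok2
        triggersJoinType: any
      - name: add10
        inputs:
        - trigger-joins.inputs.INPUT
        triggers:
        - trigger-joins.inputs.ok3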

    Trigger the first join.

    Now, you can trigger the second join.

    ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
    echo "Seldon Core 2: http://$ISTIO_INGRESS"
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    ---  
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    cat ./pipelines/tfsimples.yaml
    kubectl create -f ./pipelines/tfsimples.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimples created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimples condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimples/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimples.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1, 16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1, 16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq
    
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimples --inference-mode grpc --inference-host <INGRESS_IP>:80 \
        '{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        }
      ]
    }
    
    seldon pipeline inspect tfsimples
    seldon.default.model.tfsimple1.inputs	ciep298fh5ss73dpdir0	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
    seldon.default.model.tfsimple1.outputs	ciep298fh5ss73dpdir0	{"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    seldon.default.model.tfsimple2.inputs	ciep298fh5ss73dpdir0	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.model.tfsimple2.outputs	ciep298fh5ss73dpdir0	{"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    seldon.default.pipeline.tfsimples.inputs	ciep298fh5ss73dpdir0	{"modelName":"tfsimples", "inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
    seldon.default.pipeline.tfsimples.outputs	ciep298fh5ss73dpdir0	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    
    seldon pipeline inspect tfsimples --format json | jq -M .topics[0].msgs[0].value
    {
      "inputs": [
        {
          "name": "INPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              1,
              2,
              3,
              4,
              5,
              6,
              7,
              8,
              9,
              10,
              11,
              12,
              13,
              14,
              15,
              16
            ]
          }
        },
        {
          "name": "INPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              1,
              2,
              3,
              4,
              5,
              6,
              7,
              8,
              9,
              10,
              11,
              12,
              13,
              14,
              15,
              16
            ]
          }
        }
      ]
    }
    
    kubectl delete -f ./pipelines/tfsimples.yaml -n seldon-mesh
    pipeline.mlops.seldon.io "tfsimples" deleted
    
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io "tfsimple1" deleted
    model.mlops.seldon.io "tfsimple2" deleted
    
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    cat ./pipelines/tfsimples-input.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimples-input
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
          inputs:
          - tfsimple1.inputs.INPUT0
          - tfsimple1.outputs.OUTPUT1
          tensorMap:
            tfsimple1.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple2
    
    kubectl create -f ./pipelines/tfsimples-input.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimples-input created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimples-input condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimples-input/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimples-input.pipeline" \
      -d '{
        "model_name": "simple",
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimples-input \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    kubectl delete -f ./pipelines/tfsimples-input.yaml -n seldon-mesh
    pipeline.mlops.seldon.io "tfsimples-input" deleted
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io "tfsimple1" deleted
    model.mlops.seldon.io "tfsimple2" deleted
    cat ./models/tfsimple1.yaml
    echo "---"
    cat ./models/tfsimple2.yaml
    echo "---"
    cat ./models/tfsimple3.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple3
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple3.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    model.mlops.seldon.io/tfsimple3 created
    
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    model.mlops.seldon.io/tfsimple3 condition met
    
    cat ./pipelines/tfsimples-join.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: join
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
        - name: tfsimple3
          inputs:
          - tfsimple1.outputs.OUTPUT0
          - tfsimple2.outputs.OUTPUT1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple2.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple3
    kubectl create -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/join created
    curl -k http://<INGRESS_IP>:80/v2/models/join/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: join.pipeline" \
      -d '{
        "model_name": "simple",
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1, 16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1, 16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer join --inference-mode grpc --inference-host <INGRESS_IP>:80 \
        '{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT0",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        },
        {
          "name": "OUTPUT1",
          "datatype": "INT32",
          "shape": [
            "1",
            "16"
          ],
          "contents": {
            "intContents": [
              2,
              4,
              6,
              8,
              10,
              12,
              14,
              16,
              18,
              20,
              22,
              24,
              26,
              28,
              30,
              32
            ]
          }
        }
      ]
    }
    kubectl delete -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
    pipeline.mlops.seldon.io "join" deleted
    
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple3.yaml -n seldon-mesh
    model.mlops.seldon.io "tfsimple1" deleted
    model.mlops.seldon.io "tfsimple2" deleted
    model.mlops.seldon.io "tfsimple3" deleted
    
    cat ./models/conditional.yaml
    echo "---"
    cat ./models/add10.yaml
    echo "---"
    cat ./models/mul10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: conditional
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/conditional"
      requirements:
      - triton
      - python
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: add10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
      requirements:
      - triton
      - python
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mul10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
      requirements:
      - triton
      - python
    
    kubectl create -f ./models/conditional.yaml -n seldon-mesh
    kubectl create -f ./models/add10.yaml -n seldon-mesh
    kubectl create -f ./models/mul10.yaml -n seldon-mesh
    model.mlops.seldon.io/conditional created
    model.mlops.seldon.io/add10 created
    model.mlops.seldon.io/mul10 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/conditional condition met
    model.mlops.seldon.io/add10 condition met
    model.mlops.seldon.io/mul10 condition met
    
    cat ./pipelines/conditional.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-conditional
    spec:
      steps:
      - name: conditional
      - name: mul10
        inputs:
        - conditional.outputs.OUTPUT0
        tensorMap:
          conditional.outputs.OUTPUT0: INPUT
      - name: add10
        inputs:
        - conditional.outputs.OUTPUT1
        tensorMap:
          conditional.outputs.OUTPUT1: INPUT
      output:
        steps:
        - mul10
        - add10
        stepsJoin: any
    
    kubectl create -f ./pipelines/conditional.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-conditional created
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-conditional condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple-conditional/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimple-conditional.pipeline" \
      -d '{
        "model_name": "conditional",
        "inputs": [
          {
            "name": "CHOICE",
            "datatype": "INT32",
            "shape": [1],
            "data": [0]
          },
          {
            "name": "INPUT0",
            "datatype": "FP32",
            "shape": [4],
            "data": [1,2,3,4]
          },
          {
            "name": "INPUT1",
            "datatype": "FP32",
            "shape": [4],
            "data": [1,2,3,4]
          }
        ]
      }' | jq -M .
    
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            10,
            20,
            30,
            40
          ],
          "name": "OUTPUT",
          "shape": [
            4
          ],
          "datatype": "FP32"
        }
      ]
    }
    
    seldon pipeline infer tfsimple-conditional --inference-mode grpc --inference-host <INGRESS_IP>:80 \
     '{"model_name":"conditional","inputs":[{"name":"CHOICE","contents":{"int_contents":[0]},"datatype":"INT32","shape":[1]},{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              10,
              20,
              30,
              40
            ]
          }
        }
      ]
    }
    
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple-conditional/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimple-conditional.pipeline" \
      -d '{
        "model_name": "conditional",
        "inputs": [
          {
            "name": "CHOICE",
            "datatype": "INT32",
            "shape": [1],
            "data": [1]
          },
          {
            "name": "INPUT0",
            "datatype": "FP32",
            "shape": [4],
            "data": [1,2,3,4]
          },
          {
            "name": "INPUT1",
            "datatype": "FP32",
            "shape": [4],
            "data": [1,2,3,4]
          }
        ]
      }' | jq -M .
    
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            11,
            12,
            13,
            14
          ],
          "name": "OUTPUT",
          "shape": [
            4
          ],
          "datatype": "FP32"
        }
      ]
    }
    
    seldon pipeline infer tfsimple-conditional --inference-mode grpc --inference-host <INGRESS_IP>:80 \
     '{"model_name":"conditional","inputs":[{"name":"CHOICE","contents":{"int_contents":[1]},"datatype":"INT32","shape":[1]},{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              11,
              12,
              13,
              14
            ]
          }
        }
      ]
    }
    
    kubectl delete -f ./pipelines/conditional.yaml -n seldon-mesh
    kubectl delete -f ./models/conditional.yaml -n seldon-mesh
    kubectl delete -f ./models/add10.yaml -n seldon-mesh
    kubectl delete -f ./models/mul10.yaml -n seldon-mesh
    cat ./models/mul10.yaml
    echo "---"
    cat ./models/add10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mul10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
      requirements:
      - triton
      - python
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: add10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
      requirements:
      - triton
      - python
    
    kubectl create -f ./models/mul10.yaml -n seldon-mesh
    kubectl create -f ./models/add10.yaml -n seldon-mesh
    model.mlops.seldon.io/mul10 created
    model.mlops.seldon.io/add10 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/mul10 condition met
    model.mlops.seldon.io/add10 condition met
    cat ./pipelines/pipeline-inputs.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: pipeline-inputs
    spec:
      steps:
      - name: mul10
        inputs:
        - pipeline-inputs.inputs.INPUT0
        tensorMap:
          pipeline-inputs.inputs.INPUT0: INPUT
      - name: add10
        inputs:
        - pipeline-inputs.inputs.INPUT1
        tensorMap:
          pipeline-inputs.inputs.INPUT1: INPUT
      output:
        steps:
        - mul10
        - add10
    
    kubectl create -f ./pipelines/pipeline-inputs.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/pipeline-inputs created
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/pipeline-inputs condition met
    curl -k http://<INGRESS_IP>:80/v2/models/pipeline-inputs/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: pipeline-inputs.pipeline" \
      -d '{
        "model_name": "pipeline",
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "FP32",
            "shape": [4],
            "data": [1,2,3,4]
          },
          {
            "name": "INPUT1",
            "datatype": "FP32",
            "shape": [4],
            "data": [1,2,3,4]
          }
        ]
      }' | jq -M .
    
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            10,
            20,
            30,
            40
          ],
          "name": "OUTPUT",
          "shape": [
            4
          ],
          "datatype": "FP32"
        },
        {
          "data": [
            11,
            12,
            13,
            14
          ],
          "name": "OUTPUT",
          "shape": [
            4
          ],
          "datatype": "FP32"
        }
      ]
    }
    
    seldon pipeline infer pipeline-inputs --inference-mode grpc --inference-host <INGRESS_IP>:80 \
        '{"model_name":"pipeline","inputs":[{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              10,
              20,
              30,
              40
            ]
          }
        },
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              11,
              12,
              13,
              14
            ]
          }
        }
      ]
    }
    
    kubectl delete -f ./pipelines/pipeline-inputs.yaml -n seldon-mesh
    kubectl delete -f ./models/mul10.yaml -n seldon-mesh
    kubectl delete -f ./models/add10.yaml -n seldon-mesh
    cat ./models/mul10.yaml
    cat ./models/add10.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: mul10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
      requirements:
      - triton
      - python
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: add10
    spec:
      storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
      requirements:
      - triton
      - python
    
    kubectl create -f ./models/mul10.yaml -n seldon-mesh
    kubectl create -f ./models/add10.yaml -n seldon-mesh
    
    model.mlops.seldon.io/mul10 created
    model.mlops.seldon.io/add10 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/mul10 condition met
    model.mlops.seldon.io/add10 condition met
    cat ./pipelines/trigger-joins.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: trigger-joins
    spec:
      steps:
      - name: mul10
        inputs:
        - trigger-joins.inputs.INPUT
        triggers:
        - trigger-joins.inputs.ok1
        - trigger-joins.inputs.ok2
        triggersJoinType: any
      - name: add10
        inputs:
        - trigger-joins.inputs.INPUT
        triggers:
        - trigger-joins.inputs.ok3
      output:
        steps:
        - mul10
        - add10
        stepsJoin: any
    
    kubectl create -f ./pipelines/trigger-joins.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/trigger-joins created
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/trigger-joins condition met
    curl -k http://<INGRESS_IP>:80/v2/models/trigger-joins/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: trigger-joins.pipeline" \
      -d '{
        "model_name": "pipeline",
        "inputs": [
          {
            "name": "ok1",
            "datatype": "FP32",
            "shape": [1],
            "data": [1]
          },
          {
            "name": "INPUT",
            "datatype": "FP32",
            "shape": [4],
            "data": [1,2,3,4]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            10,
            20,
            30,
            40
          ],
          "name": "OUTPUT",
          "shape": [
            4
          ],
          "datatype": "FP32"
        }
      ]
    }
    seldon pipeline infer trigger-joins --inference-mode grpc --inference-host <INGRESS_IP>:80 \
        '{"model_name":"pipeline","inputs":[{"name":"ok1","contents":{"fp32_contents":[1]},"datatype":"FP32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              10,
              20,
              30,
              40
            ]
          }
        }
      ]
    }
    curl -k http://<INGRESS_IP>:80/v2/models/trigger-joins/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: trigger-joins.pipeline" \
      -d '{
        "model_name": "pipeline",
        "inputs": [
          {
            "name": "ok3",
            "datatype": "FP32",
            "shape": [1],
            "data": [1]
          },
          {
            "name": "INPUT",
            "datatype": "FP32",
            "shape": [4],
            "data": [1,2,3,4]
          }
        ]
      }' | jq -M .
    
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            11,
            12,
            13,
            14
          ],
          "name": "OUTPUT",
          "shape": [
            4
          ],
          "datatype": "FP32"
        }
      ]
    }
    
    seldon pipeline infer trigger-joins --inference-mode grpc --inference-host <INGRESS_IP>:80 \
        '{"model_name":"pipeline","inputs":[{"name":"ok3","contents":{"fp32_contents":[1]},"datatype":"FP32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
    {
      "outputs": [
        {
          "name": "OUTPUT",
          "datatype": "FP32",
          "shape": [
            "4"
          ],
          "contents": {
            "fp32Contents": [
              11,
              12,
              13,
              14
            ]
          }
        }
      ]
    }
    
    kubectl delete -f ./pipelines/trigger-joins.yaml -n seldon-mesh
    pipeline.mlops.seldon.io "trigger-joins" deleted
    kubectl delete -f ./models/mul10.yaml -n seldon-mesh
    kubectl delete -f ./models/add10.yaml -n seldon-mesh
    model.mlops.seldon.io "mul10" deleted
    model.mlops.seldon.io "add10" deleted
    https://github.com/SeldonIO/seldon-core/blob/v2/samples/pipelines/tfsimples.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimples
    spec:
      steps:
        - name: tfsimple1
        - name: tfsimple2
          inputs:
          - tfsimple1
          tensorMap:
            tfsimple1.outputs.OUTPUT0: INPUT0
            tfsimple1.outputs.OUTPUT1: INPUT1
      output:
        steps:
        - tfsimple2
    

    Pipeline to pipeline examples

    These examples illustrate a series of Pipelines that are joined together.
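
    A Pipeline consumes the output of another Pipeline by declaring it under externalInputs in its input section. For example, the tfsimple-extended pipeline used below takes the outputs of the tfsimple pipeline and renames them to the tensor names its own step expects:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1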

    Before you begin

    1. Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.

    2. Ensure that you are performing these steps in the directory where you have downloaded the samples.

    3. Get the IP address of the Seldon Core 2 instance running with Istio:

    Make a note of the IP address that is displayed in the output. Replace <INGRESS_IP> with your service mesh's ingress IP address in the following commands.

    Models Used

    • gs://seldon-models/triton/simple an example Triton TensorFlow model that takes two inputs, INPUT0 and INPUT1, adds them to produce OUTPUT0, and subtracts INPUT1 from INPUT0 to produce OUTPUT1. See here for the original source code and license.

    • Other models can be found at https://github.com/SeldonIO/triton-python-examples

    Pipeline pulling from one other Pipeline

    Pipeline pulling from two other Pipelines

    Pipeline pulling from one Pipeline with a trigger to another

    Pipeline pulling from one other Pipeline Step

    Pipeline pulling from two other Pipeline steps from the same model

    ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
    echo "Seldon Core 2: http://$ISTIO_INGRESS"
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    cat ./pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    cat ./pipelines/tfsimple-extended.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended
    spec:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    
    kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -H "x-request-id: test-id" \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --header x-request-id=test-id --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    {
    	"model_name": "",
    	"outputs": [
    		{
    			"data": [
    				2,
    				4,
    				6,
    				8,
    				10,
    				12,
    				14,
    				16,
    				18,
    				20,
    				22,
    				24,
    				26,
    				28,
    				30,
    				32
    			],
    			"name": "OUTPUT0",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		},
    		{
    			"data": [
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0
    			],
    			"name": "OUTPUT1",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		}
    	]
    }
    
    seldon pipeline inspect tfsimple
    seldon.default.model.tfsimple1.inputs	test-id	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
    seldon.default.model.tfsimple1.outputs	test-id	{"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    seldon.default.pipeline.tfsimple.inputs	test-id	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
    seldon.default.pipeline.tfsimple.outputs	test-id	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    
    seldon pipeline inspect tfsimple-extended
    seldon.default.model.tfsimple2.inputs	test-id	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.model.tfsimple2.outputs	test-id	{"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    seldon.default.pipeline.tfsimple-extended.inputs	test-id	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended.outputs	test-id	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    
    kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    cat ./pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    cat ./pipelines/tfsimple-extended.yaml
    echo "---"
    cat ./pipelines/tfsimple-extended2.yaml
    echo "---"
    cat ./pipelines/tfsimple-combined.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended
    spec:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended2
    spec:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-combined
    spec:
      input:
        externalInputs:
          - tfsimple-extended.outputs.OUTPUT0
          - tfsimple-extended2.outputs.OUTPUT1
        tensorMap:
          tfsimple-extended.outputs.OUTPUT0: INPUT0
          tfsimple-extended2.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    
    kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
    kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
    kubectl create -f ./pipelines/tfsimple-combined.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended created
    pipeline.mlops.seldon.io/tfsimple-extended2 created
    pipeline.mlops.seldon.io/tfsimple-combined created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended condition met
    pipeline.mlops.seldon.io/tfsimple-extended2 condition met
    pipeline.mlops.seldon.io/tfsimple-combined condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -H "x-request-id: test-id2" \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --header x-request-id=test-id2 --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    {
    	"model_name": "",
    	"outputs": [
    		{
    			"data": [
    				2,
    				4,
    				6,
    				8,
    				10,
    				12,
    				14,
    				16,
    				18,
    				20,
    				22,
    				24,
    				26,
    				28,
    				30,
    				32
    			],
    			"name": "OUTPUT0",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		},
    		{
    			"data": [
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0
    			],
    			"name": "OUTPUT1",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		}
    	]
    }
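
The same pipeline request can also be issued from Python. This is a minimal sketch, assuming the requests package and the same <INGRESS_IP> placeholder, Host and Seldon-Model headers used in the curl example above.

    import requests

    # Same request as the curl call above; replace <INGRESS_IP> with your ingress address.
    url = "http://<INGRESS_IP>:80/v2/models/tfsimple/infer"
    headers = {
        "Host": "seldon-mesh.inference.seldon",
        "Seldon-Model": "tfsimple.pipeline",
        "x-request-id": "test-id2",
        "Content-Type": "application/json",
    }
    payload = {
        "inputs": [
            {"name": "INPUT0", "datatype": "INT32", "shape": [1, 16], "data": list(range(1, 17))},
            {"name": "INPUT1", "datatype": "INT32", "shape": [1, 16], "data": list(range(1, 17))},
        ]
    }
    response = requests.post(url, json=payload, headers=headers)
    print(response.json())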
    
    seldon pipeline inspect tfsimple
    seldon.default.model.tfsimple1.inputs	test-id2	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
    seldon.default.model.tfsimple1.outputs	test-id2	{"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    seldon.default.pipeline.tfsimple.inputs	test-id2	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
    seldon.default.pipeline.tfsimple.outputs	test-id2	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    
    seldon pipeline inspect tfsimple-extended --offset 2 --verbose
    seldon.default.model.tfsimple2.inputs	test-id2	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}		x-request-id=[test-id2]	x-forwarded-proto=[http]	x-seldon-route=[:tfsimple1_1:]	x-envoy-upstream-service-time=[1]	pipeline=[tfsimple-extended]	traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-e468d06afdab8f52-01]	x-envoy-expected-rq-timeout-ms=[60000]	x-request-id=[test-id]	x-forwarded-proto=[http]	x-envoy-upstream-service-time=[5]	x-seldon-route=[:tfsimple1_1:]
    seldon.default.model.tfsimple2.outputs	test-id2	{"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}		x-envoy-expected-rq-timeout-ms=[60000]	x-request-id=[test-id2]	x-forwarded-proto=[http]	x-seldon-route=[:tfsimple1_1: :tfsimple2_1:]	x-envoy-upstream-service-time=[1]	pipeline=[tfsimple-extended]	traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-73bd1ee54a94d8fb-01]
    seldon.default.pipeline.tfsimple-extended.inputs	test-id	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}		pipeline=[tfsimple-extended]	traceparent=[00-3a6047efa647efc2b3fc5266ae023d23-fee12926788ce3b6-01]	x-envoy-expected-rq-timeout-ms=[60000]	x-request-id=[test-id]	x-forwarded-proto=[http]	x-envoy-upstream-service-time=[5]	x-seldon-route=[:tfsimple1_1:]
    seldon.default.pipeline.tfsimple-extended.inputs	test-id2	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}		x-forwarded-proto=[http]	x-seldon-route=[:tfsimple1_1:]	x-envoy-upstream-service-time=[1]	pipeline=[tfsimple-extended]	traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-4df8459a992e0278-01]	x-envoy-expected-rq-timeout-ms=[60000]	x-request-id=[test-id2]
    seldon.default.pipeline.tfsimple-extended.outputs	test-id	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}		pipeline=[tfsimple-extended]	traceparent=[00-3a6047efa647efc2b3fc5266ae023d23-b2f899a739c5cafd-01]	x-envoy-expected-rq-timeout-ms=[60000]	x-request-id=[test-id]	x-forwarded-proto=[http]	x-envoy-upstream-service-time=[5]	x-seldon-route=[:tfsimple1_1: :tfsimple2_1:]
    seldon.default.pipeline.tfsimple-extended.outputs	test-id2	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}		x-envoy-upstream-service-time=[1]	pipeline=[tfsimple-extended]	traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-dfa399143feec23d-01]	x-envoy-expected-rq-timeout-ms=[60000]	x-request-id=[test-id2]	x-forwarded-proto=[http]	x-seldon-route=[:tfsimple1_1: :tfsimple2_1:]
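
The rawInputContents entries shown by the verbose inspect output are base64-encoded little-endian INT32 tensors. A small decoding sketch in plain Python, using the first value from the output above:

    import base64
    import struct

    # First rawInputContents entry above: the INPUT0 tensor passed to tfsimple2.
    raw = "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="
    buf = base64.b64decode(raw)
    values = struct.unpack("<16i", buf)  # 16 little-endian int32 values
    print(values)  # (2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32)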
    
    seldon pipeline inspect tfsimple-extended2 --offset 2
    seldon.default.pipeline.tfsimple-extended2.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended2.inputs	test-id	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended2.outputs	test-id3	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    seldon.default.pipeline.tfsimple-extended2.outputs	test-id	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    
    seldon pipeline inspect tfsimple-combined
    seldon.default.model.tfsimple2.inputs	test-id	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
    seldon.default.model.tfsimple2.outputs	test-id	{"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    seldon.default.pipeline.tfsimple-combined.inputs	test-id	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
    seldon.default.pipeline.tfsimple-combined.outputs	test-id	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
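
The doubled values above follow from the arithmetic of the underlying model: the Triton simple example model returns the element-wise sum of its two inputs as OUTPUT0 and their difference as OUTPUT1. A plain Python sketch of the data flow through the three pipelines, assuming that behaviour:

    # Assumed model behaviour: OUTPUT0 = INPUT0 + INPUT1, OUTPUT1 = INPUT0 - INPUT1.
    def simple(in0, in1):
        return [a + b for a, b in zip(in0, in1)], [a - b for a, b in zip(in0, in1)]

    inp = list(range(1, 17))            # original request: INPUT0 == INPUT1 == 1..16
    sum0, diff0 = simple(inp, inp)      # tfsimple: OUTPUT0 = 2..32, OUTPUT1 = all zeros
    ext0, _ = simple(sum0, diff0)       # tfsimple-extended: OUTPUT0 = 2..32
    _, ext2_1 = simple(sum0, diff0)     # tfsimple-extended2: OUTPUT1 = 2..32
    out0, out1 = simple(ext0, ext2_1)   # tfsimple-combined joins the two tensors
    print(out0)                         # [4, 8, 12, ..., 64]
    print(out1)                         # [0, 0, ..., 0]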
    
    kubectl delete -f ./pipelines/tfsimple-combined.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    cat ./pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    cat ./pipelines/tfsimple-extended.yaml
    echo "---"
    cat ./pipelines/tfsimple-extended2.yaml
    echo "---"
    cat ./pipelines/tfsimple-combined-trigger.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended
    spec:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended2
    spec:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-combined-trigger
    spec:
      input:
        externalInputs:
          - tfsimple-extended.outputs
        externalTriggers:
          - tfsimple-extended2.outputs
        tensorMap:
          tfsimple-extended.outputs.OUTPUT0: INPUT0
          tfsimple-extended.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    
    kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
    kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
    kubectl create -f ./pipelines/tfsimple-combined-trigger.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended created
    pipeline.mlops.seldon.io/tfsimple-extended2 created
    pipeline.mlops.seldon.io/tfsimple-combined-trigger created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended condition met
    pipeline.mlops.seldon.io/tfsimple-extended2 condition met
    pipeline.mlops.seldon.io/tfsimple-combined-trigger condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -H "x-request-id: test-id3" \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --header x-request-id=test-id3 --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    {
    	"model_name": "",
    	"outputs": [
    		{
    			"data": [
    				2,
    				4,
    				6,
    				8,
    				10,
    				12,
    				14,
    				16,
    				18,
    				20,
    				22,
    				24,
    				26,
    				28,
    				30,
    				32
    			],
    			"name": "OUTPUT0",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		},
    		{
    			"data": [
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0
    			],
    			"name": "OUTPUT1",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		}
    	]
    }
    
    seldon pipeline inspect tfsimple
    seldon.default.model.tfsimple1.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
    seldon.default.model.tfsimple1.outputs	test-id3	{"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    seldon.default.pipeline.tfsimple.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
    seldon.default.pipeline.tfsimple.outputs	test-id3	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    
    seldon pipeline inspect tfsimple-extended --offset 2
    seldon.default.model.tfsimple2.outputs	test-id3	{"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    seldon.default.pipeline.tfsimple-extended.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended.outputs	test-id3	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    seldon.default.pipeline.tfsimple-extended.outputs	test-id3	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    
    seldon pipeline inspect tfsimple-extended2 --offset 2
    seldon.default.model.tfsimple2.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended2.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended2.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended2.outputs	test-id3	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    seldon.default.pipeline.tfsimple-extended2.outputs	test-id3	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
    
    seldon pipeline inspect tfsimple-combined-trigger
    seldon.default.model.tfsimple2.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
    seldon.default.model.tfsimple2.outputs	test-id3	{"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
    seldon.default.pipeline.tfsimple-combined-trigger.inputs	test-id3	{"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
    seldon.default.pipeline.tfsimple-combined-trigger.outputs	test-id3	{"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
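
In this variant tfsimple-extended2 only gates execution: the externalTriggers entry makes tfsimple-combined-trigger wait for its output, but all of the data comes from tfsimple-extended through the tensorMap. The expected result is therefore the extended output added to itself, consistent with the inspect output above (plain Python sketch, same model assumption as before):

    ext = list(range(2, 33, 2))                  # tfsimple-extended OUTPUT0 (== OUTPUT1)
    out0 = [a + b for a, b in zip(ext, ext)]     # tfsimple2 sum -> [4, 8, ..., 64]
    out1 = [a - b for a, b in zip(ext, ext)]     # tfsimple2 difference -> all zeros
    print(out0, out1)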
    
    kubectl delete -f ./pipelines/tfsimple-combined-trigger.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    cat ./pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    cat ./pipelines/tfsimple-extended-step.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended-step
    spec:
      input:
        externalInputs:
          - tfsimple.step.tfsimple1.outputs
        tensorMap:
          tfsimple.step.tfsimple1.outputs.OUTPUT0: INPUT0
          tfsimple.step.tfsimple1.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    
    kubectl create -f ./pipelines/tfsimple-extended-step.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended-step created
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended-step condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
    {
    	"model_name": "",
    	"outputs": [
    		{
    			"data": [
    				2,
    				4,
    				6,
    				8,
    				10,
    				12,
    				14,
    				16,
    				18,
    				20,
    				22,
    				24,
    				26,
    				28,
    				30,
    				32
    			],
    			"name": "OUTPUT0",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		},
    		{
    			"data": [
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0,
    				0
    			],
    			"name": "OUTPUT1",
    			"shape": [
    				1,
    				16
    			],
    			"datatype": "INT32"
    		}
    	]
    }
    
    seldon pipeline inspect tfsimple --verbose
    seldon.default.model.tfsimple1.inputs	cg5g6ogfh5ss73a44vvg	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]}		pipeline=[tfsimple]	traceparent=[00-2c66ff815d920ad238365be52a4467f5-90824e4cb70c3242-01]	x-forwarded-proto=[http]	x-envoy-expected-rq-timeout-ms=[60000]	x-request-id=[cg5g6ogfh5ss73a44vvg]
    seldon.default.model.tfsimple1.outputs	cg5g6ogfh5ss73a44vvg	{"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}		x-request-id=[cg5g6ogfh5ss73a44vvg]	pipeline=[tfsimple]	x-envoy-upstream-service-time=[8]	x-seldon-route=[:tfsimple1_1:]	traceparent=[00-2c66ff815d920ad238365be52a4467f5-ca023a540fa463b3-01]	x-forwarded-proto=[http]	x-envoy-expected-rq-timeout-ms=[60000]
    seldon.default.pipeline.tfsimple.inputs	cg5g6ogfh5ss73a44vvg	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]}		pipeline=[tfsimple]	x-request-id=[cg5g6ogfh5ss73a44vvg]	traceparent=[00-2c66ff815d920ad238365be52a4467f5-843d6ce39292396d-01]	x-forwarded-proto=[http]	x-envoy-expected-rq-timeout-ms=[60000]
    seldon.default.pipeline.tfsimple.outputs	cg5g6ogfh5ss73a44vvg	{"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}		x-envoy-expected-rq-timeout-ms=[60000]	x-request-id=[cg5g6ogfh5ss73a44vvg]	x-envoy-upstream-service-time=[8]	x-seldon-route=[:tfsimple1_1:]	pipeline=[tfsimple]	traceparent=[00-2c66ff815d920ad238365be52a4467f5-ee7527353e9fe5a2-01]	x-forwarded-proto=[http]
    
    seldon pipeline inspect tfsimple-extended-step
    seldon.default.model.tfsimple2.inputs	cg5g6ogfh5ss73a44vvg	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.model.tfsimple2.outputs	cg5g6ogfh5ss73a44vvg	{"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
    seldon.default.pipeline.tfsimple-extended-step.inputs	cg5g6ogfh5ss73a44vvg	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended-step.outputs	cg5g6ogfh5ss73a44vvg	{"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
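
The step-level reference tfsimple.step.tfsimple1.outputs makes this pipeline consume the output of the tfsimple1 step directly rather than the tfsimple pipeline output. Below is a hypothetical helper (not part of the CLI) illustrating how these references appear to map onto the Kafka topic names shown by seldon pipeline inspect, assuming the seldon.<namespace>.<model|pipeline>.<name>.<inputs|outputs> convention visible above:

    def to_topic(reference: str, namespace: str = "default") -> str:
        # e.g. "tfsimple.step.tfsimple1.outputs" -> "seldon.default.model.tfsimple1.outputs"
        parts = reference.split(".")
        if len(parts) == 4 and parts[1] == "step":
            _, _, model, direction = parts
            return f"seldon.{namespace}.model.{model}.{direction}"
        # e.g. "tfsimple.outputs" -> "seldon.default.pipeline.tfsimple.outputs"
        pipeline, direction = parts[0], parts[-1]
        return f"seldon.{namespace}.pipeline.{pipeline}.{direction}"

    print(to_topic("tfsimple.step.tfsimple1.outputs"))  # seldon.default.model.tfsimple1.outputs
    print(to_topic("tfsimple.outputs"))                  # seldon.default.pipeline.tfsimple.outputs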
    
    kubectl delete -f ./pipelines/tfsimple-extended-step.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
    cat ./models/tfsimple1.yaml
    cat ./models/tfsimple2.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple1
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: tfsimple2
    spec:
      storageUri: "gs://seldon-models/triton/simple"
      requirements:
      - tensorflow
      memory: 100Ki
    
    kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 created
    model.mlops.seldon.io/tfsimple2 created
    kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
    model.mlops.seldon.io/tfsimple1 condition met
    model.mlops.seldon.io/tfsimple2 condition met
    cat ./pipelines/tfsimple.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple
    spec:
      steps:
        - name: tfsimple1
      output:
        steps:
        - tfsimple1
    
    kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    cat ./pipelines/tfsimple-extended.yaml
    echo "---"
    cat ./pipelines/tfsimple-extended2.yaml
    echo "---"
    cat ./pipelines/tfsimple-combined-step.yaml
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended
    spec:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-extended2
    spec:
      input:
        externalInputs:
          - tfsimple.outputs
        tensorMap:
          tfsimple.outputs.OUTPUT0: INPUT0
          tfsimple.outputs.OUTPUT1: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    ---
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Pipeline
    metadata:
      name: tfsimple-combined-step
    spec:
      input:
        externalInputs:
          - tfsimple-extended.step.tfsimple2.outputs.OUTPUT0
          - tfsimple-extended2.step.tfsimple2.outputs.OUTPUT0
        tensorMap:
          tfsimple-extended.step.tfsimple2.outputs.OUTPUT0: INPUT0
          tfsimple-extended2.step.tfsimple2.outputs.OUTPUT0: INPUT1
      steps:
        - name: tfsimple2
      output:
        steps:
        - tfsimple2
    
    kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh 
    kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh 
    kubectl create -f ./pipelines/tfsimple-combined-step.yaml -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended created
    pipeline.mlops.seldon.io/tfsimple-extended2 created
    pipeline.mlops.seldon.io/tfsimple-combined-step created
    
    kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
    pipeline.mlops.seldon.io/tfsimple-extended condition met
    pipeline.mlops.seldon.io/tfsimple-extended2 condition met
    pipeline.mlops.seldon.io/tfsimple-combined-step condition met
    curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
      -H "Host: seldon-mesh.inference.seldon" \
      -H "Content-Type: application/json" \
      -H "Seldon-Model: tfsimple.pipeline" \
      -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          },
          {
            "name": "INPUT1",
            "datatype": "INT32",
            "shape": [1,16],
            "data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
          }
        ]
      }' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
        '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
    {
      "model_name": "",
      "outputs": [
        {
          "data": [
            2,
            4,
            6,
            8,
            10,
            12,
            14,
            16,
            18,
            20,
            22,
            24,
            26,
            28,
            30,
            32
          ],
          "name": "OUTPUT0",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        },
        {
          "data": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "name": "OUTPUT1",
          "shape": [
            1,
            16
          ],
          "datatype": "INT32"
        }
      ]
    }
    
    seldon pipeline inspect tfsimple
    seldon.default.model.tfsimple1.inputs	cg5g710fh5ss73a4500g	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]}
    seldon.default.model.tfsimple1.outputs	cg5g710fh5ss73a4500g	{"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
    seldon.default.pipeline.tfsimple.inputs	cg5g710fh5ss73a4500g	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]}
    seldon.default.pipeline.tfsimple.outputs	cg5g710fh5ss73a4500g	{"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
    
    seldon pipeline inspect tfsimple-extended
    seldon.default.model.tfsimple2.inputs	cg5g710fh5ss73a4500g	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
    seldon.default.model.tfsimple2.outputs	cg5g710fh5ss73a4500g	{"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
    seldon.default.pipeline.tfsimple-extended.inputs	cg5g710fh5ss73a4500g	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended.outputs	cg5g710fh5ss73a4500g	{"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
    
    seldon pipeline inspect tfsimple-extended2
    seldon.default.model.tfsimple2.inputs	cg5g710fh5ss73a4500g	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
    seldon.default.model.tfsimple2.outputs	cg5g710fh5ss73a4500g	{"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
    seldon.default.pipeline.tfsimple-extended2.inputs	cg5g710fh5ss73a4500g	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
    seldon.default.pipeline.tfsimple-extended2.outputs	cg5g710fh5ss73a4500g	{"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
    
    seldon pipeline inspect tfsimple-combined-step
    seldon.default.model.tfsimple2.inputs	cg5g710fh5ss73a4500g	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
    seldon.default.model.tfsimple2.outputs	cg5g710fh5ss73a4500g	{"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
    seldon.default.pipeline.tfsimple-combined-step.inputs	cg5g710fh5ss73a4500g	{"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
    seldon.default.pipeline.tfsimple-combined-step.outputs	cg5g710fh5ss73a4500g	{"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
    
    kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple-combined-step.yaml -n seldon-mesh
    kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
    kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh