Explore the enterprise MLOps capabilities of Core 2 for production ML deployment, featuring model serving, pipeline orchestration, and intelligent resource management for scalable ML systems.
After models are deployed, Core 2 enables monitoring and experimentation on those systems in production. With support for a wide range of model types, and design patterns to build around those models, you can standardize ML deployment across a range of use cases in the cloud or on the on-premise serving infrastructure of your choice.
Seldon Core 2 orchestrates and scales machine learning components running as production-grade microservices. These components can be deployed locally or in enterprise-scale Kubernetes clusters. The components of your ML system - such as models, processing steps, custom logic, or monitoring methods - are deployed as Models, leveraging serving solutions compatible with Core 2 such as MLServer, Alibi, the LLM Module, or Triton Inference Server. These serving solutions package the required dependencies and standardize inference using the Open Inference Protocol. This ensures that, regardless of your model types and use cases, all requests and responses follow a unified format. After models are deployed, they can process REST or gRPC requests for real-time inference.
Machine learning applications are increasingly complex. They've evolved from individual models deployed as services, to complex applications that can consist of multiple models, processing steps, custom logic, and asynchronous monitoring components. With Core you can build Pipelines that connect any of these components to make data-centric applications. Core 2 handles orchestration and scaling of the underlying components of such an application, and exposes the data streamed through the application in real time using Kafka.
This approach to MLOps, influenced by our position paper, enables real-time observability, insight, and control over the behavior and performance of your ML systems.
Lastly, Core 2 provides Experiments as part of its orchestration capabilities, enabling users to implement routing logic such as A/B tests or Canary deployments to models or pipelines in production. After experiments are run, you can promote new models and/or pipelines, or launch new experiments, so that you can continuously improve the performance of your ML applications.
In Seldon Core 2 your models are deployed on inference servers, which are software components that manage the packaging and execution of ML workloads. As part of its design, Core 2 separates Servers and Models into separate resources. This approach enables flexible allocation of models to servers, aligned with the requirements of your models and with the underlying infrastructure that you want your servers to run on. Core 2 also provides functionality to autoscale your models and servers up and down as needed, based on your workload requirements or user-defined metrics.
With the modular design of Core 2, users are able to implement cutting-edge methods to minimize hardware costs:
Multi-Model serving consolidates multiple models onto shared inference servers to optimize resource utilization and decrease the number of servers required.
Over-commit allows you to provision more models than the available memory would normally allow by dynamically loading and unloading models from memory to disk based on demand.
Core 2 demonstrates the power of a standardized, data-centric approach to MLOps at scale, ensuring that data observability and management are prioritized across every layer of machine learning operations. Furthermore, Core 2 seamlessly integrates into end-to-end MLOps workflows, from CI/CD and traffic management with the service mesh of your choice to alerting, data visualization, and authentication and authorization.
This modular, flexible architecture not only supports diverse deployment patterns but also ensures compatibility with the latest AI innovations. By embedding data-centricity and adaptability into its foundation, Core 2 equips organizations to scale and improve their machine learning systems effectively, to capture value from increasingly complex AI systems.
Discover Seldon Core 2, a Kubernetes-native MLOps framework for deploying ML and LLM systems at scale. Features flexible architecture, standardized workflows, and enhanced observability.
Seldon Core 2 is a Kubernetes-native framework for deploying and managing machine learning (ML) and Large Language Model (LLM) systems at scale. Its data-centric approach and modular architecture enable seamless handling of simple models to complex ML applications across on-premise, hybrid, and multi-cloud environments while ensuring flexibility, standardization, observability, and cost efficiency.
Seldon Core 2 offers a platform-agnostic, flexible framework for seamless deployment of different types of ML models across on-premise, cloud, and hybrid environments. Its adaptive architecture enables customizable applications, future-proofing MLOps or LLMOps by scaling deployments as data and applications evolve. The modular design enhances resource-efficiency, allowing dynamic scaling, component reuse, and optimized resource allocation. This ensures long-term scalability, operational efficiency, and adaptability to changing demands.
Seldon Core 2 enforces best practices for ML deployment, ensuring consistency, reliability, and efficiency across the entire lifecycle. By automating key deployment steps, it removes operational bottlenecks, enabling faster rollouts and allowing teams to focus on high-value tasks.
With a "learn once, deploy anywhere" approach, Seldon Core 2 standardizes model deployment across on-premise, cloud, and hybrid environments, reducing risk and improving productivity. Its unified execution framework supports conventional, foundational, and LLM models, streamlining deployment and enabling seamless scalability. It also enhances collaboration between MLOps Engineers, Data Scientists, and Software Engineers by providing a customizable framework that fosters knowledge sharing, innovation, and the adoption of new data science capabilities.
Observability in Seldon Core 2 enables real-time monitoring, analysis, and performance tracking of ML systems, covering data pipelines, models, and deployment environments. Its customizable monitoring framework ensures teams have the key metrics needed for maintenance and decision-making.
Seldon simplifies operational monitoring, allowing real-time ML or LLM deployments to expand across organizations while supporting complex, mission-critical use cases. A data-centric design ensures all prediction data is auditable, maintaining explainability, compliance, and trust in AI-driven decisions.
Seldon Core 2 is built for scalability, efficiency, and cost-effective ML operations, enabling you to deploy only the necessary components while maintaining agility and high performance. Its modular architecture ensures that resources are optimized, infrastructure is consolidated, and deployments remain adaptable to evolving business needs.
Scaling for Efficiency: scales infrastructure dynamically based on demand, auto-scaling for real-time use cases while scaling to zero for on-demand workloads, and preserving deployment state for seamless reactivation. By eliminating redundancy and optimizing deployments, it balances cost efficiency and performance, ensuring reliable inference at any scale.
Consolidated Serving Infrastructure: maximizes resource utilization with multi-model serving (MMS) and overcommit, reducing compute overhead while ensuring efficient, reliable inference.
Extendability & Modular Adaptation: integrates with LLMs, Alibi, and other modules, enabling on-demand ML expansion. Its modular design ensures scalable AI, maximizing value extraction, agility, and cost efficiency.
Reusability for Cost Optimization: provides predictable, fixed pricing, enabling cost-effective scaling and innovation while ensuring financial transparency and flexibility.



Learn how Seldon Core is different from a centralized orchestrator to a data flow architecture, enabling more efficient ML model deployment through stream processing and improved data handling.
In the context of machine learning and Seldon Core 2, concepts provide a framework for understanding key functionalities, architectures, and workflows within the system. Some of the key concepts in Seldon Core 2 are:
Data-centricity is an approach that puts the management, integrity, and flow of data at the core of machine learning deployment. Rather than focusing solely on models, this approach ensures that data quality, consistency, and adaptability drive successful ML operations. In Seldon Core 2, data-centricity is embedded in every stage of the inference workflow.
Why Data-Centricity Matters
By adopting a data-centric approach, Seldon Core 2 enables:
More reliable and high-quality predictions by ensuring clean, well-structured data.
Scalable and future-proof ML deployments through standardized data management.
Efficient monitoring and maintenance, reducing risks related to model drift and inconsistencies.
With data-centricity as a core principle, Seldon Core 2 ensures end-to-end control over ML workflows, enabling you to maximize model performance and reliability in production environments.
The Open Inference Protocol (OIP) defines a standard way for inference servers and clients to communicate in Seldon Core 2. Its goal is to enable interoperability, flexibility, and consistency across different model-serving runtimes. It exposes the Health API, Metadata API, and the Inference API.
Some of the features of the Open Inference Protocol include:
Transport Agnostic: Servers can implement HTTP/REST or gRPC protocols.
Runtime Awareness: Use the protocolVersion field in your runtime YAML or consult the supported runtimes table to verify compatibility.
By adopting OIP, Seldon Core 2 promotes a consistent experience across a diverse set of model deployments.
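For illustration, here is a minimal sketch of an OIP-compliant REST inference request; the host, port, and model name (iris) are placeholders for your own deployment, and the gRPC variant of the protocol uses the same schema.
# Placeholder host/port and model name; adjust to your own deployment.
curl -s -X POST http://localhost:9000/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}
    ]
  }'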
Components are the building blocks of an inference graph, processing data at various stages of the ML inference pipeline. They provide reusable, standardized interfaces, making it easier to maintain and update workflows without disrupting the entire system. Components include ML models, data processors, routers, and supplementary services.
Types of Components
By modularizing inference workflows, components allow you to scale, experiment, and optimize ML deployments efficiently while ensuring data consistency and reliability.
In a model serving platform like Seldon Core 2, a pipeline refers to an orchestrated sequence of models and components that work together to serve more complex AI applications. These pipelines allow you to connect models, routers, transformers, and other inference components, with Kafka used to stream data between them. Each component in the pipeline is modular and independently managed—meaning it can be scaled, configured, and updated separately based on its specific input and output requirements.
This use of "pipeline" is distinct from how the term is used in MLOps for CI/CD pipelines, which automate workflows for building, testing, and deploying models. In contrast, Core 2 pipelines operate at runtime and focus on the live composition and orchestration of inference systems in production.
In Core 2, servers are responsible for hosting and serving machine learning models, handling inference requests, and ensuring scalability, efficiency, and observability in production. Core 2 supports multiple inference servers, including MLServer and NVIDIA Triton Inference Server, enabling flexible and optimized model deployments.
MLServer: A lightweight, extensible inference server designed to work with multiple ML frameworks, including scikit-learn, XGBoost, TensorFlow, and PyTorch. It supports custom Python models and integrates well with MLOps workflows. It is built for general-purpose model serving, custom model wrappers, and multi-framework support.
In Seldon Core 2, experiments enable controlled A/B testing, model comparisons, and performance evaluations by defining an HTTP traffic split between different models or inference pipelines. This allows organizations to test multiple versions of a model in production while managing risk and ensuring continuous improvements.
Some of the advantages of using Experiments:
A/B Testing & Model comparison: Compare different models under real-world conditions without full deployment.
Risk-Free model validation: Test a new model or pipeline in parallel without affecting live predictions.
Performance monitoring: Assess latency, accuracy, and reliability before a full rollout.
Continuous improvement: Make data-driven deployment decisions based on real-time model performance.
Learn about the Seldon Core 2 microservice architecture, control plane and data plane components, and how these services work to provide a scalable, fault-tolerant ML model serving and management.
Seldon Core 2 uses a microservice architecture where each service has limited and well-defined responsibilities working together to orchestrate scalable and fault-tolerant ML serving and management. These components communicate internally using gRPC and they can be scaled independently. Seldon Core 2 services can be split into two categories:
Control Plane services are responsible for managing the operations and configurations of your ML models and workflows. This includes functionality to instantiate new inference servers, load models, update new versions of models, configure model experiments and pipelines, and expose endpoints that may receive inference requests. The main control plane component is the Scheduler that is responsible for managing the loading and unloading of resources (models, pipelines, experiments) onto the respective components.
Data Plane services are responsible for managing the flow of data between components or models. Core 2 supports REST and gRPC payloads that follow the Open Inference Protocol (OIP).
The current set of services used in Seldon Core 2 is shown below. Following the diagram, we describe each control plane and data plane service.
Scheduler: This service manages the loading and unloading of Models, Pipelines, and Experiments on the relevant microservices. It is also responsible for matching Models with available Servers in a way that optimises infrastructure use. In the current design we can only have one instance of the Scheduler, as its internal state is persisted on disk.
When the Scheduler (re)starts there is a synchronisation flow to coordinate the startup process and to attempt to wait for expected Model Servers to connect before proceeding with control plane operations. This is important so that ongoing data plane operations are not interrupted. This introduces a delay on any control plane operations until the process has finished (including control plane resource status updates). This synchronisation process has a timeout, which defaults to 10 minutes. It can be changed by setting the scheduler.schedulerReadyTimeoutSeconds value of the seldon-core-v2-components Helm chart.
Agent: This service manages the loading and unloading of models on a server and access to the server over REST/gRPC. It acts as a reverse proxy that connects end users with the actual model servers. In this way the system collects stats and metrics about data plane inferences that help with observability and scaling.
We also provide a Kubernetes Operator to allow Kubernetes usage. This is implemented in the Controller Manager microservice, which manages CRD reconciliation with Scheduler. Currently Core 2 supports one instance of the Controller.
Pipeline gateway: This service handles REST/gRPC calls to Pipelines. It translates synchronous requests into Kafka operations, producing a message on the relevant input topic for a Pipeline and consuming from the output topic to return inference results back to the users.
Model gateway: This service handles the flow of data for Models, pulling inference requests from Kafka, forwarding them to the inference servers, and passing on the responses via Kafka.
Dataflow engine: This service handles the flow of data between components in a pipeline, using Kafka Streams. It enables Core 2 to chain and join Models together to provide complex Pipelines.
Envoy acts as the request proxy for Seldon Core 2, routing traffic to the appropriate inference servers and handling load balancing across available replicas. Envoy configuration of Seldon Core 2 uses weighted least-request load balancing, which dynamically distributes traffic based on both replica weights and current load, helping ensure efficient and stable request routing.
To support the movement towards data centric machine learning Seldon Core 2 follows a dataflow paradigm. By taking a decentralized route that focuses on the flow of data, users can have more flexibility and insight as they build and manage complex AI applications in production. This contrasts with more centralized orchestration approaches where data is secondary.
Kafka is used as the backbone for Pipelines allowing decentralized, synchronous and asynchronous usage. This enables Models to be connected together into arbitrary directed acyclic graphs. Models can be reused in different Pipelines. The flow of data between models is handled by the dataflow engine using Kafka Streams.
By focusing on the data we allow users to join various flows together using stream joining concepts as shown below.
We support several types of joins:
inner joins, where all inputs need to be present for a transaction to join the tensors passed through the Pipeline;
outer joins, where only a subset needs to be available during the join window;
triggers, in which data flows need to wait until records on one or more trigger data flows appear. The data in these triggers is not passed onwards from the join.
These techniques allow users to create complex pipeline flows of data between machine learning components.
More discussion on the data flow view of machine learning and its effect on v2 design can be found here.
Key principles:
Flexible Workflows: Core 2 supports adaptable and scalable data pathways, accommodating various use cases and experiments. This ensures ML pipelines remain agile, allowing you to evolve the inference logic as requirements change.
Real-Time Data Streaming: Integrated data streaming capabilities allow you to view, store, manage, and process data in real time. This enhances responsiveness and decision-making, ensuring models work with the most up-to-date data for accurate predictions.
Standardized Processing: Core 2 promotes reusable and consistent data transformation and routing mechanisms. Standardized processing ensures data integrity and uniformity across applications, reducing errors and inconsistencies.
Comprehensive Monitoring: Detailed metrics and logs provide real-time visibility into data integrity, transformations, and flow. This enables effective oversight and maintenance, allowing teams to detect anomalies, drifts, or inefficiencies early.
Component types:
Data Processors: Transform, filter, or aggregate data to ensure consistent, repeatable pre-processing.
Data Routers: Dynamically route data to different paths based on predefined rules for A/B testing, experimentation, or conditional logic.
Models: Perform inference tasks, including classification, regression, and Large Language Models (LLMs), hosted internally or via external APIs.
Supplementary Data Services: External services like vector databases that enable models to access embeddings and extended functionality.
Drift/Outlier Detectors & Explainers: Monitor model predictions for drift, anomalies, and explainability insights, ensuring transparency and performance tracking.
Experiment types:
Traffic Splitting: Distributes inference requests across different models or pipelines based on predefined percentage splits. This enables A/B testing and comparison of multiple model versions, for example a canary deployment of a new model version.
Mirror Testing: Sends a percentage of the traffic to a mirror model or pipeline without affecting the response returned to users. This allows evaluation of new models without impacting production workflows, for example a shadow deployment of a new model version.

Learn how to configure and use Rclone for model artifact storage in Seldon Core, including cloud storage integration and authentication.
We utilize Rclone to copy model artifacts from a storage location to the model servers. This allows users to take advantage of Rclone's support for over 40 cloud storage backends, including Amazon S3, Google Cloud Storage, and many others.
For authorization needed for cloud storage when running on Kubernetes see here.
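As an illustrative sketch (the bucket names and paths are placeholders, not real artifacts), the same Model spec can point at different Rclone-supported backends simply by changing the storageUri scheme:
# Hypothetical bucket/path shown for illustration only.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: example-model
spec:
  # Google Cloud Storage; an S3 alternative would be e.g. "s3://my-bucket/models/example"
  storageUri: "gs://my-bucket/models/example"
  requirements:
  - sklearn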
Learn how to install Seldon Core 2 in various environments - from local development with Docker Compose to production-grade Kubernetes clusters.
Seldon Core 2 can be installed in various setups to suit different stages of the development lifecycle. The most common modes include:
Ideal for development and testing purposes, a local setup allows for quick iteration and experimentation with minimal overhead. Common tools include:
Docker Compose: Simplifies deployment by orchestrating Seldon Core components and dependencies in Docker containers. Suitable for environments without Kubernetes, providing a lightweight alternative.
Kind (Kubernetes IN Docker): Runs a Kubernetes cluster inside Docker, offering a realistic testing environment. Ideal for experimenting with Kubernetes-native features.
Designed for high-availability and scalable deployments, a production setup ensures security, reliability, and resource efficiency. Typical tools and setups include:
Managed Kubernetes Clusters: Platforms like GKE (Google Kubernetes Engine), EKS (Amazon Elastic Kubernetes Service), and AKS (Azure Kubernetes Service) provide managed Kubernetes solutions. Suitable for enterprises requiring scalability and cloud integration.
On-Premises Kubernetes Clusters: For organizations with strict compliance or data sovereignty requirements. Can be deployed on platforms like OpenShift or custom Kubernetes setups.
By selecting the appropriate installation mode—whether it's Docker Compose for simplicity, Kind for local Kubernetes experimentation, or production-grade Kubernetes for scalability—you can effectively leverage Seldon Core 2 to meet your specific needs.
For more information, see the published Helm charts.
For a description of (some) values that can be configured for these charts, see the Helm chart documentation.
Here is a list of components that Seldon Core 2 requires, along with the minimum and maximum supported versions:

Component | Minimum Version | Maximum Version | Notes
Kubernetes | 1.27 | 1.33.0 | Required
Envoy* | 1.32.2 | 1.32.2 | Required (included in the Core 2 installation)
Rclone* | 1.68.2 | 1.69.0 | Required (included in the Core 2 installation)
Kafka | 3.4 | 3.8 | Recommended (only required for operating Seldon Core 2 dataflow Pipelines)
Prometheus | 2.0 | 2.x | Optional
Grafana | 10.0 | *** | Optional (no hard limit on the maximum version to be used)
Prometheus-adapter | 0.12 | 0.12 | Optional
OpenTelemetry Collector | 0.68 | *** | Optional (no hard limit on the maximum version to be used)

Notes:
Envoy and Rclone: These components are included as part of the Seldon Core 2 Docker images. You are not required to install them separately, but you must be aware of the configuration options supported by these versions.
Kafka: Only required for operating Seldon Core 2 dataflow Pipelines. If not needed, you should avoid installing seldon-modelgateway, seldon-pipelinegateway, and seldon-dataflow-engine.

Seldon Core 2 is installed with the following Helm charts:
Name of the Helm Chart | Description
seldon-core-v2-crds | Cluster-wide installation of custom resources.
seldon-core-v2-setup | Installation of the manager to manage resources in the namespace or cluster-wide. This also installs default SeldonConfig and ServerConfig resources, allowing Runtimes and Servers to be installed on demand.
seldon-core-v2-runtime | Installs a SeldonRuntime custom resource that creates the core components in a namespace.
seldon-core-v2-servers | Installs Server custom resources providing example core servers to load models.

For Kubernetes usage we provide a set of custom resources for interacting with Seldon:
SeldonRuntime - for installing Seldon in a particular namespace.
Servers - for deploying sets of replicas of core inference servers (MLServer or Triton).
Models - for deploying single machine learning models, custom transformation logic, drift detectors, outlier detectors, and explainers.
Experiments - for testing new versions of models.
Pipelines - for connecting together flows of data between models.
SeldonConfig and ServerConfig define the core installation configuration and machine learning inference server configuration for Seldon. Normally, you would not need to customize these, but this may be required for a custom installation within your organisation.
ServerConfigs - for defining new types of inference server that can be referenced by a Server resource.
SeldonConfig - for defining how Seldon is installed.

Seldon Core provides native autoscaling features for both Models and Servers, enabling automatic scaling based on inference load. The diagram below depicts an autoscaling implementation that uses both Model and Server autoscaling features native to Seldon Core (i.e. this implementation doesn't leverage HPA for autoscaling, an approach covered separately).
Models can automatically scale their replicas based on load. Enable this by setting minReplicas or maxReplicas in your model spec; see the model autoscaling documentation for more detail on this setup.
Server autoscaling automatically scales Servers based on Model needs. This implementation supports scaling in a Multi-Model Serving setup where multiple models are hosted on shared inference servers; see the server autoscaling documentation for more detail on this setup.
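As a minimal sketch of native model autoscaling (the model shown is the iris sample used elsewhere in this documentation; the replica bounds are illustrative), enabling it only requires setting the replica bounds in the Model spec:
# Sketch only: replica bounds enable native model autoscaling.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  replicas: 1       # initial replica count
  minReplicas: 1    # lower bound for scaling
  maxReplicas: 3    # upper bound for scaling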

This page provides guidance about scaling Seldon Core 2 services.
Seldon Core 2 runs with several control plane and data plane components. The scaling of these resources is discussed below:
Pipeline gateway: The pipeline gateway handles REST and gRPC synchronous requests to Pipelines. It is stateless and can be scaled based on traffic demand.
Model gateway: This component pulls model requests from Kafka and sends them to inference servers. It can be scaled up to the partition factor of your Kafka topics. At present we set a uniform partition factor for all topics in one installation of Seldon.
Dataflow engine: The dataflow engine runs KStream topologies to manage Pipelines. It can run as multiple replicas, and the scheduler will balance Pipelines across them with a consistent-hashing load balancer. Each Pipeline is managed up to the partition factor of Kafka (presently hardwired to one). We recommend using as many replicas of the dataflow engine as you have Kafka partitions, in order to leverage the balanced distribution of inference traffic using hashing.
Scheduler: The scheduler manages the control plane operations. It is presently required to be one replica as it maintains internal state within a BadgerDB held on local persistent storage (stateful set in Kubernetes). Performance tests have shown this not to be a bottleneck at present.
Kubernetes Controller: The Kubernetes controller manages resources updates on the cluster which it passes on to the Scheduler. It is by default one replica but has the ability to scale.
Envoy: Envoy replicas get their state from the scheduler for routing information and can be scaled as needed.
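Where these services are created by a SeldonRuntime resource, a hedged sketch of setting per-component replica counts might look like the following (this assumes the SeldonRuntime overrides field; the component names follow the services described above, and the replica counts are illustrative):
# Sketch only: assumes SeldonRuntime supports per-component replica overrides.
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
spec:
  seldonConfig: default
  overrides:
  - name: seldon-pipelinegateway
    replicas: 2
  - name: seldon-modelgateway
    replicas: 2   # up to the Kafka partition factor, as noted above
  - name: seldon-envoy
    replicas: 3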
Install Seldon Core 2 in a local learning environment.
You can install Seldon Core 2 on your local computer that is running a Kubernetes cluster using kind.
Seldon publishes the Helm charts that are required to install Seldon Core 2. For more information about the Helm charts and the related dependencies, see Helm charts and Dependencies.
Install a Kubernetes cluster that is running version 1.27 or later.
Install kubectl, the Kubernetes command-line tool.
Install Helm, the package manager for Kubernetes, or Ansible, the automation tool used for provisioning, configuration management, and application deployment.
Create a namespace to contain the main components of Seldon Core 2. For example, create the seldon-mesh namespace.
Add and update the Helm charts, seldon-charts, to the repository.
Install Custom resource definitions for Seldon Core 2.
If you installed Seldon Core 2 using Helm, you need to complete the installation of other components in the following order:
Learn how to configure multi-model serving in Seldon Core, including resource optimization and model co-location.
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. This means that, within a single instance of the server, you can serve multiple models under different paths. This is a feature provided out of the box by Nvidia Triton and Seldon MLServer, currently the two inference servers that are integrated in Seldon Core 2.
This deployment pattern allows the system to handle a large number of deployed models by letting them share the hardware resources allocated to inference servers (e.g. GPUs). For example, if a single model inference server is deployed on a node with one GPU, the models loaded on this inference server instance are able to effectively share that GPU. This is in contrast to a single-model-per-server deployment pattern, where only one model can use the allocated GPU.
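As a minimal sketch (the server name and replica count are illustrative), a shared MLServer instance that multiple Models can be scheduled onto is declared as a Server resource:
# Sketch of a shared inference server; multiple Models matching its
# capabilities can be loaded onto the same replicas.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  replicas: 2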
Models provide the atomic building blocks of Seldon. They represent machine learning models, drift detectors, outlier detectors, explainers, feature transformations, and more complex routing models such as multi-armed bandits.
Seldon can handle a wide range of inference artifacts.
Artifacts can be stored on any of the 40 or more cloud storage technologies supported by Rclone, as well as in a local (mounted) folder, as discussed earlier.
The Model specification allows parameters to be passed to the loaded model to allow customization. For example:
This capability is only available for MLServer custom model runtimes. The named keys and
values will be added to the model-settings.json file for the provided model in the parameters.extra Dict. MLServer models are able to read these values in their load method.
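For illustration, using the query parameter from the PandasQuery sample manifest shown later (query: "choice == 1"), the values would appear in the generated model-settings.json roughly as follows; this is a sketch, not the complete file:
{
  "name": "choice-is-one",
  "parameters": {
    "extra": {
      "query": "choice == 1"
    }
  }
}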
In MLOps, system performance can be defined as how efficiently and effectively an ML model or application operates in a production environment. It is typically measured across several key dimensions, such as latency, throughput, scalability, and resource-efficiency. These factors are deeply connected: changes in configuration often result in tradeoffs, and should be carefully considered. Specifically, latency, throughput and resource usage can all impact each other, and the approach to optimizing system performance depends on the desired balance of these outcomes in order to ensure a positive end-user experience, while also minimising infrastructure costs.
There are many different levers that can be considered to tune performance for ML systems deployed with Core 2, across infrastructure, models, inference execution, and the related configurations exposed by Core 2. When reasoning about the performance of an ML-based system deployed using Seldon, we recommend breaking down the problem by first understanding and tuning the performance of deployed Models, and then subsequently considering more complex Pipelines composed of those models (if applicable). For both models and pipelines, it is important to run tests to understand baseline performance characteristics, before making efforts to tune these through changes to models, infrastructure, or inference configurations. The recommended approach can be broken down as follows:
Overview of Horizontal Pod Autoscaler (HPA) scaling options in Seldon Core 2
Given that Seldon Core 2 is predominantly for serving ML in Kubernetes, it is possible to leverage the Kubernetes HorizontalPodAutoscaler (HPA) to define scaling logic that automatically scales Kubernetes resources up and down. HPA targets Kubernetes or custom metrics to trigger scale-up or scale-down events for specified resources. Using HPA is recommended if custom scaling metrics are required. These would be exposed using Prometheus or similar tools for exposing metrics to HPA. If these tools cause conflicts, the native Core 2 autoscaling approach, which does not require exposing custom metrics, is recommended.
Seldon Core 2 provides two main approaches to leveraging the Kubernetes Horizontal Pod Autoscaler (HPA) for autoscaling. It is important to remember that, since Models and Servers are separate resources in Core 2, autoscaling of both Models and Servers needs to be coordinated when implementing autoscaling. In order to implement either approach, metrics first need to be exposed; the setup guide covers the fundamental requirements and configuration needed to enable HPA-based scaling in Seldon Core 2.
Custom metrics: allows you to define and track custom metrics specific to your models and use cases.
Visualization: Provides dashboards and visualizations to easily observe the status and performance of models.
There are two kinds of metrics present in Seldon Core 2 that you can monitor:
Operational metrics describe the performance of components in the system. Some examples of common operational considerations are memory consumption and CPU usage, request latency and throughput, and cache utilisation rates. Generally speaking, these are the metrics system administrators, operations teams, and engineers will be interested in.
Usage metrics describe the system at a higher and less dynamic level. Some examples include the number of deployed servers and models, and component versions. These are not typically metrics that engineers need insight into, but may be relevant to platform providers and operations teams.
# samples/models/choice1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-one
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 1"Model, the Scheduler will find an appropriate model inference server instance
to load the model onto.In the below example, given that the model is tensorflow, the system will deploy the model onto atriton server instance (matching with the Server labels). Additionally as the model memory
requirement is 100Ki, the system will pick the server instance that has enough (memory) capacity to
host this model in parallel to other potentially existing models.
All models are loaded and active on this model server. Inference requests for these models are served concurrently and the hardware resources are shared to fulfil these inference requests.
Overcommit allows shared servers to handle more models than can fit in memory. This is done by keeping highly utilized models in memory and evicting other ones to disk using a least-recently-used (LRU) cache mechanism. From a user perspective these models are all registered and "ready" to serve inference requests. If an inference request comes for a model that is unloaded/evicted to disk, the system will reload the model first before forwarding the request to the inference server.
Overcommit is enabled by setting SELDON_OVERCOMMIT_PERCENTAGE on shared servers; it is set by
default at 10%. In other words a given model inference server instance can register models with a
total memory requirement up to MEMORY_REQUEST * ( 1 + SELDON_OVERCOMMIT_PERCENTAGE / 100).
The Seldon Agent (a sidecar next to each model inference server deployment) keeps track of inference request times for the different models. Models are sorted in ascending order of last request time, and this data structure is used to evict the least recently used model in order to make room for another incoming model. This happens in two scenarios:
A new model load request beyond the active memory capacity of the inference server.
An incoming inference request to a registered model that is not loaded in-memory (previously evicted).
This is seamless to users. Specifically, when a model is reloaded onto the inference server to respond to an inference request, the model artifact is cached on disk, which allows a faster reload (no remote artifact fetch). Therefore we expect that the extra latency to reload a model during an inference request is acceptable in many cases (with a lower bound of ~100ms).
Overcommit can be disabled by setting SELDON_OVERCOMMIT_PERCENTAGE to 0 for a given shared server.
Check this notebook for a local example.
A Kubernetes yaml example is shown below for a SKLearn model for iris classification:
Its Kubernetes spec has two core requirements:
A storageUri specifying the location of the artifact. This can be any rclone URI specification.
A requirements list which provides tags that need to be matched by the Server that can run
this artifact type. By default when you install Seldon we provide a set of Servers that cover a
range of artifact types.
You can also load models directly over the scheduler gRPC service. An example using the grpcurl tool is shown below:
The proto buffer definitions for the scheduler are outlined here.
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. It is a feature provided out of the box by Nvidia Triton and Seldon MLServer. Multi-model serving reduces infrastructure hardware requirements (e.g. expensive GPUs) which enables the deployment of a large number of models while making it efficient to operate the system at scale.
Seldon Core 2 leverages multi-model serving by design and it is the default option for deploying
models. The system will find an appropriate server to load the model onto based on requirements that
the user defines in the Model deployment definition.
Moreover, in many cases demand patterns allow for further Overcommit of resources. Seldon Core 2 is able to register more models than what can be served by the provisioned (memory) infrastructure and will swap models dynamically according to least used without adding significant latency overheads to inference workload.
See Multi-model serving for more information.
See here for discussion of autoscaling of models.
See here for details on how Core 2 schedules Models onto Servers.
# samples/models/tfsimple1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
grpcurl -d '{"model":{ \
"meta":{"name":"iris"},\
"modelSpec":{"uri":"gs://seldon-models/mlserver/iris",\
"requirements":["sklearn"],\
"memoryBytes":500},\
"deploymentSpec":{"replicas":1}}}' \
-plaintext \
-import-path ../../apis \
-proto apis/mlops/scheduler/scheduler.proto 0.0.0.0:9004 seldon.mlops.scheduler.Scheduler/LoadModel
Install the Seldon Core 2 operator in the seldon-mesh namespace.
This configuration installs the Seldon Core 2 operator across an entire Kubernetes cluster. To perform cluster-wide operations, create ClusterRoles and ensure your user has the necessary permissions during deployment. With cluster-wide operations, you can create SeldonRuntimes in any namespace.
You can configure the installation to deploy the Seldon Core 2 operator in a specific namespace so that it controls resources only in the provided namespace. To do this, set controller.clusterwide to false.
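As a hedged sketch of the corresponding command (mirroring the cluster-wide installation command used elsewhere in this guide; the namespace is a placeholder):
# Namespaced installation: the operator only manages resources in seldon-mesh.
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh \
  --set controller.clusterwide=false \
  --install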
Install Seldon Core 2 runtimes in the seldon-mesh namespace.
helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
--namespace seldon-mesh \
--install
Install Seldon Core 2 servers in the seldon-mesh namespace. Two example servers named mlserver-0 and triton-0 are installed so that you can load models to these servers after installation.
helm upgrade seldon-core-v2-servers seldon-charts/seldon-core-v2-servers \
--namespace seldon-mesh \
--install
Check that the Seldon Core 2 operator, runtimes, servers, and CRDs are installed in the seldon-mesh namespace. It might take a couple of minutes for all the Pods to be ready. To check the status of the Pods in real time, use: kubectl get pods -w -n seldon-mesh.
kubectl get pods -n seldon-mesh
The output should be similar to this:
NAME READY STATUS RESTARTS AGE
hodometer-749d7c6875-4d4vw 1/1 Running 0 4m33s
mlserver-0 3/3 Running 0 4m10s
seldon-dataflow-engine-7b98c76d67-v2ztq 0/1 CrashLoopBackOff 5 (49s ago) 4m33s
seldon-envoy-bb99f6c6b-4mpjd 1/1 Running 0 4m33s
seldon-modelgateway-5c76c7695b-bhfj5 1/1 Running 0 4m34s
seldon-pipelinegateway-584c7d95c-bs8c9 1/1 Running 0 4m34s
seldon-scheduler-0 1/1 Running 0 4m34s
seldon-v2-controller-manager-5dd676c7b7-xq5sm 1/1 Running 0 4m52s
triton-0 2/3 Running 0 4m10s
You can install Seldon Core 2 and its components using Ansible in one of the following methods:
To install Seldon Core 2 into a new local kind Kubernetes cluster, you can use the seldon-all playbook with a single command:
ansible-playbook playbooks/seldon-all.yaml
This creates a kind cluster and installs ecosystem dependencies such as Kafka,
Prometheus, OpenTelemetry, and Jaeger as well as all the seldon-specific components.
The seldon components are installed using helm-charts from the current git
checkout (../k8s/helm-charts/).
Internally this runs, in order, the following playbooks:
kind-cluster.yaml
setup-ecosystem.yaml
setup-seldon.yaml
You may pass any of the additional variables which are configurable for those playbooks to seldon-all.
For example:
ansible-playbook playbooks/seldon-all.yaml -e seldon_mesh_namespace=my-seldon-mesh -e install_prometheus=no -e @playbooks/vars/set-custom-images.yaml
Running the playbooks individually gives you more control over what and when it runs. For example, if you want to install into an existing k8s cluster.
Create a kind cluster:
ansible-playbook playbooks/kind-cluster.yaml
Set up the ecosystem:
ansible-playbook playbooks/setup-ecosystem.yaml
Seldon runs by default in the seldon-mesh namespace and a Jaeger pod and OpenTelemetry collector are installed in the namespace.
To install in a different <mynamespace> namespace:
ansible-playbook playbooks/setup-ecosystem.yaml -e seldon_mesh_namespace=<mynamespace>
Install Seldon Core 2 (run from the ansible/ folder):
ansible-playbook playbooks/setup-seldon.yaml
To install Seldon Core 2 in a different <mynamespace> namespace:
ansible-playbook playbooks/setup-seldon.yaml -e seldon_mesh_namespace=<mynamespace>
kubectl create ns seldon-mesh || echo "Namespace seldon-mesh already exists"
helm repo add seldon-charts https://seldonio.github.io/helm-charts/
helm repo update seldon-charts
helm upgrade seldon-core-v2-crds seldon-charts/seldon-core-v2-crds \
  --namespace default \
  --install
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh --set controller.clusterwide=true \
  --install
Load testing to understand latency and throughput behaviour for one model replica
Tuning performance for models. This can be done via changes to:
Infrastructure - choosing the right hardware, and configurations in Core related to CPUs, GPUs and memory.
Models - optimizing model artefacts in how they are structured, configured, stored. This can include model pruning, quantization, consideration of different model frameworks, and making sure that the model can achieve a high utilisation of the allocated resources.
Inference execution - the way in which inference is executed. This can include the choice of communication protocols (REST, gRPC), payload configuration, batching, and efficient execution of concurrent requests.
Testing Pipelines to identify the critical path based on performance of underlying models
Core 2 Configuration to optimize data-processing through pipelines
Scalability of Pipelines to understand how Core 2 components scale with the number of deployed pipelines and models
The Model Autoscaling with HPA approach enables users to scale Models based on custom metrics. This approach, along with Server Autoscaling, enables users to customize the scaling logic for models and to automate the scaling of Servers based on the needs of the Models hosted on them.
The Model and Server Autoscaling with HPA approach leverages HPA to autoscale Models and Servers in a coordinated way. This requires a 1-1 mapping of Models and Servers (no Multi-Model Serving). In this case, HPA can be set up for a Model and its associated Server, targeting the same custom metric (this is possible for Kubernetes-native metrics).
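Below is a hedged sketch of such an HPA object. It assumes a custom metric named infer_rps has already been exposed through Prometheus and prometheus-adapter, and that the Model resource can be used as an HPA scale target; the model name and metric values are placeholders.
# Sketch only: assumes a custom metric "infer_rps" exposed via prometheus-adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-model-hpa
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: iris
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: iris
      target:
        type: AverageValue
        averageValue: "100"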
Learn how to use PandasQuery for data transformation in Seldon Core, including query configuration and parameter handling.
This model allows a Pandas query to be run against the input to select rows. An example is shown below:
# samples/models/choice1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-one
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 1"This invocation check filters for tensor A having value 1.
The model also returns a tensor called status which indicates the operation run and whether it
was a success. If no rows satisfy the query then just a status tensor output will be returned.
For further details, see Pandas query
This model can be useful for conditional Pipelines. For example, you could have two invocations of this model:
and
By including these in a Pipeline as follows we can define conditional routes:
Here the mul10 model will be called if the choice-is-one model succeeds and the add10 model will be called if the choice-is-two model succeeds.
For more details, see the Pandas query documentation.
Learn how to configure model scheduling in Seldon Core, including resource allocation, scaling policies, and deployment strategies.
The Core 2 architecture is built around decoupling Model and Server CRs to allow for multi-model deployment, enabling multiple models to be loaded and served on one server replica or a single Pod. Multi-model serving allows for more efficient use of resources; see Multi-Model Serving for more information.
This architecture requires that Core 2 handles scheduling of models onto server pods natively. In particular, Core 2 implements different sorters and filters which are used to find the best Server that is able to host a given Model. We describe this process in the following section.
The scheduling process in Core 2 identifies a suitable candidate server for a given model through a series of steps. These steps involve sorting and filtering servers primarily based on the following criteria:
Server has matching Capabilities with Model spec.requirements.
Server has enough replicas to load the desired spec.replicas of the Model.
Each replica of the Server has enough available memory to load one replica of the model, as defined in spec.memory.
A Server that already hosts the Model is preferred, to reduce flip-flops between different candidate servers.
After a suitable candidate server is identified for a given model, Core 2 attempts to load the model onto it. If no matching server is found, the model is marked as ScheduleFailed.
This process is designed to be extensible, allowing for the addition of new filters in future versions to enhance scheduling decisions.
Core 2 (from 2.9) is able to do partial scheduling of Models. Partial scheduling is defined as the loading of enough replicas of the model above spec.minReplicas and up to the number of available Server replicas. This gives the user more flexibility in serving traffic while optimising infrastructure provisioning.
To enable partial scheduling, spec.minReplicas needs to be defined as it provides Core 2 the minimum replicas of the model that is required for serving.
Partial scheduling does not have an explicit state; instead, the model is marked as ModelAvailable and is ready to serve traffic. The status of the Model CR can be inspected, where DESIRED REPLICAS and AVAILABLE REPLICAS provide insight into the number of replicas currently loaded in Core 2. Based on this information, the following logic is applied:
Fully Scheduled: READY is True and DESIRED REPLICAS is equal to AVAILABLE REPLICAS (STATUS is ModelAvailable)
Partially Scheduled: READY is TRUE and DESIRED REPLICAS is greater than AVAILABLE REPLICAS (STATUS is ModelAvailable)
Not Scheduled: READY is False (STATUS is ScheduleFailed)
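For illustration (the model shown is the iris sample; the replica counts are illustrative), a Model that requests three replicas but tolerates partial scheduling down to one replica sets both fields:
# Sketch: replicas is the desired count, minReplicas the minimum needed
# for the model to be considered available (enables partial scheduling).
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  replicas: 3
  minReplicas: 1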
Learn how to manage Kafka topics in Seldon Core 2, including topic creation, configuration, and monitoring for model inference and event streaming.
A Model in Seldon Core 2 represents the fundamental unit for serving a machine learning artifact within a running server instance.
If Kafka is installed in your cluster, Seldon Core automatically creates dedicated input and output topics for each model as it is loaded. These topics facilitate asynchronous messaging, enabling clients to send input messages and retrieve output responses independently and at a later time.
By default, when a model is unloaded, the associated Kafka topics are preserved. This supports use cases like auditing, but can also lead to increased Kafka resource usage and unnecessary costs for workloads that don't require persistent topics.
You can control this behavior by configuring the cleanTopicsOnDelete flag in the Model's dataflow spec.
Before looking to make changes to improve latency or throughput of your models, it is important to undergo load testing to understand the existing performance characteristics of your model(s) when deployed onto the chosen inference server (MLServer or Triton). The goal of load testing should be to understand the performance behavior of deployed models in saturated (i.e at the maximum throughput a model replica can handle) and non-saturated regimes, then compare it with expected latency objectives.
The results here will also inform the setup of autoscaling parameters, with the target being to run each replica with some margin below the saturation throughput (say, by 10-20%) in order to ensure that latency does not degrade, and that there is sufficient capacity to absorb load for the time it takes for new inference server replicas (and model replicas) to become available.
An Experiment defines a traffic split between Models or Pipelines. This allows new versions of models and pipelines to be tested.
An experiment spec has three sections:
candidates (required) : a set of candidate models to split traffic.
default (optional) : an existing candidate whose endpoint should be modified to split traffic as defined by the candidates.
mirror (optional) : a model or pipeline that receives a copy of the traffic without affecting the responses returned to users.
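A minimal sketch of an Experiment manifest following these sections (the model names and weights are placeholders):
# Sketch: splits traffic 50/50 between two candidate models; "iris" is the
# default whose endpoint is modified, "iris2" a new version under test.
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample
spec:
  default: iris
  candidates:
  - name: iris
    weight: 50
  - name: iris2
    weight: 50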
Seldon Core 2 provides multiple approaches to scaling your machine learning deployments, allowing you to optimize resource utilization and handle varying workloads efficiently. In Core 2, we separate out Models and Servers, and Servers can have multiple Models loaded on them (Multi-Model Serving). Given this, setting up autoscaling requires defining the logic by which you want to scale your Models and then configuring the autoscaling of Servers such that they autoscale in a coordinated way. The following steps can be followed to set up autoscaling based on specific requirements:
Identify metrics that you want to scale Models on. There are a couple of different options for this.
Learn how to implement model explainability in Seldon Core using Alibi-Explain integration for black box model explanations and pipeline insights.
Explainers are Model resources with some extra settings. They allow a range of explainers from the Alibi-Explain library to be run on MLServer.
An example Anchor explainer definition is shown below.
The key additions are:
type: This must be one of the explainer types supported by the Alibi Explain runtime in MLServer.
modelRef: The model name for black box explainers.


When set to false (the default), the topics remain after the model is deleted.
When set to true, both the input and output topics are removed when the model is unloaded.
Here is an example of a manifest file that enables topic cleanup on deletion:
To inspect existing Kafka topics in your cluster, you can deploy a temporary Pod:
After the Pod is running, you can access it and list topics with the following command:
Apply the model manifest with topic cleanup enabled:
After deployment, you can list Kafka topics from within the kafka-busybox pod and confirm that input/output topics have been created:
To delete the model:
After deletion, list the topics again. You should see that the input and output topics have been successfully removed from Kafka:
Similar to models, when a Pipeline is deployed in Seldon Core 2, Kafka input and output topics are automatically created for it. These topics enable asynchronous processing across pipeline steps.
As with models, the cleanTopicsOnDelete flag controls whether these topics are retained or removed when the pipeline is deleted:
By default, topics are retained after the pipeline is unloaded.
When cleanTopicsOnDelete is set to true, the input and output topics associated with the pipeline are deleted.
Here is an example of a pipeline manifest that wraps the previously defined model and enables topic cleanup:
Apply the pipeline manifest with topic cleanup enabled:
After the pipeline is deployed, you can list the Kafka topics from inside the kafka-busybox pod to confirm that they have been created:
To delete the pipeline, run:
After deletion, list the Kafka topics again. You should observe that the pipeline's input and output topics have been removed:
pipelineRef: The pipeline name for black box explainers.
Only one of modelRef and pipelineRef is allowed.
Blackbox explainers can explain a Pipeline as well as a model. An example from the Huggingface sentiment demo is shown below.
# samples/models/choice2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-two
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 2"# samples/pipelines/choice.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: choice
spec:
steps:
- name: choice-is-one
- name: mul10
inputs:
- choice.inputs.INPUT
triggers:
- choice-is-one.outputs.choice
- name: choice-is-two
- name: add10
inputs:
- choice.inputs.INPUT
triggers:
- choice-is-two.outputs.choice
output:
steps:
- mul10
- add10
stepsJoin: any
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
dataflow:
cleanTopicsOnDelete: true
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
apiVersion: v1
kind: Pod
metadata:
name: kafka-busybox
spec:
containers:
- name: kafka-busybox
image: apache/kafka:latest
command: ["sleep", "3600"]
imagePullPolicy: IfNotPresent
restartPolicy: Always
kafka-busybox:/opt/kafka/bin$ ./kafka-topics.sh --list --bootstrap-server $SELDON_KAFKA_BOOTSTRAP_PORT_9092_TCP
kubectl apply -f model.yaml -n seldon-mesh
__consumer_offsets
seldon.seldon-mesh.model.iris.inputs
seldon.seldon-mesh.model.iris.outputs
kubectl delete -f model.yaml -n seldon-mesh
__consumer_offsets
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: iris-pipeline
spec:
dataflow:
cleanTopicsOnDelete: true
steps:
- name: iris
output:
steps:
- iris
kubectl apply -f pipeline.yaml -n seldon-mesh
__consumer_offsets
seldon.seldon-mesh.errors.errors
seldon.seldon-mesh.model.iris.inputs
seldon.seldon-mesh.model.iris.outputs
seldon.seldon-mesh.pipeline.iris-pipeline.inputs
seldon.seldon-mesh.pipeline.iris-pipeline.outputs
kubectl delete -f pipeline.yaml -n seldon-mesh
__consumer_offsets
seldon.seldon-mesh.errors.errors
seldon.seldon-mesh.model.iris.inputs
seldon.seldon-mesh.model.iris.outputs
# samples/models/income-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-explainer
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/income-sklearn/anchor-explainer"
explainer:
type: anchor_tabular
modelRef: income
# samples/models/hf-sentiment-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
explainer:
type: anchor_text
pipelineRef: sentiment-explain
Install Helm, the package manager for Kubernetes.
To use Seldon Core 2 in a production environment:
Seldon publishes the Helm charts that are required to install Seldon Core 2. For more information about the Helm charts and the related dependencies, see Helm charts and Dependencies.
Create a namespace to contain the main components of Seldon Core 2. For example, create the namespace seldon-mesh:
Create a namespace to contain the components related to monitoring. For example, create the namespace seldon-monitoring:
Add and update the Helm charts seldon-charts to the repository.
Install custom resource definitions for Seldon Core 2.
Install Seldon Core 2 operator in the seldon-mesh namespace.
This configuration installs the Seldon Core 2 operator across an entire Kubernetes cluster. To perform cluster-wide operations, create ClusterRoles and ensure your user has the necessary permissions during deployment. With cluster-wide operations, you can create SeldonRuntimes in any namespace.
With cluster-wide installation, you can specify the namespaces to watch by setting controller.watchNamespaces to a comma-separated list of namespaces (e.g., {ns1, ns2}). This allows the Seldon Core 2 operator to monitor and manage resources in those namespaces.
You can also install multiple operators in different namespaces, and configure them to watch a disjoint set of namespaces. For example, you can install two operators in op-ns1 and op-ns2, and configure them to watch ns1, ns2 and ns3, ns4, respectively, using the following commands:
We now install the first operator in op-ns1 and configure it to watch ns1 and ns2:
Next, we install the second operator in op-ns2 and configure it to watch ns3 and ns4:
Note that the second operator is installed with skipClusterRoleCreation=true to avoid re-creating the ClusterRole and ClusterRoleBinding that were created by the first operator.
Finally, you can configure the installation to deploy the Seldon Core 2 operator in a specific namespace so that it controls resources only in that namespace. To do this, set controller.clusterwide to false.
Install Seldon Core 2 runtimes in the namespace seldon-mesh.
Install Seldon Core 2 servers in the namespace seldon-mesh. Two example servers named mlserver-0, and triton-0 are installed so that you can load the models to these servers after installation.
Check that the Seldon Core 2 operator, runtimes, servers, and CRDs are installed in the namespace seldon-mesh:
The output should be similar to this:
You can integrate Seldon Core 2 with a self-hosted Kafka cluster or a managed Kafka service.
Drift detectors.
Outlier detectors.
Explainers.
Adversarial detectors.
A typical workflow for a production machine learning setup might be as follows:
You create a Tensorflow model for your core application use case and test this model in isolation to validate.
You create an SKLearn feature transformation component that runs before your model to convert the input into the correct form for your model. You also create Drift and Outlier detectors using Seldon's open source Alibi-detect library and test these in isolation.
You join these components together into a Pipeline for the final production setup.
These steps are shown in the diagram below:
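In Pipeline terms, the joined application might look roughly like the sketch below. This is only an illustration, assuming each component has already been deployed as a Model; the names preprocess, classifier, drift-detector, and outlier-detector are placeholders.
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: production-app
spec:
  steps:
  - name: preprocess
  - name: classifier
    inputs:
    - preprocess.outputs
  - name: drift-detector
    inputs:
    - preprocess.outputs
  - name: outlier-detector
    inputs:
    - preprocess.outputs
  output:
    steps:
    - classifier
Because only classifier appears under output.steps, the detectors run asynchronously off the streamed data and do not add latency to the synchronous path.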
This section provides examples for testing operations with Seldon so that you can run your own models, experiments, pipelines, and explainers.
A first target of load testing should be determining the maximum throughput that one model replica is able to sustain. We’ll refer to this as the (single replica) load saturation point. Increasing the number of inference requests per second (RPS) beyond this point degrades latency due to queuing that occurs at the bottleneck (a model processing step, contention on a resource such as CPU, memory, network, etc).
We recommend determining the load saturation point by running an open-loop load test that goes through a series of stages, each having a target RPS. A stage would first linearly ramp-up to its target RPS and then hold that RPS level constant for a set amount of time. The target RPS should monotonically increase between the stages.
Typically, load testing tools generate load by creating a number of "virtual users" that send requests. Knowing the behaviour of those virtual users is critical when interpreting the results of the load test. Load tests can be set up in closed-loop mode, meaning that each virtual user sends a request and waits for the response before sending the next one. Alternatively, there is open-loop mode, where a variable number of users are instantiated in order to maintain a constant overall RPS.
When running in closed-loop mode, an undesired side-effect called coordinated omission appears: when the system gets overloaded and latency spikes, the load test users effectively reduce the actual load on the system by sending requests less frequently.
When using closed-loop mode in load testing, be aware that reported latencies at a given throughput may be significantly smaller than what will be experienced in reality. In contrast, an open-loop load tester would maintain a constant RPS, resulting in a more accurate representation of the latencies that will be experienced when running the model in production.
Each candidate has a traffic weight. The percentage of traffic will be this weight divided by the sum of traffic weights.
mirror (optional): a single model to which traffic to the candidates is mirrored. Responses from this model are not returned to the caller.
An example experiment with a defaultModel is shown below:
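A minimal sketch of such a manifest, assuming an Experiment resource with the default and candidates fields described here (the experiment name experiment-sample is illustrative):
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample
spec:
  default: iris
  candidates:
  - name: iris
    weight: 50
  - name: iris2
    weight: 50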
This defines a split of 50% traffic between two models iris and iris2. In this case we want to
expose this traffic split on the existing endpoint created for the iris model. This allows us to
test new versions of models (in this case iris2) on an existing endpoint (in this case iris).
The default key defines the model whose endpoint we want to change. The experiment will become
active when both underlying models are in Ready status.
An experiment over two separate models which exposes a new API endpoint is shown below:
To call the endpoint, add the header seldon-model: <experiment-name>.experiment; in this case, seldon-model: experiment-iris.experiment. For example, with curl:
For examples see the local experiments notebook.
Running an experiment between pipelines is very similar. The difference is that resourceType: pipeline needs to be defined, and in this case the candidates or mirrors refer to pipelines. An example is shown below:
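A minimal sketch, assuming two existing pipelines named pipeline-a and pipeline-b (placeholder names); note the resourceType: pipeline field:
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-pipelines
spec:
  resourceType: pipeline
  default: pipeline-a
  candidates:
  - name: pipeline-a
    weight: 50
  - name: pipeline-b
    weight: 50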
For an example see the local experiments notebook.
A mirror can be added easily for model or pipeline experiments. An example model mirror experiment is shown below:
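A minimal sketch, reusing the iris model from the earlier example and mirroring all of its traffic to iris2; the mirror's responses are discarded, as described above, and the percent field shown here is assumed rather than confirmed by this page:
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-iris-mirror
spec:
  default: iris
  candidates:
  - name: iris
    weight: 100
  mirror:
    name: iris2
    percent: 100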
For an example see the local experiments notebook.
An example pipeline mirror experiment is shown below:
For an example see the local experiments notebook.
To allow cohorts to get consistent views in an experiment, each inference request passes back a response header x-seldon-route, which can be passed in future requests to the experiment to bypass the random traffic splits and get a prediction from the same sequence of models and pipelines used in the initial request.
Note: you must pass the normal seldon-model header along with the x-seldon-route header.
This is illustrated in the local experiments notebook.
As an alternative you can choose to run experiments at the service mesh level if you use one of the popular service meshes that allow header based routing in traffic splits. For further discussion see here.
Core 2 natively supports scaling based on Inference Lag, meaning the difference between incoming and outgoing requests for a model in a given period of time. This is done by configuring minReplicas or maxReplicas in the Model CRDs and making sure you configure the Core 2 install with the autoscaling.autoscalingModelEnabled helm value set to true (default is false).
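For illustration, a Model CR scaled on inference lag simply bounds its replica count. The sketch below reuses the iris example from elsewhere in these docs; the bounds themselves are illustrative:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  minReplicas: 1
  maxReplicas: 5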
Users can expose custom or Kubernetes-native metrics, and then target the scaling of models based on those metrics by using HorizontalPodAutoscaler. This requires exposing the right metrics, using the monitoring tool of your choice (e.g. Prometheus).
Once the approach for Model scaling is implemented, Server scaling needs to be configured.
Implement Server Scaling by either:
Enabling Autoscaling of Servers based on Model needs (sketched after this list). This is managed by Seldon's scheduler, and is enabled by setting minReplicas and maxReplicas in the Server Custom Resource and making sure you configure the Core 2 install with the autoscaling.autoscalingServerEnabled helm value set to true (the default).
If Models and Servers have a one-to-one mapping (no Multi-Model Serving), users can also define scaling of Servers using an HPA manifest that matches the HPA applied to the associated Models. This approach is outlined . It only works with custom metrics, as Kubernetes does not allow multiple HPAs to target the same Kubernetes metrics directly.
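To illustrate the first option, a Server CR with replica bounds is sketched below; the bounds are illustrative, and the mlserver name and serverConfig follow the defaults discussed in the servers section:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1
  maxReplicas: 5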
Based on the requirements above, one of the following three options for coordinated autoscaling of Models and Servers can be chosen:
Inference lag
✅
- Simplest Implementation - One metric across models
- Model scale down only when no inference traffic exists to that model
User-defined (HPA)
✅
- Custom scaling metric
- Requires Metrics store integration (e.g. Prometheus) - Potentially suboptimal Server packing on scale down
Alternatively, the following decision-tree showcases the approaches we recommend based on users' requirements:
When running Core 2 at scale, it is important to understand the scaling behaviour of Seldon's services as well as the scaling of the Models and Servers themselves. This is outlined in the Scaling Core Services page.
All CRD changes maintain backward compatibility with existing CRs. We introduce new Core 2 scaling configuration options in SeldonConfig (config.ScalingConfig.*), with the wider goal of centralising Core 2 configuration and allowing configuration changes after the Core 2 cluster is deployed. To ensure a smooth transition, some of these options only take effect from the next release (2.11), but end-users are encouraged to set them to the desired values before upgrading.
Upgrading when using helm is seamless, with existing helm values being used to fill in new configuration options. If not using helm, previous SeldonConfig CRs remain valid, but restrictive defaults will be used for the scaling configuration. One parameter in particular, maxShardCountMultiplier (see docs), needs to be set in order to take advantage of the new pipeline scalability features. This parameter can be changed, and its value is propagated to all components that use the config.
For full release notes, see .
Though there are no breaking changes between 2.8 and 2.9, there is some new functionality offered that requires changes to fields in our CRDs:
In Core 2.9 you can now set minReplicas to enable partial scheduling of Models. This means that users no longer have to wait for the full set of desired replicas to be available before models are loaded onto servers (e.g. when scaling up).
We've also added a spec.llm field to the Model CRD. The field is used by the PromptRuntime in Seldon's LLM Module to reference an LLM model. Only one of spec.llm and spec.explainer should be set at a given time. This allows the deployment of multiple "models" acting as prompt generators for the same LLM.
Due to the introduction of Server autoscaling, it is important to understand which type of autoscaling you want to leverage and how it can be configured. Below are the configuration options that control autoscaling behaviour. All options have corresponding command-line arguments that can be passed to seldon-scheduler when not using helm as the install method. The following helm values can be set:
Core 2.8 introduces several new fields in our CRDs:
statefulSetPersistentVolumeClaimRetentionPolicy enables users to configure the cleanup of PVCs on their servers. This field defaults to retain.
Status.selector was introduced as a mandatory field for models in 2.8.4 and made optional in 2.8.5. This field enables autoscaling with HPA.
PodSpec in the OverrideSpec
These added fields do not result in breaking changes, apart from 2.8.4 which required the setting of the Status.selector upon upgrading. This field was however changed to optional in the subsequent 2.8.5 release. Updating the CRDs (e.g. via helm) will enable users to benefit from the associated functionality.
All pods provisioned through the operator, i.e. SeldonRuntime and Server resources, now have the label app.kubernetes.io/name for identifying the pods.
Previously, the labelling was inconsistent across different versions of Seldon Core 2, with a mixture of app and app.kubernetes.io/name used.
If using the Prometheus operator ("Kube Prometheus"), please apply the v2.7.0 manifests for Seldon Core 2 according to the .
Note that these manifests need to be adjusted to discover metrics endpoints based on the existing setup.
If previous pod monitors had namespaceSelector fields set, these should be copied over and applied
to the new manifests.
If namespaces do not matter, cluster-wide metrics endpoint discovery can be set up by modifying the namespaceSelector field in the pod monitors:
Release 2.6 brings with it new custom resources SeldonConfig and SeldonRuntime, which provide
a new way to install Seldon Core 2 in Kubernetes. Upgrading in the same namespace will cause downtime
while the pods are being recreated. Alternatively, users with an external service mesh or other routing mechanism spanning multiple namespaces can bring up the system in a new namespace, redeploy models there, and then switch traffic between the namespaces.
If the new 2.6 charts are used to upgrade in an existing namespace, models will eventually be redeployed, but there will be service downtime while the core components are redeployed.
Learn how to verify your Seldon Core installation by running tests and checking component functionality.
To confirm the successful installation of Seldon Core 2, Kafka, and the service mesh, deploy a sample model and perform an inference test. Follow these steps:
Apply the following configuration to deploy the Iris model in the namespace seldon-mesh:
The output is:
Verify that the model is deployed in the namespace seldon-mesh.
When the model is deployed, the output is similar to:
Apply the following configuration to deploy a pipeline that wraps the Iris model in the namespace seldon-mesh:
The output is:
Verify that the pipeline is deployed in the namespace seldon-mesh.
When the pipeline is deployed, the output is similar to:
Use curl to send a test inference request to the deployed model. Replace <INGRESS_IP> with your service mesh's ingress IP address. Ensure that:
The Host header matches the expected virtual host configured in your service mesh.
The Seldon-Model header specifies the correct model name.
The output is similar to:
Use curl to send a test inference request through the pipeline to the deployed model. Replace <INGRESS_IP> with your service mesh's ingress IP address. Ensure that:
The Host header matches the expected virtual host configured in your service mesh.
The Seldon-Model header specifies the correct pipeline name.
To route inference requests to a pipeline endpoint, include the .pipeline suffix in the model name within the request header. This distinguishes the pipeline from a model that shares the same base name.
The output is similar to:
Learn how to configure and manage inference servers in Seldon Core 2, including MLServer and Triton server farms, model scheduling, and capability management.
By default Seldon installs two server farms using MLServer and Triton with 1 replica each. Models are scheduled onto servers based on the server's resources and whether the capabilities of the server matches the requirements specified in the Model request. For example:
This model specifies the requirement sklearn
There are default capabilities for each server, as follows:
MLServer
Triton
kubectl create ns seldon-mesh || echo "Namespace seldon-mesh already exists"
kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
helm repo add seldon-charts https://seldonio.github.io/helm-charts/
helm repo update seldon-charts
helm upgrade seldon-core-v2-crds seldon-charts/seldon-core-v2-crds \
--namespace default \
--install
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
--namespace seldon-mesh --set controller.clusterwide=true \
--install
curl http://${MESH_IP}/v2/models/experiment-iris/infer \
-H "Content-Type: application/json" \
-H "seldon-model: experiment-iris.experiment" \
-d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
EOF

User-defined (HPA)
❌
- Coordinated Model and Server scaling
- Requires Metrics store integration (e.g. Prometheus) - No Multi-Model Serving

autoscaling.autoscalingModelEnabled, with corresponding cmd line arg: --enable-model-autoscaling (defaults to false): enable or disable native model autoscaling based on lag thresholds. Enabling this assumes that lag (number of inference requests "in-flight") is a representative metric based on which to scale your models in a way that makes efficient use of resources.
autoscaling.autoscalingServerEnabled with corresponding cmd line arg: --enable-server-autoscaling (defaults to "true"): enable to use native server autoscaling, where the number of server replicas is set according to the number of replicas required by the models loaded onto that server.
autoscaling.serverPackingEnabled with corresponding cmd line arg: --server-packing-enabled (experimental, defaults to "false"): enable server packing to try and reduce the number of server replicas on model scale-down.
autoscaling.serverPackingPercentage with corresponding cmd line arg: --server-packing-percentage (experimental, defaults to "0.0"): controls the percentage of model replica removals (due to model scale-down or deletion) that should trigger packing
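For example, when installing via helm these values might be set as follows (a sketch; adjust the release name, namespace, and flags to your installation):
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh \
  --set autoscaling.autoscalingModelEnabled=true \
  --set autoscaling.autoscalingServerEnabled=true \
  --set autoscaling.serverPackingEnabled=false \
  --install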
Model 1 adds 10 to the input; Model 2 multiplies the input by 10. The structure of the artifact repository is shown below:
Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.
Ensure that you are performing these steps in the directory where you have downloaded the samples.
Get the IP address of the Seldon Core 2 instance running with Istio:
Servers can be defined with a capabilities field to indicate custom configurations (e.g. Python dependencies). For instance:
These capabilities override the ones from the serverConfig: mlserver. A model that takes advantage of this is shown below:
The above model will be matched with the previous custom server mlserver-134.
Servers can also be set up with extraCapabilities, which add to the existing capabilities from the referenced ServerConfig. For instance:
This server, mlserver-extra, inherits a default set of capabilities via serverConfig: mlserver.
These defaults are discussed above.
The extraCapabilities are appended to these to create a single list of capabilities for this server.
Models can then specify requirements to select a server that satisfies those requirements as follows.
The capabilities field takes precedence over the extraCapabilities field.
For some examples see here.
This is not supported within Docker; for Kubernetes, see here
for ns in op-ns1 op-ns2 ns1 ns2 ns3 ns4; do kubectl create ns "$ns"; done
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
--namespace op-ns1 \
--set controller.clusterwide=true \
--set "controller.watchNamespaces={ns1,ns2}" \
--install
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
--namespace op-ns2 \
--set controller.clusterwide=true \
--set "controller.watchNamespaces={ns3,ns4}" \
--set controller.skipClusterRoleCreation=true \
--install
helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
--namespace seldon-mesh \
--install
helm upgrade seldon-core-v2-servers seldon-charts/seldon-core-v2-servers \
--namespace seldon-mesh \
--install
kubectl get pods -n seldon-mesh
NAME READY STATUS RESTARTS AGE
hodometer-749d7c6875-4d4vw 1/1 Running 0 4m33s
mlserver-0 3/3 Running 0 4m10s
seldon-dataflow-engine-7b98c76d67-v2ztq 0/1 CrashLoopBackOff 5 (49s ago) 4m33s
seldon-envoy-bb99f6c6b-4mpjd 1/1 Running 0 4m33s
seldon-modelgateway-5c76c7695b-bhfj5 1/1 Running 0 4m34s
seldon-pipelinegateway-584c7d95c-bs8c9 1/1 Running 0 4m34s
seldon-scheduler-0 1/1 Running 0 4m34s
seldon-v2-controller-manager-5dd676c7b7-xq5sm 1/1 Running 0 4m52s
triton-0 2/3 Running 0 4m10s
spec:
namespaceSelector:
any: true
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/iris condition met
kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: irispipeline
spec:
steps:
- name: iris
output:
steps:
- iris
EOF
pipeline.mlops.seldon.io/irispipeline created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/irispipeline condition met
curl -k http://<INGRESS_IP>:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: iris" \
-d '{
"inputs": [
{
"name": "predict",
"shape": [1, 4],
"datatype": "FP32",
"data": [[1, 2, 3, 4]]
}
]
}'{"model_name":"iris_1","model_version":"1","id":"f4d8b82f-2af3-44fb-b115-60a269cbfa5e","parameters":{},"outputs":[{"name":"predict","shape":[1,1],"datatype":"INT64","parameters":{"content_type":"np"},"data":[2]}]}curl -k http://<INGRESS_IP>:80/v2/models/irispipeline/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: irispipeline.pipeline" \
-d '{
"inputs": [
{
"name": "predict",
"shape": [1, 4],
"datatype": "FP32",
"data": [[1, 2, 3, 4]]
}
]
}'{"model_name":"","outputs":[{"data":[2],"name":"predict","shape":[1,1],"datatype":"INT64","parameters":{"content_type":"np"}}]}curl -k http://<INGRESS_IP>:80/v2/models/math/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: math" \
-d '{
"model_name": "math",
"inputs": [
{
"name": "INPUT",
"datatype": "FP32",
"shape": [4],
"data": [1, 2, 3, 4]
}
]
}' | jq
{
"model_name": "math_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
4
],
"data": [
11.0,
12.0,
13.0,
14.0
]
}
]
}
seldon model infer math --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"math","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"modelName": "math_1",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
curl -k http://<INGRESS_IP>:80/v2/models/math/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: math" \
-d '{
"model_name": "math",
"inputs": [
{
"name": "INPUT",
"datatype": "FP32",
"shape": [4],
"data": [1, 2, 3, 4]
}
]
}' | jq
{
"model_name": "math_2",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
4
],
"data": [
10.0,
20.0,
30.0,
40.0
]
}
]
}
seldon model infer math --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"math","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"modelName": "math_2",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
}
]
}
config.pbtxt
1/model.py <add 10>
2/model.py <mul 10>
ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"cat ./models/multi-version-1.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: math
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/multi-version"
artifactVersion: 1
requirements:
- triton
- python
kubectl apply -f ./models/multi-version-1.yaml -n seldon-mesh
model.mlops.seldon.io/math created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/math condition met
cat ./models/multi-version-2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: math
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/multi-version"
artifactVersion: 2
requirements:
- triton
- python
kubectl apply -f ./models/multi-version-2.yaml -n seldon-mesh
model.mlops.seldon.io/math configured
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/math condition met
kubectl delete -f ./models/multi-version-1.yaml -n seldon-mesh
model.mlops.seldon.io "math" deleted
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
- name: SELDON_SERVER_CAPABILITIES
value: "mlserver,alibi-detect,alibi-explain,huggingface,lightgbm,mlflow,python,sklearn,spark-mlib,xgboost"
- name: SELDON_SERVER_CAPABILITIES
value: "triton,dali,fil,onnx,openvino,python,pytorch,tensorflow,tensorrt"
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-134
spec:
serverConfig: mlserver
capabilities:
- mlserver-1.3.4
podSpec:
containers:
- image: seldonio/mlserver:1.3.4
name: mlserver
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- mlserver-1.3.4
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-extra
spec:
serverConfig: mlserver
extraCapabilities:
- extra
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: extra-model-requirements
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- extra
Learn how to implement HPA-based autoscaling for both Models and Servers in single-model serving deployments.
This page describes how to implement autoscaling for both Models and Servers using Kubernetes HPA (Horizontal Pod Autoscaler) in a single-model serving setup. This approach is specifically designed for deployments where each Server hosts exactly one Model replica (1:1 mapping between Models and Servers).
In single-model serving deployments, you can use HPA to scale both Models and their associated Servers independently, while ensuring they scale in a coordinated manner. This is achieved by:
Setting up HPA for Models based on custom metrics (e.g., RPS)
Setting up matching HPA configurations for Servers
Ensuring both HPAs target the same metrics and scaling policies
Only custom metrics from Prometheus are supported. Native Kubernetes resource metrics such as CPU or memory are not. This limitation exists because of HPA's design: In order to prevent multiple HPA CRs from issuing conflicting scaling instructions, each HPA CR must exclusively control a set of pods which is disjoint from the pods controlled by other HPA CRs. In Seldon Core 2, CPU/memory metrics can be used to scale the number of Server replicas via HPA. However, this also means that the CPU/memory metrics from the same set of pods can no longer be used to scale the number of model replicas.
No Multi-Model Serving - this approach requires a 1-1 mapping of Models and Servers, meaning that consolidation of multiple Models onto shared Servers (Multi-Model Serving) is not possible.
In order to set up HPA Autoscaling for Models and Servers together, metrics need to be exposed in the same way as is explained in the tutorial. If metrics have not yet been exposed, follow that workflow until you are at the point where you will configure and apply the HPA manifests.
In this implementation, Servers are configured to autoscale via a separate HPA manifest that targets the same metric as the Models' HPA. For the scaling to stay in sync, it is important to apply the two HPA manifests simultaneously and to keep both the scaling metric and any scaling policies identical across them. This ensures that the Models and the Servers are scaled up and down at approximately the same time. Small variations in the scale-up time are expected because each HPA samples the metrics independently, at regular intervals.
Here's an example configuration, utilizing the same infer_rps metric as was set up in the .
In order to ensure similar scaling behaviour between Models and Servers, the number of minReplicas and maxReplicas, as well as any other configured scaling policies should be kept in sync across the HPA for the model and the server.
Each HPA CR has its own timer on which it samples the specified custom metrics. This timer starts when the CR is created, with sampling of the metric done at regular intervals (by default, every 15 seconds). As a side effect, creating the Model HPA and the Server HPA (for a given model) at different times means that the scaling decisions for the two are taken at different times. Even when creating the two CRs together as part of the same manifest, there will usually be a small delay between the points at which the Model and Server spec.replicas values are changed. Despite this delay, the two converge to the same number when the decisions are taken based on the same metric (as in the previous examples).
When showing the HPA CR information via kubectl get, a column of the output will display the current metric value per replica and the target average value in the format [per replica metric value]/[target]. This information is updated in accordance to the sampling rate of each HPA resource. It is therefore expected to sometimes see different metric values for the Model and its corresponding Server.
In this implementation, the scheduler itself does not create new Server replicas when the existing replicas are not sufficient for loading a Model's replicas (one Model replica per Server replica). Whenever a Model requests more replicas than available on any of the available Servers, its ModelReady condition transitions to Status: False with a ScheduleFailed message. However, any replicas of that Model that are already loaded at that point remain available for servicing inference load.
The following elements are important to take into account when setting the HPA policies.
The speed with which new Server replicas can become available versus how many new replicas HPA may request in a given time:
The HPA scale-up policy should not be configured to request more replicas than can become available in the specified time. The following example reflects a confidence that 5 Server pods will become available within 90 seconds, with some safety margin. The default scale-up config, which also adds a percentage-based policy (doubling the existing replicas within the set periodSeconds), is not recommended for this reason.
Perhaps more importantly, there is no reason to scale faster than the time it takes for replicas to become available - this is the true maximum rate with which scaling up can happen anyway.
The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.
The previous example, at line 13, configures a scale-up stabilization window of one minute. It means that for all of the HPA recommended replicas in the last 60 second window (4 samples of the custom metric considering the default sampling rate), only the smallest will be applied.
Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.
Load the model
kubectl apply -f ./models/hf-text-gen.yaml
model.mlops.seldon.io/text-gen created
seldon model load -f ./models/hf-text-gen.yaml
{}
Wait for the model to be ready
kubectl get model text-gen -n ${NAMESPACE} -o json | jq -r '.status.conditions[] | select(.message == "ModelAvailable") | .status'
True
seldon model status text-gen -w ModelAvailable | jq -M .
{}
Do a REST inference call
Unload the model
Load the model
Unload the model
Learn how to implement outlier detection in Seldon Core using Alibi-Detect integration for model monitoring and anomaly detection.
cat ./models/hf-text-gen.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: text-gen
spec:
storageUri: "gs://seldon-models/mlserver/huggingface/text-generation"
requirements:
- huggingface
kubectl get --raw
The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.
It is useful to consider whether the replica scale-up rate configured via the policy (line 15 in the example) is able to keep up with this RPS increase rate.
Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment as you are draining the "blue" deployment and transitioning to the "green" one.

curl --location 'http://${MESH_IP}:9000/v2/models/text-gen/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'{
"model_name": "text-gen_1",
"model_version": "1",
"id": "121ff5f4-1d4a-46d0-9a5e-4cd3b11040df",
"parameters": {},
"outputs": [
{
"name": "output",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "hg_jsonlist"
},
"data": [
"{\"generated_text\": \"Once upon a time in a galaxy far away, the planet is full of strange little creatures. A very strange combination of creatures in that universe, that is. A strange combination of creatures in that universe, that is. A kind of creature that is\"}"
]
}
]
}
# samples/models/cifar10-outlier-detect.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10-outlier
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/outlier-detector"
requirements:
- mlserver
- alibi-detect
The capabilities field overrides the defaults from the referenced ServerConfig, while the extraCapabilities field extends the existing list from the ServerConfig.
import os
os.environ["NAMESPACE"] = "seldon-mesh"MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: irisa0-model-hpa
namespace: seldon-mesh
spec:
scaleTargetRef:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
minReplicas: 1
maxReplicas: 3
metrics:
- type: Object
object:
metric:
name: infer_rps
describedObject:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
target:
type: AverageValue
averageValue: 3
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: mlserver-server-hpa
namespace: seldon-mesh
spec:
scaleTargetRef:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
name: mlserver
minReplicas: 1
maxReplicas: 3
metrics:
- type: Object
object:
metric:
name: infer_rps
describedObject:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
target:
type: AverageValue
averageValue: 3
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: irisa0-model-hpa
namespace: seldon-mesh
spec:
scaleTargetRef:
...
minReplicas: 1
maxReplicas: 3
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 5
periodSeconds: 90
metrics:
...
seldon model infer text-gen \
'{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'{
"model_name": "text-gen_1",
"model_version": "1",
"id": "121ff5f4-1d4a-46d0-9a5e-4cd3b11040df",
"parameters": {},
"outputs": [
{
"name": "output",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "hg_jsonlist"
},
"data": [
"{\"generated_text\": \"Once upon a time in a galaxy far away, the planet is full of strange little creatures. A very strange combination of creatures in that universe, that is. A strange combination of creatures in that universe, that is. A kind of creature that is\"}"
]
}
]
}
res = !seldon model infer text-gen --inference-mode grpc \
'{"inputs":[{"name":"args","contents":{"bytes_contents":["T25jZSB1cG9uIGEgdGltZSBpbiBhIGdhbGF4eSBmYXIgYXdheQo="]},"datatype":"BYTES","shape":[1]}]}'import json
import base64
r = json.loads(res[0])
base64.b64decode(r["outputs"][0]["contents"]["bytesContents"][0])
b'{"generated_text": "Once upon a time in a galaxy far away\\n\\nThe Universe is a big and massive place. How can you feel any of this? Your body doesn\'t make sense if the Universe is in full swing \\u2014 you don\'t have to remember whether the"}'
kubectl delete model text-gen
seldon model unload text-gen
cat ./models/hf-text-gen-custom-tiny-stories.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: custom-tiny-stories-text-gen
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/huggingface-text-gen-custom-tiny-stories"
requirements:
- huggingface
kubectl apply -f ./models/hf-text-gen-custom-tiny-stories.yaml
model.mlops.seldon.io/custom-tiny-stories-text-gen created
kubectl get model custom-tiny-stories-text-gen -n ${NAMESPACE} -o json | jq -r '.status.conditions[] | select(.message == "ModelAvailable") | .status'
True
curl --location 'http://${MESH_IP}:9000/v2/models/custom-tiny-stories-text-gen/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'{
"model_name": "custom-tiny-stories-text-gen_1",
"model_version": "1",
"id": "d0fce59c-76e2-4f81-9711-1c93d08bcbf9",
"parameters": {},
"outputs": [
{
"name": "output",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "hg_jsonlist"
},
"data": [
"{\"generated_text\": \"Once upon a time in a galaxy far away. It was a very special place to live.\\n\"}"
]
}
]
}
seldon model load -f ./models/hf-text-gen-custom-tiny-stories.yaml
{}
seldon model status custom-tiny-stories-text-gen -w ModelAvailable | jq -M .
{}
seldon model infer custom-tiny-stories-text-gen \
'{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'{
"model_name": "custom-tiny-stories-text-gen_1",
"model_version": "1",
"id": "d0fce59c-76e2-4f81-9711-1c93d08bcbf9",
"parameters": {},
"outputs": [
{
"name": "output",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "hg_jsonlist"
},
"data": [
"{\"generated_text\": \"Once upon a time in a galaxy far away. It was a very special place to live.\\n\"}"
]
}
]
}
res = !seldon model infer custom-tiny-stories-text-gen --inference-mode grpc \
'{"inputs":[{"name":"args","contents":{"bytes_contents":["T25jZSB1cG9uIGEgdGltZSBpbiBhIGdhbGF4eSBmYXIgYXdheQo="]},"datatype":"BYTES","shape":[1]}]}'import json
import base64
r = json.loads(res[0])
base64.b64decode(r["outputs"][0]["contents"]["bytesContents"][0])
b'{"generated_text": "Once upon a time in a galaxy far away\\nOne night, a little girl named Lily went to"}'
kubectl delete custom-tiny-stories-text-gen
seldon model unload custom-tiny-stories-text-gen
As a next step, why not try running a larger-scale model? You can find a definition for one in ./models/hf-text-gen-custom-gpt2.yaml. However, you may need to request and allocate more memory!
'172.18.255.2'
cat ./servers/custom-mlserver-capabilities.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-134
spec:
serverConfig: mlserver
capabilities:
- mlserver-1.3.4
podSpec:
containers:
- image: seldonio/mlserver:1.3.4
name: mlserver
kubectl create -f ./servers/custom-mlserver-capabilities.yaml -n ${NAMESPACE}
server.mlops.seldon.io/mlserver-134 created
kubectl wait --for condition=ready --timeout=300s server --all -n ${NAMESPACE}
server.mlops.seldon.io/mlserver condition met
server.mlops.seldon.io/mlserver-134 condition met
server.mlops.seldon.io/triton condition met
cat ./models/iris-custom-requirements.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- mlserver-1.3.4
kubectl create -f ./models/iris-custom-requirements.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
seldon model infer iris --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "057ae95c-e6bc-4f57-babf-0817ff171729",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
kubectl delete -f ./models/iris-custom-server.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
kubectl delete -f ./servers/custom-mlserver.yaml -n ${NAMESPACE}
server.mlops.seldon.io "mlserver-134" deleted
cat ./servers/custom-mlserver.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-134
spec:
serverConfig: mlserver
extraCapabilities:
- mlserver-1.3.4
podSpec:
containers:
- image: seldonio/mlserver:1.3.4
name: mlserver
kubectl create -f ./servers/custom-mlserver.yaml -n ${NAMESPACE}
server.mlops.seldon.io/mlserver-134 created
kubectl wait --for condition=ready --timeout=300s server --all -n ${NAMESPACE}
server.mlops.seldon.io/mlserver condition met
server.mlops.seldon.io/mlserver-134 condition met
server.mlops.seldon.io/triton condition met
cat ./models/iris-custom-server.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
server: mlserver-134
kubectl create -f ./models/iris-custom-server.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
seldon model infer iris --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "a3e17c6c-ee3f-4a51-b890-6fb16385a757",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
kubectl delete -f ./models/iris-custom-server.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
kubectl delete -f ./servers/custom-mlserver.yaml -n ${NAMESPACE}
server.mlops.seldon.io "mlserver-134" deleted
Learn how to configure storage secrets in Seldon Core 2 for secure model artifact access using Rclone, including AWS S3, GCS, and MinIO integration.
Inference artifacts referenced by Models can be stored in any of the storage backends supported by Rclone. This includes local filesystems, AWS S3, and Google Cloud Storage (GCS), among others. Configuration is provided out-of-the-box for public GCS buckets, which enables the use of Seldon-provided models like in the below example:
This configuration is provided by the Kubernetes Secret seldon-rclone-gs-public.
It is made available to Servers as a preloaded secret.
You can define and use your own storage configurations in exactly the same way.
To define a new storage configuration, you need the following details:
Remote name
Remote type
Provider parameters
A remote is what Rclone calls a storage location.
The type defines what protocol Rclone should use to talk to this remote.
A provider is a particular implementation for that storage type.
Some storage types have multiple providers, such as s3 having AWS S3 itself, MinIO, Ceph, and so on.
The remote name is your choice.
The prefix you use for models in spec.storageUri must be the same as this remote name.
The remote type is one of the values .
For example, for AWS S3 it is s3 and for Dropbox it is dropbox.
The provider parameters depend entirely on the remote type and the specific provider you are using.
Please check the Rclone documentation for the appropriate provider.
Note that the Rclone docs for storage types call the parameters "properties" and provide both config and env var formats; you need to use the config format.
For example, the GCS parameter --gcs-client-id described there should be used as client_id.
For reference, this format is described in the .
Note that we do not support the use of opts discussed in that section.
Kubernetes Secrets are used to store Rclone configurations, or storage secrets, for use by Servers. Each Secret should contain exactly one Rclone configuration.
A Server can use storage secrets in one of two ways:
It can dynamically load a secret specified by a Model in its .spec.secretName
It can use global configurations made available via
The name of a Secret is entirely your choice, as is the name of the data key in that Secret. All that matters is that there is a single data key and that its value is in the format described above.
Rather than Models always having to specify which secret to use, a Server can load storage secrets ahead of time. These can then be reused across many Models.
When using a preloaded secret, the Model definition should leave .spec.secretName empty.
The protocol prefix in .spec.storageUri still needs to match the remote name specified by a storage secret.
The secrets to preload are named in a centralised ConfigMap called seldon-agent.
This ConfigMap applies to all Servers managed by the same SeldonRuntime.
By default this ConfigMap only includes seldon-rclone-gs-public, but can be extended with your own secrets as shown below:
The easiest way to change this is to update your SeldonRuntime.
If your SeldonRuntime is configured using the seldon-core-v2-runtime Helm chart, the corresponding value is config.agentConfig.rclone.configSecrets.
This can be used as shown below:
Otherwise, if your SeldonRuntime is configured directly, you can add secrets by setting .spec.config.agentConfig.rclone.config_secrets.
This can be used as follows:
Assuming you have installed MinIO in the minio-system namespace, a corresponding secret could be:
You can then reference this in a Model with .spec.secretName:
For Google Cloud Storage, create a service account that Rclone can use for access. You can generate the credentials for a service account using the gcloud CLI:
The contents of gcloud-application-credentials.json can be put into a secret:
You can then reference this in a Model with .spec.secretName:
To run this example in Kind, we need to start Kind with access to a local folder where our models are located. In this example we use a folder in /tmp and associate it with a path in the container.
To start a Kind cluster, see .
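A minimal Kind cluster config sketch for this, assuming the models live under /tmp/models on the host and should appear at /models inside the node container:
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /tmp/models
    containerPath: /models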
Create the local folder for models and copy an example iris sklearn model to it.
Create a storage class and an associated persistent volume referencing the /models folder where the models are stored.
Now create a new Server based on the provided MLServer configuration, extended with our PVC by adding it to the rclone container; this allows Rclone to move models from the PVC onto the server.
We also add a new capability, pvc, so that models requiring the PVC can be scheduled to this server.
Use a simple sklearn iris classification model with the added pvc requirement so that MLServer with the PVC is targeted during scheduling.
Do a gRPC inference call
Learn how to set up and configure Kafka for Seldon Core in production environments, including cluster setup and security configuration.
Kafka is a component in the Seldon Core 2 ecosystem that provides scalable, reliable, and flexible communication for machine learning deployments. It serves as a strong backbone for building complex inference pipelines, managing high-throughput asynchronous predictions, and integrating seamlessly with event-driven systems, which are key capabilities for contemporary enterprise-grade ML platforms.
An inference request is a request sent to a machine learning model to make a prediction or inference based on input data. It is a core concept in deploying machine learning models in production, where models serve predictions to users or systems in real-time or batch mode.
To explore this feature of Seldon Core 2, you need to integrate with Kafka. Integrate Kafka through managed cloud services or by deploying it directly within a Kubernetes cluster.
provides more information about the encryption and authentication.
provides the steps to configure some of the managed Kafka services.
Seldon Core 2 requires Kafka to implement data-centric inference Pipelines. To install Kafka for testing purposes in your Kubernetes cluster, use . For more information, see
Learn more about using taints and tolerations with node affinity or node selector to allocate resources in a Kubernetes cluster.
When deploying machine learning models in Kubernetes, you may need to control which infrastructure resources these models use. This is especially important in environments where certain workloads, such as resource-intensive models, should be isolated from others or where specific hardware such as GPUs, needs to be dedicated to particular tasks. Without fine-grained control over workload placement, models might end up running on suboptimal nodes, leading to inefficiencies or resource contention.
For example, you may want to:
Isolate inference workloads from control plane components or other services to prevent resource contention.
Ensure that GPU nodes are reserved exclusively for models that require hardware acceleration.
Keep business-critical models on dedicated nodes to ensure performance and reliability.
Run external dependencies like Kafka on separate nodes to avoid interference with inference workloads.
To solve these problems, Kubernetes provides mechanisms such as taints, tolerations, and nodeAffinity or nodeSelector to control resource allocation and workload scheduling.
Taints are applied to nodes and tolerations to Pods to control which Pods can be scheduled on specific nodes within the Kubernetes cluster. Pods without a matching toleration for a node’s taint are not scheduled on that node. For instance, if a node has GPUs or other specialized hardware, you can prevent Pods that don’t need these resources from running on that node to avoid unnecessary resource usage.
When used together, taints and tolerations with nodeAffinity or nodeSelector can effectively allocate certain Pods to specific nodes, while preventing other Pods from being scheduled on those nodes.
In a Kubernetes cluster running Seldon Core 2, this involves two key configurations:
Configuring servers to run on specific nodes using mechanisms like taints, tolerations, and nodeAffinity or nodeSelector.
Configuring models so that they are scheduled and loaded on the appropriate servers.
This ensures that models are deployed on the optimal infrastructure and servers that meet their requirements.
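As a hedged sketch of the first configuration, a Server CR could pin its pods to dedicated GPU nodes through the podSpec override; the taint key, node label, and gpu capability below are placeholders rather than values defined by Seldon:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu
spec:
  serverConfig: mlserver
  extraCapabilities:
  - gpu
  podSpec:
    nodeSelector:
      nodegroup: gpu-nodes
    tolerations:
    - key: "gpu-only"
      operator: "Exists"
      effect: "NoSchedule"
A Model can then be directed to this server either by naming it in spec.server or by listing the gpu capability in its requirements, as described in the server and capability sections.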
Server configurations define how to create an inference server. By default, one is provided for Seldon MLServer and one for NVIDIA Triton Inference Server. Both of these servers support the Open Inference Protocol, which is a requirement for all inference servers. A server configuration defines the Kubernetes ReplicaSet, which includes the Seldon Agent reverse proxy as well as an Rclone server for downloading artifacts for the server. The Kustomize ServerConfig for MLServer is shown below:
This notebook will show how we can update running experiments.
We will use three SKlearn Iris classification models to illustrate experiment updates.
Load all models.
Let's call all three models individually first.
We will start an experiment to change the iris endpoint to split traffic with the iris2 model.
Learn how to configure Seldon Core installation components using SeldonConfig resource, including component specifications, Kafka settings, and tracing configuration.
The SeldonConfig resource defines the core installation components installed by Seldon. If you wish to install Seldon, you can use the SeldonRuntime resource, which allows easy overriding of some parts defined in this specification. In general, we advise core DevOps teams to use the default SeldonConfig or customize it for their usage. Individual installations of Seldon can then use the SeldonRuntime with a few overrides for the customisation needed in that namespace.
The specification contains core PodSpecs for each core component and a section for general configuration including the ConfigMaps that are created for the Agent (rclone defaults), Kafka and Tracing (open telemetry).
Some of these values can be overridden on a per namespace basis via the SeldonRuntime resource. Labels and annotations can also be set at the component level - these will be merged with the labels and annotations from the SeldonConfig resource in which they are defined and added to the component's corresponding Deployment, or StatefulSet.
Learn how to implement drift detection in Seldon Core 2 using Alibi-Detect integration for model monitoring and batch processing.
Drift detection models are treated as any other Model. You can run any saved drift detection model by
adding the requirement alibi-detect.
An example drift detection model from the CIFAR10 image classification example is shown below:
Usually you would run these models in an asynchronous part of a Pipeline, i.e. they are not connected to the output of the Pipeline which defines the synchronous path. For example, the CIFAR-10 image detection example uses a pipeline as shown below:
Note how the cifar10-drift model is not part of the path to the outputs. Drift alerts can be
read from the Kafka topic of the model.
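To peek at those alerts you can consume the drift model's output topic directly. The sketch below assumes the topic naming shown earlier (seldon.<namespace>.model.<model>.outputs) and a broker address in $KAFKA_BOOTSTRAP:
./kafka-console-consumer.sh --bootstrap-server $KAFKA_BOOTSTRAP \
  --topic seldon.seldon-mesh.model.cifar10-drift.outputs \
  --from-beginning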
When tuning performance for pipelines, reducing the overhead of Core 2 components responsible for data-processing within a pipeline is another aspect to consider. In Core 2, four components influence this overhead:
pipelinegateway which handles pipeline requests
modelgateway which sends requests to model inference servers
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
import os
os.environ["NAMESPACE"] = "seldon-mesh"
MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP
'172.19.255.1'
dataflow-engine runs Kafka KStream topologies to manage data streamed between models in a pipeline
the Kafka cluster
Given that Core 2 uses Kafka as the messaging system to communicate data between models, lowering the network latency between Core 2 and Kafka (especially when the Kafka installation is in a separate cluster to Core) will improve pipeline performance.
Additionally, the number of Kafka partitions per topic (which must be fixed across all models in a pipeline) significantly influences
Kafka’s maximum throughput, and
the effective number of replicas for pipelinegateway, dataflow-engine and modelgateway
As a baseline for serving a high inferencing RPS across multiple pipelines, we recommend using as many replicas of pipelinegateway and dataflow-engine as you have Kafka topic partitions in order to leverage the balanced distribution of inference traffic. In this case, each dataflow-engine will process the data from one partition, across all pipelines. Increasing the number of dataflow-engine replicas further starts sharding the pipelines across the available replicas, with each pipeline being processed by maxShardCountMultiplier replicas (see detailed pipeline scalability docs for configuration details)
Similarly, modelgateway can handle more throughput if its number of workers and number of replicas is increased. modelgateway has two scalability parameters that can be set via environment variables:
MODELGATEWAY_NUM_WORKERS
MODELGATEWAY_MAX_NUM_CONSUMERS
Each model within a Kubernetes namespace is consistently assigned to one modelgateway consumer (based on its index in a hash table of size MODELGATEWAY_MAX_NUM_CONSUMERS); the size of the hash table therefore influences how many models share the same consumer.
For each consumer, a MODELGATEWAY_NUM_WORKERS number of lightweight inference workers (goroutines) are created to forward requests to the inference servers and wait for responses.
Increasing these parameters (starting with an increase in the number of workers) will improve throughput if the modelgateway pod has enough resources to support more workers and consumers.
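A sketch of raising these values by overriding the modelgateway pod spec in a SeldonRuntime; the component and container names are assumptions, and the numbers are purely illustrative:
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  seldonConfig: default
  overrides:
  - name: seldon-modelgateway          # assumed component name
    podSpec:
      containers:
      - name: modelgateway             # assumed container name
        env:
        - name: MODELGATEWAY_NUM_WORKERS
          value: "16"                  # workers per consumer (illustrative)
        - name: MODELGATEWAY_MAX_NUM_CONSUMERS
          value: "100"                 # size of the consumer hash table (illustrative)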
%env INFER_ENDPOINT=0.0.0.0:9000env: INFER_ENDPOINT=0.0.0.0:9000
cat ./models/tfsimple1.yaml
# samples/models/cifar10-drift-detect.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10-drift
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/drift-detector"
requirements:
- mlserver
- alibi-detect
Models will still be ready even though the Pipeline is terminated.
Now we update the experiment to change to a split with the iris3 model.
Now we should see a split with the iris3 model.
Now that the experiment has been stopped, we check everything is as before.
Here we test changing the model we want to split traffic on. We will use three SKLearn Iris classification models to illustrate.
Let's call all three models to verify initial conditions.
Now we start an experiment to change calls to the iris model to split with the iris2 model.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Now let's change the model the experiment modifies to the iris3 model, splitting traffic between that and iris2.
Let's check the iris model is now as before but the iris3 model has traffic split.
Finally, let's check that, now the experiment has stopped, everything is as at the start.
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml
seldon model load -f ./models/sklearn3.yaml{}
{}
{}
# samples/auth/agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: seldon-agent
data:
agent.json: |-
{
"rclone" : {
"config_secrets": ["seldon-rclone-gs-public","minio-secret"]
},
}
config:
agentConfig:
rclone:
configSecrets:
- my-s3
- custom-gcs
- minio-in-cluster
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
name: seldon
spec:
seldonConfig: default
config:
agentConfig:
rclone:
config_secrets:
- my-s3
- custom-gcs
- minio-in-cluster
...
# samples/auth/minio-secret.yaml
apiVersion: v1
kind: Secret
metadata:
name: minio-secret
namespace: seldon-mesh
type: Opaque
stringData:
s3: |
type: s3
name: s3
parameters:
provider: minio
env_auth: false
access_key_id: minioadmin
secret_access_key: minioadmin
endpoint: http://minio.minio-system:9000
# samples/models/sklearn-iris-minio.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "s3://models/iris"
secretName: "minio-secret"
requirements:
- sklearn
gcloud iam service-accounts keys create \
gcloud-application-credentials.json \
--iam-account [SERVICE-ACCOUNT--NAME]@[PROJECT-ID].iam.gserviceaccount.com
apiVersion: v1
kind: Secret
metadata:
name: gcs-bucket
type: Opaque
stringData:
gcs: |
type: gcs
name: gcs
parameters:
service_account_credentials: '<gcloud-application-credentials.json>'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mymodel
spec:
storageUri: "gcs://my-bucket/my-path/my-pytorch-model"
secretName: "gcs-bucket"
requirements:
- pytorch
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
seldon model load -f ./models/tfsimple1.yaml{}
seldon model status tfsimple1 -w ModelAvailable | jq -M .{}
seldon model infer tfsimple1 --inference-host ${INFER_ENDPOINT} \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
}
seldon model infer tfsimple1 --inference-mode grpc --inference-host ${INFER_ENDPOINT} \
'{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'{"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
curl http://${INFER_ENDPOINT}/v2/models/tfsimple1/infer -H "Content-Type: application/json" -H "seldon-model: tfsimple1" \
-d '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'{"model_name":"tfsimple1_1","model_version":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":[1,16],"data":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]},{"name":"OUTPUT1","datatype":"INT32","shape":[1,16],"data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}]}
grpcurl -d '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimple1 \
${INFER_ENDPOINT} inference.GRPCInferenceService/ModelInfer{
"modelName": "tfsimple1_1",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
]
}
],
"rawOutputContents": [
"AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="
]
}
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
seldon pipeline load -f ./pipelines/tfsimple.yaml{}
seldon pipeline status tfsimple -w PipelineReady{"pipelineName":"tfsimple","versions":[{"pipeline":{"name":"tfsimple","uid":"cg5fm6c6dpcs73c4qhe0","version":1,"steps":[{"name":"tfsimple1"}],"output":{"steps":["tfsimple1.outputs"]},"kubernetesMeta":{}},"state":{"pipelineVersion":1,"status":"PipelineReady","reason":"created pipeline","lastChangeTimestamp":"2023-03-10T09:40:41.317797761Z","modelsReady":true}}]}
seldon pipeline infer tfsimple --inference-host ${INFER_ENDPOINT} \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-mode grpc --inference-host ${INFER_ENDPOINT} \
'{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'{"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
curl http://${INFER_ENDPOINT}/v2/models/tfsimple1/infer -H "Content-Type: application/json" -H "seldon-model: tfsimple.pipeline" \
-d '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'{"model_name":"","outputs":[{"data":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32],"name":"OUTPUT0","shape":[1,16],"datatype":"INT32"},{"data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"name":"OUTPUT1","shape":[1,16],"datatype":"INT32"}]}
grpcurl -d '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimple.pipeline \
${INFER_ENDPOINT} inference.GRPCInferenceService/ModelInfer{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
]
}
],
"rawOutputContents": [
"AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="
]
}
seldon pipeline unload tfsimple
seldon model unload tfsimple1{}
{}
cat kind-config.yaml
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
extraMounts:
- hostPath: /tmp/models
containerPath: /models
mkdir -p /tmp/models
gsutil cp -r gs://seldon-models/mlserver/iris /tmp/models
cat pvc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-path-immediate
provisioner: rancher.io/local-path
reclaimPolicy: Delete
mountOptions:
- debug
volumeBindingMode: Immediate
---
kind: PersistentVolume
apiVersion: v1
metadata:
name: ml-models-pv
namespace: seldon-mesh
labels:
type: local
spec:
storageClassName: local-path-immediate
capacity:
storage: 1Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/models"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: ml-models-pvc
namespace: seldon-mesh
spec:
storageClassName: local-path-immediate
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
selector:
matchLabels:
type: local
cat server.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-pvc
spec:
serverConfig: mlserver
extraCapabilities:
- "pvc"
podSpec:
volumes:
- name: models-pvc
persistentVolumeClaim:
claimName: ml-models-pvc
containers:
- name: rclone
volumeMounts:
- name: models-pvc
mountPath: /var/models
cat ./iris.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "/var/models/iris"
requirements:
- sklearn
- pvc
kubectl create -f iris.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
kubectl get model iris -n ${NAMESPACE} -o jsonpath='{.status}' | jq -M .
{
"conditions": [
{
"lastTransitionTime": "2022-12-24T11:04:37Z",
"status": "True",
"type": "ModelReady"
},
{
"lastTransitionTime": "2022-12-24T11:04:37Z",
"status": "True",
"type": "Ready"
}
],
"replicas": 1
}
curl -k http://${MESH_IP}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "predict",
"datatype": "FP32",
"shape": [1,4],
"data": [[1,2,3,4]]
}
]
}' | jq -M .
seldon model infer iris --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "dc032bcc-3f4e-4395-a2e4-7c1e3ef56e9e",
"parameters": {
"content_type": null,
"headers": null
},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": null,
"data": [
2
]
}
]
}
curl -k http://${MESH_IP}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"model_name": "iris",
"inputs": [
{
"name": "input",
"datatype": "FP32",
"shape": [1,4],
"data": [1,2,3,4]
}
]
}' | jq -M .
seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP}:80 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl delete -f ./iris.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
# samples/pipelines/cifar10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: cifar10-production
spec:
steps:
- name: cifar10
- name: cifar10-outlier
- name: cifar10-drift
batch:
size: 20
output:
steps:
- cifar10
- cifar10-outlier.outputs.is_outlier
seldon model load -f ./models/tfsimple1.yaml
seldon model status tfsimple1 -w ModelAvailable{}
{}
kubectl apply -f ./models/tfsimple2.yaml -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model tfsimple2 -n ${NAMESPACE}model.mlops.seldon.io/tfsimple2 created
model.mlops.seldon.io/tfsimple2 condition met
seldon model load -f ./models/tfsimple2.yaml
seldon model status tfsimple2 -w ModelAvailable | jq -M .{}
{}
%env INFER_REST_ENDPOINT=http://0.0.0.0:9000
%env INFER_GRPC_ENDPOINT=0.0.0.0:9000
%env SELDON_SCHEDULE_HOST=0.0.0.0:9004env: INFER_REST_ENDPOINT=http://0.0.0.0:9000
env: INFER_GRPC_ENDPOINT=0.0.0.0:9000
env: SELDON_SCHEDULE_HOST=0.0.0.0:9004
#%env INFER_REST_ENDPOINT=http://172.19.255.1:80
#%env INFER_GRPC_ENDPOINT=172.19.255.1:80
#%env SELDON_SCHEDULE_HOST=172.19.255.2:9004
cat ./pipelines/tfsimples.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReadyERROR:
Code: Unimplemented
Message:
kubectl apply -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}
pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=300s pipeline tfsimples -n ${NAMESPACE}
seldon pipeline load -f ./pipelines/tfsimples.yaml
seldon pipeline status tfsimples -w PipelineReady{"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"ciepit2i8ufs73flaitg", "version":1, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:47:16.365934922Z"}}]}
seldon pipeline status tfsimples | jq .versions[0].state.modelsReadynull
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady{
}
kubectl apply -f ./models/tfsimple1.yaml -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple1 condition met
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady{
}
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady{
"ready": true
}
seldon pipeline status tfsimples | jq .versions[0].state.modelsReadytrue
seldon pipeline unload tfsimples
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReadyERROR:
Code: Unimplemented
Message:
seldon pipeline status tfsimples | jq .versions[0].state.modelsReadytrue
seldon pipeline load -f ./pipelines/tfsimples.yaml
seldon pipeline status tfsimples -w PipelineReady{"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"ciepj5qi8ufs73flaiu0", "version":1, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:47:51.626155116Z", "modelsReady":true}}]}
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady{
"ready": true
}
seldon pipeline status tfsimples | jq .versions[0].state.modelsReadytrue
seldon model unload tfsimple1
seldon model unload tfsimple2
seldon pipeline status tfsimples | jq .versions[0].state.modelsReady
null
seldon pipeline unload tfsimples
import os
os.environ["NAMESPACE"] = "seldon-mesh"
MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP'172.19.255.1'
kubectl create -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=1s pipeline --all -n ${NAMESPACE}error: timed out waiting for the condition on pipelines/tfsimples
kubectl get pipeline tfsimples -o jsonpath='{.status.conditions[0]}' -n ${NAMESPACE}{"lastTransitionTime":"2022-11-14T10:25:31Z","status":"False","type":"ModelsReady"}
kubectl create -f ./models/tfsimple1.yaml -n ${NAMESPACE}
kubectl create -f ./models/tfsimple2.yaml -n ${NAMESPACE}model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n ${NAMESPACE}pipeline.mlops.seldon.io/tfsimples condition met
kubectl get pipeline tfsimples -o jsonpath='{.status.conditions[0]}' -n ${NAMESPACE}{"lastTransitionTime":"2022-11-14T10:25:49Z","status":"True","type":"ModelsReady"}
kubectl delete -f ./models/tfsimple1.yaml -n ${NAMESPACE}
kubectl delete -f ./models/tfsimple2.yaml -n ${NAMESPACE}
kubectl delete -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
pipeline.mlops.seldon.io "tfsimples" deleted
seldon model status iris -w ModelAvailable
seldon model status iris2 -w ModelAvailable
seldon model status iris3 -w ModelAvailable{}
{}
{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::50]
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model.yaml{}
seldon experiment status experiment-sample -w | jq -M .{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::48 :iris_1::52]
cat ./experiments/ab-default-model2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris3
weight: 50
seldon experiment start -f ./experiments/ab-default-model2.yaml{}
seldon experiment status experiment-sample -w | jq -M .{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::42 :iris_1::58]
seldon experiment stop experiment-sample{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::50]
seldon model unload iris
seldon model unload iris2
seldon model unload iris3{}
{}
{}
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml
seldon model load -f ./models/sklearn3.yaml{}
{}
{}
seldon model status iris -w ModelAvailable
seldon model status iris2 -w ModelAvailable
seldon model status iris3 -w ModelAvailable{}
{}
{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::50]
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model.yaml{}
seldon experiment status experiment-sample -w | jq -M .{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::51 :iris_1::49]
cat ./experiments/ab-default-model3.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris3
candidates:
- name: iris3
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model3.yaml{}
seldon experiment status experiment-sample -w | jq -M .{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::25 :iris3_1::25]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon experiment stop experiment-sample{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::50]
seldon model unload iris
seldon model unload iris2
seldon model unload iris3{}
{}
{}
The default configuration is shown below.
In order to understand how pipelines will behave in a production setting, it is first helpful to isolate testing to the individual models within the pipeline. By obtaining the maximum throughput for each model given the infrastructure it runs on (e.g. GPU specs) and its configuration (e.g. how many workers are used), users will gain a better understanding of which models might limit the performance of the pipeline around them. Given the performance profiles of models within a pipeline, it is recommended to optimize for the desired performance of those individual models first (see here), ensuring they have the right number of replicas and are running on the right infrastructure in order to achieve the required level of performance. Similarly, once a model within a pipeline is identified as a bottleneck, refer back to the models section to optimize the performance of that model.
The performance behavior of Seldon Core 2 pipelines is more complex compared to individual models. Inference request latency can be broken down into:
The sum of the latencies for each model in the critical path of the pipeline (the one containing the bottleneck)
The per-stage data processing overhead caused by pipeline-specific operations (data movement, copying, writing to Kafka topics, etc.)
To simplify, first, we can consider a linear pipeline that consists of a chain of sequential components:
The maximum throughput achievable through the pipeline is the minimum of the maximum throughputs achievable by each individual model in the pipeline, given a number of workers. If, say, the slowest step saturates at 200 RPS per replica while every other step can sustain 500 RPS, the pipeline as a whole will not exceed 200 RPS until that step is scaled. Exceeding this maximum will create a bottleneck in processing, degrading performance for the pipeline. To prevent bottlenecking from a given step in your pipeline, refer back to the models section to optimize the performance of that model. For example, you can:
Increase the resources available to a model’s server, as well as the number of workers
Increase the number of model replicas among which the load is balanced (autoscaling can help set the correct number)
In the case of a more complex, non-linear pipeline, the first step would be to identify the critical path within the pipeline (the path containing the model whose throughput gets saturated first), and then, based on that critical path, follow the same steps above for any model in that path that creates a bottleneck. There will always be a bottleneck model in a pipeline; the goal is to balance the number of replicas of each model and/or the MLServer resources together with the number of workers so that each stage in the pipeline is able to handle the request throughput with as little queueing as possible between the pipeline stages, therefore reducing overall latency.
This section covers various aspects of optimizing model performance in Seldon Core 2, from initial load testing to infrastructure setup and inference optimization. Each subsection provides detailed guidance on different aspects of model performance tuning:
Learn how to conduct effective load testing to understand your model's performance characteristics:
Determining load saturation points
Understanding closed-loop vs. open-loop testing
Determining the right number of replicas based on your configuration (model, infrastructure, etc.)
Setting up reproducible test environments
Interpreting test results for autoscaling configuration
Explore different approaches to optimize inference performance:
Choosing between gRPC and REST protocols
Implementing adaptive batching
Optimizing input dimensions
Configuring parallel processing with workers
Understand how to configure the underlying infrastructure for optimal model performance:
Choosing between CPU and GPU deployments
Setting appropriate CPU specifications
Configuring thread affinity
Managing memory allocation
Each of these aspects plays a crucial role in achieving optimal model performance. We recommend starting with load testing to establish a baseline, then using the insights gained to inform your inference optimization and infrastructure setup strategies.
Learn how to create and manage ML inference pipelines in Seldon Core, including model chaining, tensor mapping, and conditional logic.
Pipelines allow models to be connected into flows of data transformations. This allows more complex machine learning pipelines to be created with multiple models, feature transformations and monitoring components such as drift and outlier detectors.
The simplest way to create Pipelines is by defining them with the Pipeline custom resource. This format is accepted by our Kubernetes implementation and can also be used locally via our seldon CLI.
Internally, in both cases, Pipelines are created via our scheduler gRPC service. Advanced users could submit Pipelines directly using this gRPC service.
Learn how to manage model artifacts in Seldon Core, including storage, versioning, and deployment workflows.
To run your model inside Seldon you must supply an inference artifact that can be downloaded and run on one of the MLServer or Triton inference servers. We list the supported artifacts below in alphabetical order.
Learn about SeldonRuntime, a Kubernetes resource for creating and managing Seldon Core instances in specific namespaces with configurable settings.
The SeldonRuntime resource is used to create an instance of Seldon installed in a particular namespace.
For the definition of SeldonConfiguration above, see the SeldonConfig resource.
The specification above contains overrides for the chosen SeldonConfig.
To override the PodSpec for a given component, the overrides field needs to specify the component
name and the PodSpec needs to specify the container name, along with fields to override.
For instance, the following overrides the resource limits for cpu and memory
Installing kube-prometheus-stack in the same Kubernetes cluster that hosts Seldon Core 2.
kube-prometheus, also known as Prometheus Operator, is a popular open-source project that provides complete monitoring and alerting solutions for Kubernetes clusters. It combines tools and components to create a monitoring stack for Kubernetes environments.
Seldon Core 2, along with any deployed models, automatically exposes metrics to Prometheus. By default, certain alerting rules are pre-configured, and an alertmanager instance is included.
You can install kube-prometheus to monitor Seldon components, and ensure that the appropriate monitoring resources (for example, the PodMonitors described later on this page) are created so that Seldon Core 2 metrics are scraped.
type SeldonConfigSpec struct {
Components []*ComponentDefn `json:"components,omitempty"`
Config SeldonConfiguration `json:"config,omitempty"`
}
type SeldonConfiguration struct {
TracingConfig TracingConfig `json:"tracingConfig,omitempty"`
KafkaConfig KafkaConfig `json:"kafkaConfig,omitempty"`
AgentConfig AgentConfiguration `json:"agentConfig,omitempty"`
ServiceConfig ServiceConfig `json:"serviceConfig,omitempty"`
}
type ServiceConfig struct {
GrpcServicePrefix string `json:"grpcServicePrefix,omitempty"`
ServiceType v1.ServiceType `json:"serviceType,omitempty"`
}
type KafkaConfig struct {
BootstrapServers string `json:"bootstrap.servers,omitempty"`
ConsumerGroupIdPrefix string `json:"consumerGroupIdPrefix,omitempty"`
Debug string `json:"debug,omitempty"`
Consumer map[string]intstr.IntOrString `json:"consumer,omitempty"`
Producer map[string]intstr.IntOrString `json:"producer,omitempty"`
Streams map[string]intstr.IntOrString `json:"streams,omitempty"`
TopicPrefix string `json:"topicPrefix,omitempty"`
}
type AgentConfiguration struct {
Rclone RcloneConfiguration `json:"rclone,omitempty" yaml:"rclone,omitempty"`
}
type RcloneConfiguration struct {
ConfigSecrets []string `json:"config_secrets,omitempty" yaml:"config_secrets,omitempty"`
Config []string `json:"config,omitempty" yaml:"config,omitempty"`
}
type TracingConfig struct {
Disable bool `json:"disable,omitempty"`
OtelExporterEndpoint string `json:"otelExporterEndpoint,omitempty"`
OtelExporterProtocol string `json:"otelExporterProtocol,omitempty"`
Ratio string `json:"ratio,omitempty"`
}
type ComponentDefn struct {
// +kubebuilder:validation:Required
Name string `json:"name"`
Labels map[string]string `json:"labels,omitempty"`
Annotations map[string]string `json:"annotations,omitempty"`
Replicas *int32 `json:"replicas,omitempty"`
PodSpec *v1.PodSpec `json:"podSpec,omitempty"`
VolumeClaimTemplates []PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"`
}Optimizing your model artefact


At a minimum you should define the SeldonConfig to use as a base for this install, for example to install in the seldon-mesh namespace with the SeldonConfig named default:
The helm chart seldon-core-v2-runtime allows easy creation of this resource and associated default
Servers for an installation of Seldon in a particular namespace.
When a SeldonConfig resource changes, any SeldonRuntime resources that reference the changed SeldonConfig will also be updated immediately. If this behaviour is not desired, you can set spec.disableAutoUpdate in the SeldonRuntime resource so that it is not updated immediately, but only when the SeldonRuntime itself or any owned resource changes.
type SeldonRuntimeSpec struct {
SeldonConfig string `json:"seldonConfig"`
Overrides []*OverrideSpec `json:"overrides,omitempty"`
Config SeldonConfiguration `json:"config,omitempty"`
// +Optional
// If set then when the referenced SeldonConfig changes we will NOT update the SeldonRuntime immediately.
// Explicit changes to the SeldonRuntime itself will force a reconcile though
DisableAutoUpdate bool `json:"disableAutoUpdate,omitempty"`
}
type OverrideSpec struct {
Name string `json:"name"`
Disable bool `json:"disable,omitempty"`
Replicas *int32 `json:"replicas,omitempty"`
ServiceType v1.ServiceType `json:"serviceType,omitempty"`
PodSpec *PodSpec `json:"podSpec,omitempty"`
}apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
name: seldon
namespace: seldon-mesh
spec:
overrides:
- name: hodometer
podSpec:
containers:
- name: hodometer
resources:
limits:
memory: 64Mi
cpu: 20m
seldonConfig: default
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
name: seldon
namespace: seldon-mesh
spec:
seldonConfig: default
steps allow you to specify the models you want to combine into a pipeline. Each step name will correspond to a model of the same name. These models will need to have been deployed and be available for the Pipeline to function; however, Pipelines can be deployed before or at the same time as you deploy the underlying models.
steps.inputs allow you to specify the inputs to this step.
output.steps allow you to specify the output of the Pipeline. A pipeline can have multiple paths, including flows of data that do not reach the output, e.g. drift detection steps. However, if you wish to call your Pipeline in a synchronous manner via REST/gRPC then an output must be present so the Pipeline can be treated as a function.
Model step inputs are defined with a dot notation of the form:
<stepName>|<pipelineName>.<inputs|outputs>.<tensorName>
Inputs with just a step name will be assumed to be step.outputs.
The default payloads for Pipelines is the V2 protocol which requires named tensors as inputs and outputs from a model. If you require just certain tensors from a model you can reference those in the inputs, e.g. mymodel.outputs.t1 will reference the tensor t1 from the model mymodel.
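For example, a hypothetical pipeline where a second model consumes only the t1 tensor from mymodel could be written as:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tensor-select
spec:
  steps:
  - name: mymodel
  - name: mymodel2
    inputs:
    - mymodel.outputs.t1      # only tensor t1 from mymodel is passed to mymodel2
  output:
    steps:
    - mymodel2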
For the specification, see the V2 protocol documentation.
The simplest Pipeline chains models together: the output of one model goes into the input of the next. This will work out of the box if the output tensor names from a model match the input tensor names for the one being chained to. If they do not then the tensorMap construct presently needs to be used to define the mapping explicitly, e.g. see below for a simple chained pipeline of two tfsimple example models:
In the above we rename tensor OUTPUT0 to INPUT0 and OUTPUT1 to INPUT1. This allows these models to be chained together. The shape and data type of the tensors need to match as well.
This example can be found in the pipeline examples.
Joining allows us to combine outputs from multiple steps as input to a new step.
Caption: "Joining the outputs of two models into a third model. The dashed lines signify model outputs that are not captured in the output of the pipeline."
Here we pass the pipeline inputs to two models and then take one output tensor from each and pass them to the final model. We use the same tensorMap technique to rename tensors as discussed in the previous section.
Joins can have a join type which can be specified with inputsJoinType and can take the values:
inner: require all inputs to be available to join.
outer: wait for joinWindowMs to join any inputs, ignoring any inputs that have not sent any data by that point. This means this step of the pipeline is guaranteed to have a latency of at least joinWindowMs.
any: wait for any of the specified data sources.
This example can be found in the pipeline examples.
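A sketch of a join step using these fields, reusing the tfsimple models from the join example and an illustrative 500 ms window:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: join-outer
spec:
  steps:
  - name: tfsimple1
  - name: tfsimple2
  - name: tfsimple3
    inputs:
    - tfsimple1.outputs.OUTPUT0
    - tfsimple2.outputs.OUTPUT1
    tensorMap:
      tfsimple1.outputs.OUTPUT0: INPUT0
      tfsimple2.outputs.OUTPUT1: INPUT1
    inputsJoinType: outer     # wait joinWindowMs, then join whatever inputs have arrived
    joinWindowMs: 500         # illustrative window in milliseconds
  output:
    steps:
    - tfsimple3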
Pipelines can create conditional flows via various methods. We will discuss each in turn.
The simplest way is to create a model that outputs different named tensors based on its decision. This way downstream steps can be dependent on different expected tensors. An example is shown below:
Caption: "Pipeline with a conditional output model. The model conditional only outputs one of the two tensors, so only one path through the graph (red or blue) is taken by a single request"
In the above we have a step conditional that either outputs a tensor named OUTPUT0 or a tensor named OUTPUT1. The mul10 step depends on an output in OUTPUT0 while the add10 step depends on an output from OUTPUT1.
Note that we also have a final Pipeline output step that does an any join on these two models, essentially outputting from the pipeline whichever data arrives from either model. This type of Pipeline can be used for multi-armed bandit solutions where you want to route traffic dynamically.
This example can be found in the pipeline examples.
It's also possible to abort pipelines when an error is produced, in effect creating a condition. This is illustrated below:
This Pipeline runs normally or throws an error based on whether the input tensors have certain values.
Sometimes you want to run a step if an output is received from a previous step but not to send the data from that step to the model. This is illustrated below:
Caption: "A pipeline with a single trigger. The model tfsimple3 only runs if the model check returns a tensor named OUTPUT. The green edge signifies that this is a trigger and not an additional input to tfsimple3. The dashed lines signify model outputs that are not captured in the output of the pipeline."
In this example the last step tfsimple3 runs only if there are outputs from tfsimple1 and tfsimple2 but also data from the check step. However, if the step tfsimple3 is run it only receives the join of data from tfsimple1 and tfsimple2.
This example can be found in the pipeline examples.
You can also define multiple triggers which need to fire based on a particular join type. For example:
Caption: "A pipeline with multiple triggers and a trigger join of type any. The pipeline has four inputs, but three of these are optional (signified by the dashed borders)."
Here the mul10 step is run if data is seen on the pipeline inputs in the ok1 or ok2 tensors based on the any join type. If data is seen on ok3 then the add10 step is run.
If we changed the triggersJoinType for mul10 to inner then both ok1 and ok2 would need to appear before mul10 is run.
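A sketch of such a pipeline, following the tensor and model names used in the caption above (the output join shown is an assumption):
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: trigger-joins
spec:
  steps:
  - name: mul10
    inputs:
    - trigger-joins.inputs.INPUT
    triggers:
    - trigger-joins.inputs.ok1
    - trigger-joins.inputs.ok2
    triggersJoinType: any     # run mul10 if data arrives on either ok1 or ok2
  - name: add10
    inputs:
    - trigger-joins.inputs.INPUT
    triggers:
    - trigger-joins.inputs.ok3
  output:
    steps:
    - mul10
    - add10
    stepsJoin: any            # assumed field name for the output join type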
Pipelines by default can be accessed synchronously via http/grpc or asynchronously via the Kafka topic created for them. However, it's also possible to create a pipeline to take input from one or more other pipelines by specifying an input section. If for example we already have the tfsimple pipeline shown below:
We can create another pipeline which takes its input from this pipeline, as shown below:
Caption: "A pipeline taking as input the output of another pipeline."
In this way pipelines can be built to extend existing running pipelines to allow extensibility and sharing of data flows.
The spec follows the same form as for a step, except that references to other pipelines are contained in the externalInputs section, which takes the form of pipeline or pipeline.step references:
<pipelineName>.(inputs|outputs).<tensorName>
<pipelineName>.(step).<stepName>.<tensorName>
Tensor names are optional and only needed if you want to take just one tensor from an input or output.
There is also an externalTriggers section which allows triggers from other pipelines.
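A sketch of a pipeline that extends the tfsimple pipeline via externalInputs; the pipeline name and tensor mapping are illustrative:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimple-extended     # illustrative name
spec:
  input:
    externalInputs:
    - tfsimple.outputs        # consume the output of the existing tfsimple pipeline
    tensorMap:
      tfsimple.outputs.OUTPUT0: INPUT0
      tfsimple.outputs.OUTPUT1: INPUT1
  steps:
  - name: tfsimple2
  output:
    steps:
    - tfsimple2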
Further examples can be found in the pipeline-to-pipeline examples.
Present caveats:
Circular dependencies are not presently detected.
Pipeline status is local to each pipeline.
Internally, Pipelines are implemented using Kafka. Each input and output to a pipeline step has an associated Kafka topic. This has many advantages and makes auditing, replay and debugging easier, as data is preserved from every step in your pipeline.
Tracing allows you to monitor the processing latency of your pipelines.
As each request to a pipeline moves through the steps, its data will appear in the input and output topics. This allows a full audit of every transformation to be carried out.
DALI: Triton, tag dali (TBC)
Huggingface: MLServer, tag huggingface
LightGBM: MLServer, tag lightgbm
MLFlow: MLServer, tag mlflow
ONNX: Triton, tag onnx
OpenVino: Triton, tag openvino (TBC)
Custom Python: MLServer, tags python and mlserver
Custom Python: Triton, tags python and triton
PyTorch: Triton, tag pytorch
SKLearn: MLServer, tag sklearn
Spark Mlib: MLServer, tag spark-mlib (TBC)
Tensorflow: Triton, tag tensorflow
TensorRT: Triton, tag tensorrt (TBC)
Triton FIL: Triton, tag fil (TBC)
XGBoost: MLServer, tag xgboost
For many machine learning artifacts you can simply save them to a folder and load them into Seldon Core 2. Details are given below as well as a link to creating a custom model settings file if needed.
Alibi-Detect
Alibi-Explain
DALI: Follow the Triton docs to create a config.pbtxt and model folder with the artifact.
Huggingface: Create an MLServer model-settings.json with the Huggingface model required.
For MLServer targeted models you can create a model-settings.json file to help MLServer load your model and place this alongside your artifact. See the MLServer project for details.
For Triton inference server models you can create a configuration config.pbtxt file alongside your artifact.
The tag field represents the tag you need to add to the requirements part of the Model spec for
your artifact to be loaded on a compatible server. e.g. for an sklearn model:
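The sklearn-iris sample spec shows the tag in place:
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  memory: 100Ki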
Alibi-Detect: MLServer, tag alibi-detect
Alibi-Explain: MLServer, tag alibi-explain
kube-prometheus provides custom resources such as ServiceMonitor, PodMonitor, and PrometheusRule. Monitoring the model deployments in Seldon Core 2 involves:
Install Seldon Core 2.
Install Ingress Controller.
Install Grafana in the namespace seldon-monitoring.
Create a namespace for the monitoring components of Seldon Core 2.
Create a YAML file to specify the initial configuration. For example, create the prometheus-values.yaml file. Use your preferred text editor to create and save the file with the following content:
Note: Make sure to include metric-labels-allowlist: pods=[*] in the Helm values file. If you are using your own Prometheus Operator installation, ensure that the pods labels, particularly app.kubernetes.io/managed-by=seldon-core, are part of the collected metrics. These labels are essential for calculating deployment usage rules.
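A minimal sketch of such a values file; the exact keys depend on the chart and version you install (the kube-state-metrics metricLabelsAllowlist key and the fullnameOverride value shown here are assumptions):
# prometheus-values.yaml - minimal sketch; adjust keys to your chart version
fullnameOverride: prometheus           # illustrative
kube-state-metrics:
  metricLabelsAllowlist:
  - pods=[*]                           # exports pod labels such as app.kubernetes.io/managed-by=seldon-core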
Change to the directory that contains the prometheus-values.yaml file and run the following command to install version 9.5.12 of kube-prometheus.
When the installation is complete, you should see this:
Check the status of the installation.
When the installation is complete, you should see this:
You can access Prometheus from outside the cluster by running the following commands:
You can access Alertmanager from outside the cluster by running the following commands:
Apply the Custom RBAC Configuration settings for kube-prometheus.
Configure metrics collection by creating the following PodMonitor resources.
When the resources are created, you should see this:
You may now be able to check the status of Seldon components in Prometheus:
Open your browser and navigate to http://127.0.0.1:9090/ to access Prometheus UI from outside the cluster.
Go to Status and select Targets.
The status of all the endpoints and the scrape details are displayed.
You can view the metrics in a Grafana dashboard after you set Prometheus as the data source and import the seldon.json dashboard located at seldon-core/v2.8.2/prometheus/dashboards in the GitHub repository.
In this notebook, we will demonstrate how to deploy a production-ready AI application with Seldon Core 2. This application will have two components - an sklearn model and a preprocessor written in Python - leveraging Core 2 Pipelines to connect the two. Once deployed, users will have an endpoint available to call the deployed application. The inference logic can be visualized with the following diagram:
To do this we will:
Set up a Server resource to deploy our models
Deploy an sklearn Model
Deploy a multi-step Pipeline, including a preprocessing step that will be run before calling our model.
Call our inference endpoint, and observe data within our pipeline
As part of the Core 2 installation, you will have installed MLServer and Triton Servers:
The Server resource outlines attributes (dependency requirements, underlying infrastructure) for the runtimes that the models you deploy will run on. By default, MLServer supports the following frameworks out of the box: alibi-detect, alibi-explain, huggingface, lightgbm, mlflow, python, sklearn, spark-mlib, xgboost.
In this example, we will create a new custom MLServer Server that we will tag with income-classifier-deps under capabilities (see the Server docs) in order to define which Models will be matched to this Server. We will deploy both our model (sklearn) and our preprocessor (python) on the same Server. This is done using the manifest below:
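A sketch of such a Server manifest, mirroring the PVC Server example earlier in this document; the Server name is illustrative:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-income-classifier     # illustrative name
spec:
  serverConfig: mlserver
  extraCapabilities:
  - "income-classifier-deps"           # Models requiring this tag will be scheduled onto this Server
  replicas: 1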
Now we will deploy a model - in this case, we are deploying a classification model that has been trained to take 12 features as input, and output [0] or [1], representing a [Yes] or [No] prediction of whether or not an adult with certain values for the 12 features is making more than $50K/yr. This model was trained using the Census Income (or "Adult") Dataset. Extraction was done by Barry Becker from the 1994 Census database. See the dataset documentation for more details.
The model artefact is currently stored in a Seldon Google Cloud Storage bucket - the contents of the relevant folder are shown below. Alongside our model artefact, we have a model-settings.json file to help locate and load the model. For more information on the inference artefacts we support and how to configure them, see the inference artifacts documentation.
In our Model manifest below, we point to the location of the model artefact using the storageUri field. You will also notice that we have defined income-classifier-deps under requirements. This will match the Model to the Server we deployed above, as Models will only be deployed onto Servers that have capabilities that match the appropriate requirements defined in the Model manifest.
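A sketch of the Model manifest described here; the storageUri is a placeholder for the bucket path mentioned above:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: income
spec:
  storageUri: "gs://<bucket>/<path-to-income-classifier>"   # placeholder for the sample bucket path
  requirements:
  - income-classifier-deps             # matches the Server capability tag defined earlier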
In order to deploy the model, we will apply the manifest to our cluster:
We now have a deployed model, with an associated endpoint.
The endpoint that has been exposed by the above deployment will use an IP from our service mesh that we can obtain as follows:
Requests are made using the Open Inference Protocol. More details on this specification can be found in our documentation, or in the API documentation generated by our protocol buffers in the case of gRPC usage. This protocol is also shared by Triton Inference Server for serving deep learning models.
We are now ready to send a request!
We can see above that the model returned a 'data': [0] in the output. This is the prediction of the model, indicating that an individual with the attributes provided is most likely making more than $50K/yr.
Often we'd like to deploy AI applications that are more complex than just an individual model. For example, around our model we could consider deploying pre or post-processing steps, custom logic, other ML models, or drift and outlier detectors.
In this example, we will create a preprocessing step that extracts numerical values from a text file for the model to use as input. This will be implemented with custom logic using Python, and deployed as a custom model with MLServer:
Before deploying the preprocessing step with Core 2, we will test it locally:
Now that we've tested the python script locally, we will deploy the preprocessing step as a Model. This will allow us to connect it to our sklearn model using a Seldon Pipeline. To do so, we store in our cloud storage an inference artefact (in this case, our Python script) alongside a model-settings.json file, similar to the model deployed above.
As with the ML model deployed above, we have defined income-classifier-deps under requirements. This means that both the preprocessor and the model will be deployed using the same Server, enabling consolidation in terms of the resources and overheads used (for more about Multi-Model Serving, see the relevant documentation).
We've now deployed the preprocessing step! Let's test it out by calling the endpoint for it:
Now that we have our preprocessor and model deployed, we will chain them together with a pipeline.
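A sketch of that Pipeline; the preprocessor step and pipeline names are assumptions, while the tensor mapping follows the description below:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: income-pipeline                # illustrative name
spec:
  steps:
  - name: income-preprocessor          # assumed name of the preprocessing Model
  - name: income
    inputs:
    - income-preprocessor
    tensorMap:
      income-preprocessor.outputs.OUTPUT0: INPUT0
  output:
    steps:
    - income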
The YAML defines two steps in a pipeline (the preprocessor and model), mapping the output of the preprocessor model (OUTPUT0) to the input of the income classification model (INPUT0). Seldon Core will leverage Kafka to communicate between models, meaning that all data is streamed and observable in real time.
Congratulations! You have now deployed a Seldon Pipeline that exposes an endpoint for your ML application 🥳. For more tutorials on how to use Core 2 for various use-cases and requirements, see the examples section.
Learn how to implement model explainability in Seldon Core using Anchor explainers for both tabular and text data
kubectl apply -f ./models/income.yaml -n ${NAMESPACE}
pipeline.mlops.seldon.io/income created
kubectl wait --for condition=ready --timeout=300s model income -n ${NAMESPACE}
model.mlops.seldon.io/income condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/income/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'{
"model_name": "income_1",
"model_version": "1",
"id": "c65b8302-85af-4bac-aac5-91e3bedebee8",
"parameters": {},
"outputs": [
When looking at optimizing the latency or throughput of deployed models or pipelines, it is important to consider different approaches to the execution of inference workloads. Below are some tips on different approaches that may be relevant depending on the requirements of a given use-case:
gRPC may be more efficient than REST when your inference request payload benefits from a binary serialization format.
Grouping multiple real-time requests into small batches can improve throughput while maintaining acceptable latency. For more information on adaptive batching in MLServer, see here.
Reducing input size by reducing the dimensions of your inputs can speed up processing. This also reduces (de)serialization overhead that might be needed around model deployments.
For models deployed with MLServer, adjust parallel_workers in line with the number of CPU cores assigned to the Server pod. This is most effective for synchronous models, CPU-bound asynchronous models, and I/O-bound asynchronous models with high throughput. Proper tuning here can improve throughput, stabilize latency distribution, and potentially reduce overall latency due to reduced queuing. This is outlined in more detail in the section below.
When deploying models using MLServer, it is possible to execute inference workloads via a pool of workers running in separate processes (see the MLServer docs).
To assess the throughput behavior of individual model(s) it is first helpful to identify the maximum throughput possible with one worker (one-worker max throughput) and then the maximum throughput possible with N workers (n_workers max throughput). It is important to note that the n_workers max throughput is not simply n_workers × one-worker max throughput because workers run in separate processes, and the OS can only run as many processes in parallel as there are available CPUs. If all workers are CPU-bound, then setting n_workers higher than the number of CPU cores will be ineffective, as the OS will be limited by the number of CPU cores in terms of processes available to parallelize.
However, if some workers are waiting for either I/O or for a GPU, then setting n_workers to a value higher than the number of CPUs can help increase throughput, as some workers can then continue processing while others wait. Generally, if a model is receiving inference requests at a throughput that is lower than the one-worker max throughput, then adding additional workers will not help increase throughput or decrease latency. Similarly, if MLServer is configured with n_workers (on a pod with more than n CPUs) and the request rate is below the n_workers max throughput, latency remains constant - the system is below saturation.
Given the above, it is worth considering increasing the number of workers available to process data for a given deployed model when the system becomes saturated. Increasing the number of workers up to or slightly above the number of CPU cores available may reduce latency when the system is saturated, provided the MLServer pod has sufficient spare CPU. The effect of increasing workers also depends on whether the model is CPU-bound or uses async versus blocking operations, where CPU-bound and blocking models would benefit most. When the system is saturated with requests, those requests will queue. Increasing workers aims to run enough tasks in parallel to cope with higher throughput while minimizing queuing.
If optimizing for speed, the model artefact itself can have a big impact on performance. The speed at which an ML model can return results given input is based on the model’s architecture, model size, the precision of the model’s weights, and input size. In order to reduce the inherent complexity in the data processing required to execute an inference due to the attributes of a model, it is worth considering:
Model pruning to reduce parameters that may be unimportant. This can help reduce model size without having a big impact on the quality of the model’s outputs.
Quantization to reduce the computational and memory overheads of running inference by using model weights and activations with lower precision data types.
Dimensionality reduction of inputs to reduce the complexity of computation.
Efficient model architectures
Overall performance of your models will be constrained by the specifications of the underlying hardware which it is run on, and how it is leveraged. Choosing between CPUs and GPUs depends on the latency and throughput requirements for your use-case, as well as the type of model you are putting into production:
CPUs are generally sufficient for lightweight models, such as tree-based models, regression models, or small neural networks.
GPUs are recommended for deep learning models, large-scale neural networks, and large language models (LLMs) where lower latency is critical. Models with high matrix computation demands (like transformers or CNNs) benefit significantly from GPU acceleration.
If cost is a concern, it is recommended to start with CPUs and use profiling or performance monitoring tools (e.g. py-spy or scalene) to identify CPU bottlenecks. Based on these results, you can transition to GPUs as needed.
If working with models that will receive many concurrent requests in production, individual CPUs can often act as bottlenecks when processing data. In these cases, increasing the number of parallel workers can help. This can be configured through your serving solution as described in the section above. It is important to note that when increasing the number of workers available to process concurrent requests, it is best practice to ensure the number of workers is not significantly higher than the number of available CPU cores, in order to reduce contention. Each worker executes in its own process. This is most relevant for synchronous models where subsequent processing is blocked on completion of each request.
For more advanced control of CPU utilization, users can configure thread affinity through environment variables that determine how threads are bound to physical processors. For example, KMP_AFFINITY and OMP_NUM_THREADS can be set for technologies that use OpenMP. For more information on thread affinity, refer to your framework's documentation. In general, the ML framework that you're using may have its own recommendations for improving resource usage.
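As an illustrative sketch only (the values are assumptions to tune for your hardware, and the Server and container names are placeholders), these variables can be injected into the inference server container via the Server podSpec:
# Hypothetical example: constrain OpenMP threading for models on this server.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-affinity
spec:
  serverConfig: mlserver
  podSpec:
    containers:
    - name: mlserver
      env:
      - name: OMP_NUM_THREADS      # threads used by each OpenMP-enabled process
        value: "2"
      - name: KMP_AFFINITY         # Intel OpenMP thread-binding policy
        value: "granularity=fine,compact,1,0"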
Finally, increasing the RAM available to your models can improve performance for memory-intensive models, such as those with large parameter counts, high-dimensional data processing, or complex intermediate computations.
Learn how to test and validate metrics collection in Seldon Core locally, including Prometheus setup and Grafana dashboards.
Learn how to leverage Core 2's native autoscaling functionality for Models
In order to set up autoscaling, users should first identify the metric on which they want to scale their models. Seldon Core provides an approach to autoscale models based on Inference Lag, and also supports more custom scaling logic by leveraging HPA, whereby you can use custom metrics to automatically scale Kubernetes resources. This page covers the first approach. Inference Lag is the difference between incoming and outgoing requests in a given period of time. If choosing this approach, it is recommended to configure Models to scale on Inference Lag and, in turn, to configure Servers to scale based on model needs.
This implementation of autoscaling is enabled if Core 2 is installed with the autoscaling.autoscalingModelEnabled Helm value set to true (default is false) and at least one of MinReplicas or MaxReplicas is set in the Model custom resource. Then, according to lag (how far the model "falls behind" in serving inference requests), the system scales the number of replicas, as in the example below.
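For example, the iris sample model could be allowed to scale between one and three replicas on inference lag; the replica bounds here are illustrative values:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  minReplicas: 1   # setting a bound enables lag-based autoscaling for this model
  maxReplicas: 3   # upper bound for scale-up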
Learn how to monitor operational metrics in Seldon Core, including model performance, pipeline health, and system resource usage.
While the system runs, Prometheus collects metrics that enable you to observe various aspects of Seldon Core 2, including throughput, latency, memory, and CPU usage. In addition to the standard Kubernetes metrics scraped by Prometheus, a Grafana dashboard provides a comprehensive system overview.
The Seldon Core 2 metrics that are currently exposed are listed below.
For the agent that sits next to the inference servers:
For the pipeline gateway that handles requests to pipelines:
Many of these metrics are model- and pipeline-level counters and gauges. Some of these metrics are aggregated to speed up the display of graphs. Currently, per-model histogram metrics are not stored for performance reasons. However, per-pipeline histogram metrics are stored.
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: chain
namespace: seldon-mesh
spec:
steps:
- name: model1
- name: model2
inputs:
- model1
output:
steps:
- model2
<stepName>|<pipelineName>.<inputs|outputs>.<tensorName>
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: join
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple3
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-conditional
spec:
steps:
- name: conditional
- name: mul10
inputs:
- conditional.outputs.OUTPUT0
tensorMap:
conditional.outputs.OUTPUT0: INPUT
- name: add10
inputs:
- conditional.outputs.OUTPUT1
tensorMap:
conditional.outputs.OUTPUT1: INPUT
output:
steps:
- mul10
- add10
stepsJoin: any
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: error
spec:
steps:
- name: outlier-error
output:
steps:
- outlier-error
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: joincheck
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: check
inputs:
- tfsimple1.outputs.OUTPUT0
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
triggers:
- check.outputs.OUTPUT
output:
steps:
- tfsimple3
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: trigger-joins
spec:
steps:
- name: mul10
inputs:
- trigger-joins.inputs.INPUT
triggers:
- trigger-joins.inputs.ok1
- trigger-joins.inputs.ok2
triggersJoinType: any
- name: add10
inputs:
- trigger-joins.inputs.INPUT
triggers:
- trigger-joins.inputs.ok3
output:
steps:
- mul10
- add10
stepsJoin: any
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
fullnameOverride: seldon-monitoring
kube-state-metrics:
extraArgs:
metric-labels-allowlist: pods=[*]
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
echo "Alertmanager URL: http://127.0.0.1:9093/"
kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
CUSTOM_RBAC=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2/prometheus/rbac
kubectl apply -f ${CUSTOM_RBAC}/cr.yaml
PODMONITOR_RESOURCE_LOCATION=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2.8.2/prometheus/monitors
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/agent-podmonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/envoy-servicemonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/pipelinegateway-podmonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/server-podmonitor.yaml
cat ./models/income.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
requirements:
- sklearn
LightGBM
Save model to a file with extension .bst.
MLFlow
Use the created artifacts/model folder from your training run.
ONNX
Save your model with the name model.onnx.
OpenVino
Follow the Triton docs to create your model artifacts.
Custom MLServer Python
Create a python file with a class that extends MLModel.
Custom Triton Python
Follow the Triton docs to create your config.pbtxt and associated python files.
PyTorch
Create a Triton config.pbtxt describing inputs and outputs and place traced torchscript in folder as model.pt.
SKLearn
Save model via joblib to a file with extension .joblib or with pickle to a file with extension .pkl or .pickle.
Spark MLlib
Follow the MLServer docs.
Tensorflow
Save model in the "Saved Model" format as model.savedmodel. If using the graphdef format, you will need to create a Triton config.pbtxt and place your model in a numbered sub folder. HDF5 is not supported.
TensorRT
Follow the Triton docs to create your model artifacts.
Triton FIL
Follow the Triton docs to create your model artifacts.
XGBoost
Save model to a file with extension .bst or .json.
Optimized model formats and runtimes like ONNX Runtime, TensorRT, or OpenVINO, which leverage hardware-specific acceleration for improved performance.


This is experimental, and these metrics are expected to evolve to better capture relevant trends as more information becomes available about system usage.
An example to show raw metrics that Prometheus will scrape.
// scheduler/pkg/metrics/agent.go
const (
// Histograms do no include pipeline label for efficiency
modelHistogramName = "seldon_model_infer_api_seconds"
// We use base infer counters to store core metrics per pipeline
modelInferCounterName = "seldon_model_infer_total"
modelInferLatencyCounterName = "seldon_model_infer_seconds_total"
modelAggregateInferCounterName = "seldon_model_aggregate_infer_total"
modelAggregateInferLatencyCounterName = "seldon_model_aggregate_infer_seconds_total"
)
// Agent metrics
const (
cacheEvictCounterName = "seldon_cache_evict_count"
cacheMissCounterName = "seldon_cache_miss_count"
loadModelCounterName = "seldon_load_model_counter"
unloadModelCounterName = "seldon_unload_model_counter"
loadedModelGaugeName = "seldon_loaded_model_gauge"
loadedModelMemoryGaugeName = "seldon_loaded_model_memory_bytes_gauge"
evictedModelMemoryGaugeName = "seldon_evicted_model_memory_bytes_gauge"
serverReplicaMemoryCapacityGaugeName = "seldon_server_replica_memory_capacity_bytes_gauge"
serverReplicaMemoryCapacityWithOverCommitGaugeName = "seldon_server_replica_memory_capacity_overcommit_bytes_gauge"
)
Note: for compatibility of Tritonclient, check this issue.
Note: binary data support in HTTP is blocked by https://github.com/SeldonIO/seldon-core-v2/issues/475.
Load the model.
mlserver_metrics_host="0.0.0.0:9006"
triton_metrics_host="0.0.0.0:9007"
pipeline_metrics_host="0.0.0.0:9009"
seldon model load -f ./models/sklearn-iris-gs.yaml
seldon model status iris -w ModelAvailable | jq -M .
{}
{}
When the system autoscales, the initial model spec is not changed (e.g. the number of replicas), so the user cannot reset the number of replicas back to the initially specified value without first explicitly changing it to a different value. If only replicas is specified by the user, autoscaling of models is disabled and the system deploys exactly that number of replicas of the model, regardless of inference lag.
The scale-up and scale-down logic, and its configuration, is described below:
Scale Up: To trigger scale-up with the approach described above, we use Inference Lag as the metric. Inference Lag is the difference between incoming and outgoing requests in a given time period. If the lag crosses a threshold, a model scale-up event is triggered. This threshold can be defined via the SELDON_MODEL_INFERENCE_LAG_THRESHOLD inference server environment variable, and it applies to all models hosted on the Server where it is configured.
Scale Down: When using model autoscaling managed by Seldon Core, model scale-down events are triggered if a model has not been used for a number of seconds. This is defined via the SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD inference server environment variable.
Rate of metrics calculation: Each agent checks the above stats periodically, and if any model hits the corresponding threshold, the agent sends an event to the scheduler to request model scaling. How often this process executes can be defined via the SELDON_SCALING_STATS_PERIOD_SECONDS inference server environment variable.
Based on the logic above, the scheduler will trigger model autoscaling if:
The model is stable (no state change in the last 5 minutes) and available.
The desired number of replicas is within range. Note that we always keep at least 1 replica of any deployed model, and we rely on overcommit to further reduce the resources used.
When autoscaling of the Servers is not set up, model scale-up is only triggered if there are sufficient server replicas available to load the new model replicas.
If autoscaling models with the approach above, it is recommended to also autoscale servers using Seldon's Server autoscaling, configured by setting MinReplicas and MaxReplicas on the Server CR, as in the sketch below. Without Server autoscaling configured, the required number of servers will not necessarily spin up, even if the desired number of model replicas cannot be fulfilled by the currently provisioned servers.
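A minimal sketch of such a Server follows; the replica bounds are illustrative values to adapt to your workload:
# Server with autoscaling bounds so server replicas can follow model demand.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1   # lower bound on server replicas
  maxReplicas: 5   # upper bound on server replicas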
helm upgrade --install prometheus kube-prometheus \
--version 9.5.12 \
--namespace seldon-monitoring \
--values prometheus-values.yaml \
--repo https://charts.bitnami.com/bitnami
WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
- alertmanager.resources
- blackboxExporter.resources
- operator.resources
- prometheus.resources
- prometheus.thanos.resources
+info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
kubectl rollout status -n seldon-monitoring deployment/seldon-monitoring-operator
Waiting for deployment "seldon-monitoring-operator" rollout to finish: 0 of 1 updated replicas are available...
deployment "seldon-monitoring-operator" successfully rolled out
podmonitor.monitoring.coreos.com/agent created
servicemonitor.monitoring.coreos.com/envoy created
podmonitor.monitoring.coreos.com/pipelinegateway created
podmonitor.monitoring.coreos.com/server created
seldon model unload mnist-pytorch
seldon model load -f ./models/income.yaml
{}
seldon model status income -w ModelAvailable
{}
seldon model infer income \
'{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'{
"model_name": "income_1",
"model_version": "1",
"id": "c65b8302-85af-4bac-aac5-91e3bedebee8",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"data": [
0
]
}
]
}
cat ./models/income-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
explainer:
type: anchor_tabular
modelRef: income
kubectl apply -f ./models/income-explainer.yaml -n ${NAMESPACE}
model.mlops.seldon.io/income-explainer created
kubectl wait --for condition=ready --timeout=300s model income-explainer -n ${NAMESPACE}
model.mlops.seldon.io/income-explainer condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'{
"model_name": "income-explainer_1",
"model_version": "1",
"id": "a22c3785-ff3b-4504-9b3c-199aa48a62d6",
"parameters": {},
"outputs": [
{
"name": "explanation",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"data": [
"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.0\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9518716577540107, \"coverage\": 0.07165109034267912, \"raw\": {\"feature\": [3, 5], \"mean\": [0.7959381044487428, 0.9518716577540107], \"precision\": [0.7959381044487428, 0.9518716577540107], \"coverage\": [0.3037383177570093, 0.07165109034267912], \"examples\": [{\"covered_true\": [[52, 5, 5, 1, 8, 1, 2, 0, 0, 0, 50, 9], [49, 4, 1, 1, 4, 4, 1, 0, 0, 0, 40, 1], [23, 4, 1, 1, 6, 1, 4, 1, 0, 0, 40, 9], [55, 2, 1, 1, 5, 1, 4, 0, 0, 0, 48, 9], [22, 4, 1, 1, 2, 3, 4, 0, 0, 0, 15, 9], [51, 4, 2, 1, 5, 0, 1, 1, 0, 0, 99, 4], [40, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [40, 6, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [50, 5, 5, 1, 6, 0, 4, 1, 0, 0, 55, 9], [41, 4, 1, 1, 6, 0, 4, 1, 0, 0, 40, 9]], \"covered_false\": [[42, 4, 1, 1, 8, 0, 4, 1, 0, 2415, 60, 9], [48, 6, 2, 1, 5, 4, 4, 0, 0, 0, 60, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [57, 4, 5, 1, 8, 0, 4, 1, 0, 0, 50, 9], [63, 7, 2, 1, 8, 0, 4, 1, 0, 1902, 50, 9], [51, 4, 5, 1, 8, 0, 4, 1, 0, 1887, 47, 9], [51, 2, 2, 1, 8, 1, 4, 0, 0, 0, 45, 9], [68, 7, 5, 1, 5, 0, 4, 1, 0, 2377, 42, 0], [45, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 40, 9], [45, 4, 1, 1, 8, 0, 4, 1, 0, 1977, 60, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[44, 6, 5, 1, 8, 3, 4, 0, 0, 1902, 60, 9], [58, 7, 2, 1, 5, 3, 1, 1, 4064, 0, 40, 1], [50, 7, 1, 1, 1, 3, 2, 0, 0, 0, 37, 9], [34, 4, 2, 1, 5, 3, 4, 1, 0, 0, 45, 9], [45, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [33, 7, 5, 1, 5, 3, 1, 1, 0, 0, 30, 6], [61, 7, 2, 1, 5, 3, 4, 1, 0, 0, 40, 0], [35, 4, 5, 1, 1, 3, 4, 1, 0, 0, 40, 9], [71, 2, 1, 1, 5, 3, 4, 0, 0, 0, 6, 9], [44, 4, 1, 1, 8, 3, 2, 1, 0, 0, 35, 9]], \"covered_false\": [[30, 4, 5, 1, 5, 3, 4, 1, 10520, 0, 40, 9], [54, 7, 2, 1, 8, 3, 4, 1, 0, 1902, 50, 9], [66, 6, 2, 1, 6, 3, 4, 1, 0, 2377, 25, 9], [35, 4, 2, 1, 5, 3, 4, 1, 7298, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7298, 0, 48, 9], [31, 4, 1, 1, 8, 3, 4, 0, 13550, 0, 50, 9], [35, 4, 1, 1, 8, 3, 4, 1, 8614, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
]
}
]
}
kubectl delete -f ./models/income-explainer.yaml -n ${NAMESPACE}
kubectl delete -f ./models/income.yaml -n ${NAMESPACE}
seldon model load -f ./models/income-explainer.yaml{}
seldon model status income-explainer -w ModelAvailable{}
seldon model infer income-explainer \
'{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'{
"model_name": "income-explainer_1",
"model_version": "1",
"id": "a22c3785-ff3b-4504-9b3c-199aa48a62d6",
"parameters": {},
"outputs": [
{
"name": "explanation",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"data": [
"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.0\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9518716577540107, \"coverage\": 0.07165109034267912, \"raw\": {\"feature\": [3, 5], \"mean\": [0.7959381044487428, 0.9518716577540107], \"precision\": [0.7959381044487428, 0.9518716577540107], \"coverage\": [0.3037383177570093, 0.07165109034267912], \"examples\": [{\"covered_true\": [[52, 5, 5, 1, 8, 1, 2, 0, 0, 0, 50, 9], [49, 4, 1, 1, 4, 4, 1, 0, 0, 0, 40, 1], [23, 4, 1, 1, 6, 1, 4, 1, 0, 0, 40, 9], [55, 2, 1, 1, 5, 1, 4, 0, 0, 0, 48, 9], [22, 4, 1, 1, 2, 3, 4, 0, 0, 0, 15, 9], [51, 4, 2, 1, 5, 0, 1, 1, 0, 0, 99, 4], [40, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [40, 6, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [50, 5, 5, 1, 6, 0, 4, 1, 0, 0, 55, 9], [41, 4, 1, 1, 6, 0, 4, 1, 0, 0, 40, 9]], \"covered_false\": [[42, 4, 1, 1, 8, 0, 4, 1, 0, 2415, 60, 9], [48, 6, 2, 1, 5, 4, 4, 0, 0, 0, 60, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [57, 4, 5, 1, 8, 0, 4, 1, 0, 0, 50, 9], [63, 7, 2, 1, 8, 0, 4, 1, 0, 1902, 50, 9], [51, 4, 5, 1, 8, 0, 4, 1, 0, 1887, 47, 9], [51, 2, 2, 1, 8, 1, 4, 0, 0, 0, 45, 9], [68, 7, 5, 1, 5, 0, 4, 1, 0, 2377, 42, 0], [45, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 40, 9], [45, 4, 1, 1, 8, 0, 4, 1, 0, 1977, 60, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[44, 6, 5, 1, 8, 3, 4, 0, 0, 1902, 60, 9], [58, 7, 2, 1, 5, 3, 1, 1, 4064, 0, 40, 1], [50, 7, 1, 1, 1, 3, 2, 0, 0, 0, 37, 9], [34, 4, 2, 1, 5, 3, 4, 1, 0, 0, 45, 9], [45, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [33, 7, 5, 1, 5, 3, 1, 1, 0, 0, 30, 6], [61, 7, 2, 1, 5, 3, 4, 1, 0, 0, 40, 0], [35, 4, 5, 1, 1, 3, 4, 1, 0, 0, 40, 9], [71, 2, 1, 1, 5, 3, 4, 0, 0, 0, 6, 9], [44, 4, 1, 1, 8, 3, 2, 1, 0, 0, 35, 9]], \"covered_false\": [[30, 4, 5, 1, 5, 3, 4, 1, 10520, 0, 40, 9], [54, 7, 2, 1, 8, 3, 4, 1, 0, 1902, 50, 9], [66, 6, 2, 1, 6, 3, 4, 1, 0, 2377, 25, 9], [35, 4, 2, 1, 5, 3, 4, 1, 7298, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7298, 0, 48, 9], [31, 4, 1, 1, 8, 3, 4, 0, 13550, 0, 50, 9], [35, 4, 1, 1, 8, 3, 4, 1, 8614, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
]
}
]
}
seldon model unload income-explainer
{}
seldon model unload income
{}
cat ./models/moviesentiment.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/moviesentiment-sklearn"
requirements:
- sklearn
kubectl apply -f ./models/moviesentiment.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment created
kubectl wait --for condition=ready --timeout=300s model sentiment -n ${NAMESPACE}
model.mlops.seldon.io/sentiment condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'{
"model_name": "sentiment_2",
"model_version": "1",
"id": "f5c07363-7e9d-4f09-aa30-228c81fdf4a4",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
0
]
}
]
}
seldon model load -f ./models/moviesentiment.yaml{}
seldon model status sentiment -w ModelAvailable{}
seldon model infer sentiment \
'{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'{
"model_name": "sentiment_2",
"model_version": "1",
"id": "f5c07363-7e9d-4f09-aa30-228c81fdf4a4",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
0
]
}
]
}
cat ./models/moviesentiment-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-explainer
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/moviesentiment-sklearn-explainer"
explainer:
type: anchor_text
modelRef: sentiment
kubectl apply -f ./models/moviesentiment-explainer.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explainer created
kubectl wait --for condition=ready --timeout=300s model sentiment-explainer -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explainer condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'seldon model load -f ./models/moviesentiment-explainer.yaml{}
seldon model status sentiment-explainer -w ModelAvailable{}
seldon model infer sentiment-explainer \
'{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'Error: V2 server error: 500 Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/opt/conda/lib/python3.8/site-packages/starlette_exporter/middleware.py", line 307, in __call__
await self.app(scope, receive, wrapped_send)
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/gzip.py", line 24, in __call__
await responder(scope, receive, send)
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/gzip.py", line 44, in __call__
await self.app(scope, receive, self.send_with_gzip)
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/opt/conda/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/opt/conda/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 706, in __call__
await route.handle(scope, receive, send)
File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/opt/conda/lib/python3.8/site-packages/mlserver/rest/app.py", line 42, in custom_route_handler
return await original_route_handler(request)
File "/opt/conda/lib/python3.8/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/opt/conda/lib/python3.8/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
return await dependant.call(**values)
File "/opt/conda/lib/python3.8/site-packages/mlserver/rest/endpoints.py", line 99, in infer
inference_response = await self._data_plane.infer(
File "/opt/conda/lib/python3.8/site-packages/mlserver/handlers/dataplane.py", line 103, in infer
prediction = await model.predict(payload)
File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/runtime.py", line 86, in predict
output_data = await self._async_explain_impl(input_data, payload.parameters)
File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/runtime.py", line 119, in _async_explain_impl
explanation = await loop.run_in_executor(self._executor, explain_call)
File "/opt/conda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/explainers/black_box_runtime.py", line 62, in _explain_impl
input_data = input_data[0]
KeyError: 0
kubectl delete -f ./models/moviesentiment-explainer.yaml -n ${NAMESPACE}
kubectl delete -f ./models/moviesentiment.yaml -n ${NAMESPACE}
seldon model unload sentiment-explainer
{}
seldon model unload sentiment
{}
// scheduler/pkg/metrics/gateway.go
// The aggregate metrics exist for efficiency, as the summation can be
// very slow in Prometheus when many pipelines exist.
const (
// Histograms do no include model label for efficiency
pipelineHistogramName = "seldon_pipeline_infer_api_seconds"
// We use base infer counters to store core metrics per pipeline
pipelineInferCounterName = "seldon_pipeline_infer_total"
pipelineInferLatencyCounterName = "seldon_pipeline_infer_seconds_total"
pipelineAggregateInferCounterName = "seldon_pipeline_aggregate_infer_total"
pipelineAggregateInferLatencyCounterName = "seldon_pipeline_aggregate_infer_seconds_total"
)
pip install tritonclient[all]
import os
os.environ["NAMESPACE"] = "seldon-mesh"
MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP
'172.19.255.1'
cat models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
cat pipelines/iris.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: iris-pipeline
spec:
steps:
- name: iris
output:
steps:
- iris
kubectl apply -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
kubectl apply -f pipelines/iris.yaml -n ${NAMESPACE}model.mlops.seldon.io/iris created
pipeline.mlops.seldon.io/iris-pipeline created
kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s pipelines iris-pipeline -n ${NAMESPACE}model.mlops.seldon.io/iris condition met
pipeline.mlops.seldon.io/iris-pipeline condition met
import tritonclient.http as httpclient
import numpy as np
http_triton_client = httpclient.InferenceServerClient(
url=f"{MESH_IP}:80",
verbose=False,
)
print("model ready:", http_triton_client.is_model_ready("iris"))
print("model metadata:", http_triton_client.get_model_metadata("iris"))model ready: True
model metadata: {'name': 'iris_1', 'versions': [], 'platform': '', 'inputs': [], 'outputs': [], 'parameters': {}}
# Against model
binary_data = False
inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("predict", binary_data=binary_data)]
result = http_triton_client.infer("iris", inputs, outputs=outputs)
result.as_numpy("predict")array([[2]])
# Against pipeline
binary_data = False
inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("predict", binary_data=binary_data)]
result = http_triton_client.infer("iris-pipeline.pipeline", inputs, outputs=outputs)
result.as_numpy("predict")array([[2]])
import tritonclient.grpc as grpcclient
import numpy as np
grpc_triton_client = grpcclient.InferenceServerClient(
url=f"{MESH_IP}:80",
verbose=False,
)model_name = "iris"
headers = {"seldon-model": model_name}
print("model ready:", grpc_triton_client.is_model_ready(model_name, headers=headers))
print(grpc_triton_client.get_model_metadata(model_name, headers=headers))model ready: True
name: "iris_1"
model_name = "iris"
headers = {"seldon-model": model_name}
inputs = [
grpcclient.InferInput("predict", (1, 4), "FP64"),
]
inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"))
outputs = [grpcclient.InferRequestedOutput("predict")]
result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
result.as_numpy("predict")array([[2]])
model_name = "iris-pipeline.pipeline"
headers = {"seldon-model": model_name}
inputs = [
grpcclient.InferInput("predict", (1, 4), "FP64"),
]
inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"))
outputs = [grpcclient.InferRequestedOutput("predict")]
result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
result.as_numpy("predict")array([[2]])
cat models/tfsimple1.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
cat pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl apply -f models/tfsimple1.yaml -n ${NAMESPACE}
kubectl apply -f pipelines/tfsimple.yaml -n ${NAMESPACE}model.mlops.seldon.io/tfsimple1 created
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s pipelines tfsimple -n ${NAMESPACE}model.mlops.seldon.io/tfsimple1 condition met
pipeline.mlops.seldon.io/tfsimple condition met
import tritonclient.http as httpclient
import numpy as np
http_triton_client = httpclient.InferenceServerClient(
url=f"{MESH_IP}:80",
verbose=False,
)
print("model ready:", http_triton_client.is_model_ready("iris"))
print("model metadata:", http_triton_client.get_model_metadata("iris"))model ready: True
model metadata: {'name': 'iris_1', 'versions': [], 'platform': '', 'inputs': [], 'outputs': [], 'parameters': {}}
# Against model (no binary data)
binary_data = False
inputs = [
httpclient.InferInput("INPUT0", (1, 16), "INT32"),
httpclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
result = http_triton_client.infer("tfsimple1", inputs, outputs=outputs)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
# Against model (with binary data)
binary_data = True
inputs = [
httpclient.InferInput("INPUT0", (1, 16), "INT32"),
httpclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
result = http_triton_client.infer("tfsimple1", inputs, outputs=outputs)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
# Against Pipeline (no binary data)
binary_data = False
inputs = [
httpclient.InferInput("INPUT0", (1, 16), "INT32"),
httpclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
result = http_triton_client.infer("tfsimple.pipeline", inputs, outputs=outputs)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
## binary data does not work with http behind pipeline
# import numpy as np
# binary_data = True
# inputs = [
# httpclient.InferInput("INPUT0", (1, 16), "INT32"),
# httpclient.InferInput("INPUT1", (1, 16), "INT32"),
# ]
# inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
# inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
# outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
# result = http_triton_client.infer("tfsimple.pipeline", inputs, outputs=outputs)
# result.as_numpy("OUTPUT0")import tritonclient.grpc as grpcclient
import numpy as np
grpc_triton_client = grpcclient.InferenceServerClient(
url=f"{MESH_IP}:80",
verbose=False,
)model_name = "tfsimple1"
headers = {"seldon-model": model_name}
print("model ready:", grpc_triton_client.is_model_ready(model_name, headers=headers))
print(grpc_triton_client.get_model_metadata(model_name, headers=headers))model ready: True
name: "tfsimple1_1"
versions: "1"
platform: "tensorflow_graphdef"
inputs {
name: "INPUT0"
datatype: "INT32"
shape: -1
shape: 16
}
inputs {
name: "INPUT1"
datatype: "INT32"
shape: -1
shape: 16
}
outputs {
name: "OUTPUT0"
datatype: "INT32"
shape: -1
shape: 16
}
outputs {
name: "OUTPUT1"
datatype: "INT32"
shape: -1
shape: 16
}
# Against Model
model_name = "tfsimple1"
headers = {"seldon-model": model_name}
inputs = [
grpcclient.InferInput("INPUT0", (1, 16), "INT32"),
grpcclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]
result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
# Against Pipeline
model_name = "tfsimple.pipeline"
headers = {"seldon-model": model_name}
inputs = [
grpcclient.InferInput("INPUT0", (1, 16), "INT32"),
grpcclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]
result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
kubectl delete -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
kubectl delete -f pipelines/iris.yaml -n ${NAMESPACE}model.mlops.seldon.io "iris" deleted
pipeline.mlops.seldon.io "iris-pipeline" deleted
kubectl delete -f models/tfsimple1.yaml -n ${NAMESPACE}
kubectl delete -f pipelines/tfsimple.yaml -n ${NAMESPACE}model.mlops.seldon.io "tfsimple1" deleted
pipeline.mlops.seldon.io "tfsimple" deleted
from prometheus_client.parser import text_string_to_metric_families
import requests
def scrape_metrics(host):
data = requests.get(f"http://{host}/metrics").text
return {
family.name: family for family in text_string_to_metric_families(data)
}
def print_sample(family, label, value):
for sample in family.samples:
if sample.labels[label] == value:
print(sample)
def get_model_infer_count(host, model_name):
metrics = scrape_metrics(host)
family = metrics["seldon_model_infer"]
print_sample(family, "model", model_name)
def get_pipeline_infer_count(host, pipeline_name):
metrics = scrape_metrics(host)
family = metrics["seldon_pipeline_infer"]
print_sample(family, "pipeline", pipeline_name)seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris --inference-mode grpc -i 100 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'Success: map[:iris_1::100]
get_model_infer_count(mlserver_metrics_host,"iris")Sample(name='seldon_model_infer_total', labels={'code': '200', 'method_type': 'rest', 'model': 'iris', 'model_internal': 'iris_1', 'server': 'mlserver', 'server_replica': '0'}, value=50.0, timestamp=None, exemplar=None)
Sample(name='seldon_model_infer_total', labels={'code': 'OK', 'method_type': 'grpc', 'model': 'iris', 'model_internal': 'iris_1', 'server': 'mlserver', 'server_replica': '0'}, value=100.0, timestamp=None, exemplar=None)
seldon model unload iris
{}
seldon model load -f ./models/tfsimple1.yaml
seldon model status tfsimple1 -w ModelAvailable | jq -M .{}
{}
seldon model infer tfsimple1 -i 50\
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'Success: map[:tfsimple1_1::50]
seldon model infer tfsimple1 --inference-mode grpc -i 100 \
'{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'Success: map[:tfsimple1_1::100]
get_model_infer_count(triton_metrics_host,"tfsimple1")Sample(name='seldon_model_infer_total', labels={'code': '200', 'method_type': 'rest', 'model': 'tfsimple1', 'model_internal': 'tfsimple1_1', 'server': 'triton', 'server_replica': '0'}, value=50.0, timestamp=None, exemplar=None)
Sample(name='seldon_model_infer_total', labels={'code': 'OK', 'method_type': 'grpc', 'model': 'tfsimple1', 'model_internal': 'tfsimple1_1', 'server': 'triton', 'server_replica': '0'}, value=100.0, timestamp=None, exemplar=None)
seldon model unload tfsimple1{}
seldon model load -f ./models/tfsimple1.yaml
seldon model load -f ./models/tfsimple2.yaml
seldon model status tfsimple1 -w ModelAvailable | jq -M .
seldon model status tfsimple2 -w ModelAvailable | jq -M .
seldon pipeline load -f ./pipelines/tfsimples.yaml
seldon pipeline status tfsimples -w PipelineReady{}
{}
{}
{}
{}
{"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"cdqji39qa12c739ab3o0", "version":2, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":2, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2022-11-16T19:25:01.255955114Z"}}]}seldon pipeline infer tfsimples -i 50 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'Success: map[:tfsimple1_1::50 :tfsimple2_1::50 :tfsimples.pipeline::50]get_pipeline_infer_count(pipeline_metrics_host,"tfsimples")Sample(name='seldon_pipeline_infer_total', labels={'code': '200', 'method_type': 'rest', 'pipeline': 'tfsimples', 'server': 'pipeline-gateway'}, value=50.0, timestamp=None, exemplar=None)seldon model unload tfsimple1
seldon model unload tfsimple2
seldon pipeline unload tfsimples{}
{}
{}# samples/models/tfsimple_scaling.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
minReplicas: 1
replicas: 1
Learn how to set up a self-hosted Kafka cluster for Seldon Core in development and learning environments.
You can run Kafka in the same Kubernetes cluster that hosts Seldon Core 2. Seldon recommends using the Strimzi operator for Kafka installation and maintenance. For more details about configuring Kafka with Seldon Core 2, see the Configuration section.
Integrating self-hosted Kafka with Seldon Core 2 includes these steps:
Strimzi provides a Kubernetes Operator to deploy and manage Kafka clusters. First, we need to install the Strimzi Operator in your Kubernetes cluster.
Create a namespace where you want to install Kafka. For example the namespace seldon-mesh:
Install Strimzi.
Install Strimzi Operator.
This deploys the Strimzi Operator in the seldon-mesh namespace. After the Strimzi Operator is running, you can create a Kafka cluster by applying a Kafka custom resource definition.
Apply the Kafka cluster configuration.
Create a YAML file named kafka-nodepool.yaml to create a nodepool for the kafka cluster.
Apply the Kafka node pool configuration.
Check the status of the Kafka Pods to ensure they are running properly:
You should see multiple Pods for Kafka, and Strimzi operators running.
Error
The Pod that begins with the name seldon-dataflow-engine does not show the status as Running.
One of the possible reasons could be that the DNS resolution for the service failed.
Solution
Check the logs of the Pod <seldon-dataflow-engine>:
In the output check if a message reads:
Verify the name in the metadata for the kafka.yaml and kafka-nodepool.yaml. It should read seldon
When the SeldonRuntime is installed in a namespace a ConfigMap is created with the
settings for Kafka configuration. Update the ConfigMap only if you need to customize the configurations.
Verify that the ConfigMap resource named seldon-kafka that is created in the namespace seldon-mesh:
You should have the ConfigMaps for Kafka, Zookeeper, Strimzi operators, and others.
View the configuration of the ConfigMap named seldon-kafka.
You should see an output similar to this:
After you have integrated Seldon Core 2 with Kafka, you need to configure an ingress controller or service mesh, which adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
To customize the settings you can add and modify the Kafka configuration using Helm, for example to add compression for producers.
Create a YAML file to specify the compression configuration for Seldon Core 2 runtime. For example, create the values-runtime-kafka-compression.yaml file. Use your preferred text editor to create and save the file with the following content:
Change to the directory that contains the values-runtime-kafka-compression.yaml file and then install Seldon Core 2 runtime in the namespace seldon-mesh.
If you are using a shared Kafka cluster with other applications, it is advisable to isolate topic names and consumer group IDs from other cluster users to prevent naming conflicts. This can be achieved by configuring the following two settings:
topicPrefix: set a prefix for all topics
consumerGroupIdPrefix: set a prefix for all consumer groups
Here's an example of how to configure topic name and consumer group ID isolation during a Helm installation for an application named myorg:
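A values-file sketch for this is shown below; the key nesting may differ between chart versions, so verify the paths against the chart's values before use:
# values-runtime-kafka-isolation.yaml (hypothetical file name)
kafka:
  topicPrefix: myorg              # all Core 2 topics are prefixed with "myorg"
  consumerGroupIdPrefix: myorg    # all consumer group IDs are prefixed with "myorg"
This file would then be passed to the runtime Helm installation with --values, in the same way as the compression example above.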
After you have installed Seldon Core 2 and Kafka using Helm, complete the remaining integration steps.
Learn how to run performance tests for Seldon Core 2 deployments, including load testing, benchmarking, and analyzing inference latency and throughput metrics.
This section describes how a user can run performance tests to understand the limits of a particular SCv2 deployment.
The base directory is tests/k6.
k6 is used to drive requests for load, unload, and infer workloads. It is recommended that the load test is run within the same cluster that has SCv2 installed, as it requires internal access to some of the services that are not automatically exposed to the outside world. Furthermore, having the driver within the same cluster minimises link latency to the SCv2 entrypoint, so infer latencies are more representative of the actual overheads of the system.
Envoy Tests synchronous inference requests via envoy
To run: make deploy-envoy-test
Agent Tests inference requests direct to a specific agent, defaults to triton-0 or mlserver-0
To run: make deploy-rproxy-test or make deploy-rproxy-mlserver-test
Server Tests inference requests direct to a specific server (bypassing agent), defaults to triton-0 or mlserver-0
To run: make deploy-server-test or make deploy-server-mlserver-test
Pipeline gateway (HTTP-Kafka gateway) Tests inference requests to a one-node pipeline via HTTP and gRPC
To run: make deploy-kpipeline-test
Model gateway (Kafka-HTTP gateway) Tests inference requests to a model via kafka
To run: make deploy-kmodel-test
One way to look at results is to check the log of the pod that executed the Kubernetes job.
Results can also be persisted to a GCS bucket; a service account k6-sa-key in the same namespace is required.
Users can also look at the metrics that are exposed in Prometheus while the test is underway.
If a user is modifying the actual scenario of the test:
export DOCKERHUB_USERNAME=mydockerhubaccount
build the k6 image via make build-push
in the same shell environment, deploying jobs will use this custom-built Docker image
Users can modify settings of the tests in tests/k6/configs/k8s/base/k6.yaml. This will apply to all subsequent tests that are deployed using the above process.
Some settings that can be changed
k6 args
for a full list, check
Environment variables
for MODEL_TYPE, choose from:
Learn how to create and manage ML pipelines in Seldon Core using Kubernetes custom resources, including model chaining and tensor mapping.
Pipelines allow one to connect flows of inference data transformed by Model components. A directed acyclic graph (DAG) of steps can be defined to join Models together. Each Model will need to be capable of receiving a V2 inference request and respond with a V2 inference response. An example Pipeline is shown below:
The steps list shows three models: tfsimple1, tfsimple2 and tfsimple3. These three models each take two tensors called INPUT0 and INPUT1 of integers. The models produce two outputs OUTPUT0 (the sum of the inputs) and OUTPUT1 (subtraction of the second input from the first).
tfsimple1 and tfsimple2 take as inputs the input to the Pipeline: this is the default assumption when no explicit inputs are defined. tfsimple3 takes one V2 tensor input from each of the outputs of tfsimple1 and tfsimple2. As the outputs of tfsimple1 and tfsimple2 have tensors named OUTPUT0 and OUTPUT1, their names need to be changed to match the expected input tensors; this is done with a tensorMap component providing the tensor renaming. This is only required if your models cannot be directly chained together.
The output of the Pipeline is the output from the tfsimple3 model.
Seldon Core 2 supports cyclic pipelines, enabling the creation of feedback loops within the inference graph. However, the cyclic pipelines should be used carefully, as incorrect configurations can lead to unintended behavior.
The risk of joining messages from different iterations (i.e., a message from iteration t might be joined with messages from t-1, t-2, ..., 1). If a feedback message re-enters the pipeline within the join window and reaches a step already holding messages from a previous iteration, Kafka Streams may join messages across iterations. This can trigger unintended message propagation. For more details on how Kafka Streams handles joins and the implications for feedback loops, refer to the Kafka Streams documentation on join semantics.
Seldon Core 2 provides a maxStepRevisits parameter in the pipeline manifest. This parameter limits the number of times a step can be revisited within a single pipeline execution. If the limit is reached, the pipeline execution terminates and returns an error. This feature is useful in cyclic pipelines, where infinite loops might occur (e.g., in agentic workflows where control flow is determined by an LLM), and helps safeguard against unintended infinite loops. By default, maxStepRevisits is set to 0 (i.e., no cycles), but you can adjust it according to your use case.
To enable a cyclic pipeline, set the allowCycles flag in your pipeline manifest:
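A minimal sketch follows. The pipeline and step names are placeholders, the step wiring that actually creates the feedback loop is omitted, and the field placement should be checked against the full specification below; the sketch only illustrates the two cycle-related fields:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: counter
spec:
  allowCycles: true     # opt in to cyclic graphs
  maxStepRevisits: 10   # terminate with an error if any step is revisited more than 10 times
  steps:
  - name: counter-model
  output:
    steps:
    - counter-model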
The full GoLang specification for a Pipeline is shown below:
Learn how to configure Istio as an ingress controller for Seldon Core, including traffic management and security policies.
An ingress controller functions as a reverse proxy and load balancer, implementing a Kubernetes Ingress. It adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
Seldon Core 2 works seamlessly with any service mesh or ingress controller, offering flexibility in your deployment setup. This guide provides detailed instructions for installing and configuring Istio with Seldon Core 2.
Istio implements the Kubernetes ingress resource to expose a service and make it accessible from outside the cluster. You can install Istio either in a self-hosted Kubernetes cluster or in a managed Kubernetes service provided by a cloud provider that runs Seldon Core 2.
Install Istio.
Ensure that you install a version of Istio that is compatible with your Kubernetes cluster version. For detailed information on supported versions, refer to the Istio documentation.
Installing Istio ingress controller in a Kubernetes cluster running Seldon Core 2 involves these tasks:
Add the Istio Helm charts repository and update it:
Create the istio-system namespace where Istio components are installed:
Install the base component:
Install Istiod, the Istio control plane:
Install Istio Ingress Gateway:
Verify that Istio Ingress Gateway is installed:
This should return details of the Istio Ingress Gateway, including the external IP address.
Verify that all Istio Pods are running:
The output is similar to:
It is important to expose the seldon-mesh service to enable communication between deployed machine learning models and external clients or services. The Seldon Core 2 inference API is exposed through the seldon-mesh service in the seldon-mesh namespace. If you install Core 2 in multiple namespaces, you need to expose the seldon-mesh service in each namespace.
Verify that the seldon-mesh service is running, for example in the namespace seldon.
When the services are running you should see something similar to this:
Create a YAML file to create a VirtualService named iris-route in the namespace seldon-mesh. For example, create the seldon-mesh-vs.yaml file. Use your preferred text editor to create and save the file with the following content:
Additional Resources
This section covers various aspects of optimizing pipeline performance in Seldon Core 2, from testing methodologies to Core 2 configuration. Each subsection provides detailed guidance on different aspects of pipeline performance tuning:
Learn how to effectively test and optimize pipeline performance:
Understanding pipeline latency components
Identifying and reducing bottlenecks
Balancing model replicas and resources
Explore how to configure Core 2 components for optimal pipeline performance:
Understanding Core 2 data processing components
Optimizing Kafka integration
Configuring Core 2 services
Understand how Core 2 components scale with the number of deployed pipelines and models:
Dynamic scaling of dataflow engine, model gateway, and pipeline gateway
Loading and unloading of models and pipelines
Assignment of pipelines and models to replicas
Each of these aspects plays a crucial role in achieving optimal pipeline performance. We recommend starting with testing individual models in your pipeline, then using those insights to inform your Core 2 configuration and overall pipeline optimization strategies.
Learn how to deploy a cyclic pipeline using Core 2. In this example, you'll build a simple counter that begins at a user-defined starting value and increments by one until it reaches 10. If the starting value is already greater than 10, the pipeline terminates immediately without running.
Ensure that you have Seldon Core 2 installed in the namespace seldon-mesh.
This guide walks you through setting up Jaeger Tracing for Seldon Core v2 on Kubernetes. By the end of this guide, you will be able to visualize inference traces through your Core 2 components.
Explore how Seldon Core 2 uses data flow paradigm and Kafka-based streaming to improve ML model deployment with better scalability, fault tolerance, and data observability.
Seldon Core 2 is designed around the data flow paradigm. Here we explain what that means and some of the rationale behind this choice.
The initial release of Seldon Core introduced the concept of an inference graph, which can be thought of as a sequence of operations applied to an inference request. Here is how it may look:
In reality, though, this was not how Seldon Core v1 was implemented. Instead, a Seldon deployment consists of a range of independent services that host models, transformations, detectors and explainers, and a central orchestrator that knows the inference graph topology and makes service calls in the correct order, passing data between requests and responses as necessary. Here is how the picture looks under the hood:
!kubectl get servers -n seldon-mesh
NAME READY REPLICAS LOADED MODELS AGE
mlserver True 1 0 156d
mlserver-custom True 1 0 38d
triton True 1 0 156d
!cat ../../../samples/quickstart/servers/mlserver-custom.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-custom
spec:
serverConfig: mlserver
capabilities:
- income-classifier-deps
podSpec:
containers:
- image: seldonio/mlserver:1.6.0
name: mlserver
!kubectl apply -f ../../../samples/quickstart/servers/mlserver-custom.yaml -n seldon-mesh
server.mlops.seldon.io/mlserver-custom unchanged
!gcloud storage ls --recursive gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier
gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/:
gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model-settings.json
gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model.joblib
!gsutil cat gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model-settings.json
{
"name": "income",
"implementation": "mlserver_sklearn.SKLearnModel",
"parameters": {
"uri": "./model.joblib",
"version": "v0.1.0"
}
}
!cat ../../../samples/quickstart/models/sklearn-income-classifier.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-classifier
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier"
requirements:
- income-classifier-deps
memory: 100Ki
!kubectl apply -f ../../../samples/quickstart/models/sklearn-income-classifier.yaml -n seldon-mesh
model.mlops.seldon.io/income-classifier created
MESH_IP = !kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP = MESH_IP[0]
MESH_IP
'34.32.149.48'
endpoint = f"http://{MESH_IP}/v2/models/income-classifier/infer"
headers = {
"Seldon-Model": "income-classifier",
}

inference_request = {
"inputs": [
{
"name": "income",
"datatype": "INT64",
"shape": [1, 12],
"data": [53, 4, 0, 2, 8, 4, 2, 0, 0, 0, 60, 9]
}
]
}

import requests

response = requests.post(endpoint, headers=headers, json=inference_request)
response.json()

{'model_name': 'income-classifier_1',
'model_version': '1',
'id': '626ebe8e-bc95-433f-8f5f-ef296625622a',
'parameters': {},
'outputs': [{'name': 'predict',
'shape': [1, 1],
'datatype': 'INT64',
'parameters': {'content_type': 'np'},
'data': [0]}]}

import re
import numpy as np
# Extracts numerical values from a formatted text and outputs a vector of numerical values.
def extract_numerical_values(input_text):
    # Find key-value pairs in text
    pattern = r'"[^"]+":\s*"([^"]+)"'
    matches = re.findall(pattern, input_text)
    # Extract numerical values
    numerical_values = []
    for value in matches:
        cleaned_value = value.replace(",", "")
        if cleaned_value.isdigit():  # Integer
            numerical_values.append(int(cleaned_value))
        else:
            try:
                numerical_values.append(float(cleaned_value))
            except ValueError:
                pass
    # Return array of numerical values
    return np.array(numerical_values)

input_text = '''
"Age": "47",
"Workclass": "4",
"Education": "1",
"Marital Status": "1",
"Occupation": "1",
"Relationship": "0",
"Race": "4",
"Sex": "1",
"Capital Gain": "0",
"Capital Loss": "0",
"Hours per week": "68",
"Country": "9",
"Name": "John Doe"
'''
extract_numerical_values(input_text)

array([47, 4, 1, 1, 1, 0, 4, 1, 0, 0, 68, 9])

!gcloud storage ls --recursive gs://seldon-models/scv2/samples/preprocessor

gs://seldon-models/scv2/samples/preprocessor/:
gs://seldon-models/scv2/samples/preprocessor/model-settings.json
gs://seldon-models/scv2/samples/preprocessor/model.py
gs://seldon-models/scv2/samples/preprocessor/preprocessor.yaml

!cat ../../../samples/quickstart/models/preprocessor/preprocessor.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: preprocessor
spec:
storageUri: "gs://seldon-models/scv2/samples/preprocessor"
requirements:
- income-classifier-deps

!kubectl apply -f ../../../samples/quickstart/models/preprocessor/preprocessor.yaml -n seldon-mesh

model.mlops.seldon.io/preprocessor created

endpoint_pp = f"http://{MESH_IP}/v2/models/preprocessor/infer"
headers_pp = {
"Seldon-Model": "preprocessor",
}

text_inference_request = {
"inputs": [
{
"name": "text_input",
"shape": [1],
"datatype": "BYTES",
"data": [input_text]
}
]
}

import requests

response = requests.post(endpoint_pp, headers=headers_pp, json=text_inference_request)
response.json()

{'model_name': 'preprocessor_1',
'model_version': '1',
'id': 'b26e49d5-2a4c-488b-8dff-0df850fbed3d',
'parameters': {},
'outputs': [{'name': 'output',
'shape': [1, 12],
'datatype': 'INT64',
'parameters': {'content_type': 'np'},
'data': [47, 4, 1, 1, 1, 0, 4, 1, 0, 0, 68, 9]}]}

!cat ../../../samples/quickstart/pipelines/income-classifier-app.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: income-classifier-app
spec:
steps:
- name: preprocessor
- name: income-classifier
inputs:
- preprocessor
output:
steps:
- income-classifier

!kubectl apply -f ../../../samples/quickstart/pipelines/income-classifier-app.yaml -n seldon-mesh

pipeline.mlops.seldon.io/income-classifier-app created

pipeline_endpoint = f"http://{MESH_IP}/v2/models/income-classifier-app/infer"
pipeline_headers = {
"Seldon-Model": "income-classifier-app.pipeline"
}

pipeline_response = requests.post(
pipeline_endpoint, json=text_inference_request, headers=pipeline_headers
)
pipeline_response.json()

{'model_name': '',
'outputs': [{'data': [0],
'name': 'predict',
'shape': [1, 1],
'datatype': 'INT64',
'parameters': {'content_type': 'np'}}]}

!kubectl delete -f ../../../samples/quickstart/pipelines/income-classifier-app.yaml -n seldon-mesh
!kubectl delete -f ../../../samples/quickstart/models/preprocessor/preprocessor.yaml -n seldon-mesh
!kubectl delete -f ../../../samples/quickstart/models/sklearn-income-classifier.yaml -n seldon-mesh
!kubectl delete -f ../../../samples/quickstart/servers/mlserver-custom.yaml -n seldon-mesh

pipeline.mlops.seldon.io "income-classifier-app" deleted
model.mlops.seldon.io "preprocessor" deleted
model.mlops.seldon.io "income-classifier" deleted
server.mlops.seldon.io "mlserver-custom" deleted

apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: addmul10
spec:
default: pipeline-add10
resourceType: pipeline
candidates:
- name: pipeline-add10
weight: 50
- name: pipeline-mul10
weight: 50
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-iris
spec:
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
Create a virtual service to expose the seldon-mesh service.
When the virtual service is created, you should see this:
# tests/k6/configs/k8s/base/k6.yaml
args: [
"--no-teardown",
"--summary-export",
"results/base.json",
"--out",
"csv=results/base.gz",
"-u",
"5",
"-i",
"100000",
"-d",
"120m",
"scenarios/infer_constant_vu.js",
]
# # infer_constant_rate
# args: [
# "--no-teardown",
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "scenarios/infer_constant_rate.js",
# ]
# # k8s-test-script
# args: [
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "scenarios/k8s-test-script.js",
# ]
# # core2_qa_control_plane_ops
# args: [
# "--no-teardown",
# "--verbose",
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "-u",
# "5",
# "-i",
# "10000",
# "-d",
# "9h",
# "scenarios/core2_qa_control_plane_ops.js",
# ]
- name: INFER_HTTP_ITERATIONS
value: "1"
- name: INFER_GRPC_ITERATIONS
value: "1"
- name: MODELNAME_PREFIX
value: "tfsimplea,pytorch-cifar10a,tfmnista,mlflow-winea,irisa"
- name: MODEL_TYPE
value: "tfsimple,pytorch_cifar10,tfmnist,mlflow_wine,iris"
# Specify MODEL_MEMORY_BYTES using unit-of measure suffixes (k, M, G, T)
# rather than numbers without units of measure. If supplying "naked
# numbers", the seldon operator will take care of converting the number
# for you but also take ownership of the field (as FieldManager), so the
# next time you run the scenario creating/updating of the model CR will
# fail.
- name: MODEL_MEMORY_BYTES
value: "400k,8M,43M,200k,3M"
- name: MAX_MEM_UPDATE_FRACTION
value: "0.1"
- name: MAX_NUM_MODELS
value: "800,100,25,100,100"
# value: "0,0,25,100,100"
#
# MAX_NUM_MODELS_HEADROOM is a variable used by control-plane tests.
# It's the approximate number of models that can be created over
# MAX_NUM_MODELS over the experiment. In the worst case scenario
# (very unlikely) the HEADROOM values may temporarily exceed the ones
# specified here with the number of VUs, because each VU checks the
# headroom constraint independently before deciding on the available
# operations (no communication/sync between VUs)
# - name: MAX_NUM_MODELS_HEADROOM
# value: "20,5,0,20,30"
#
# MAX_MODEL_REPLICAS is used by control-plane tests. It controls the
# maximum number of replicas that may be requested when
# creating/updating models of a given type.
# - name: MAX_MODEL_REPLICAS
# value: "2,2,0,2,2"
#
- name: INFER_BATCH_SIZE
value: "1,1,1,1,1"
# MODEL_CREATE_UPDATE_DELETE_BIAS defines the probability ratios between
# the operations, for control-plane tests. For example, "1, 4, 3"
# makes an Update four times more likely than a Create, and a Delete 3
# times more likely than the Create.
# - name: MODEL_CREATE_UPDATE_DELETE_BIAS
# value: "1,3,1"
- name: WARMUP
value: "false"// tests/k6/components/model.js
import { dump as yamlDump } from "https://cdn.jsdelivr.net/npm/[email protected]/dist/js-yaml.mjs";
import { getConfig } from '../components/settings.js'
const tfsimple_string = "tfsimple_string"
const tfsimple = "tfsimple"
const iris = "iris" // mlserver
const pytorch_cifar10 = "pytorch_cifar10"
const tfmnist = "tfmnist"
const tfresnet152 = "tfresnet152"
const onnx_gpt2 = "onnx_gpt2"
const mlflow_wine = "mlflow_wine" // mlserver
const add10 = "add10" // https://github.com/SeldonIO/triton-python-examples/tree/master/add10
const sentiment = "sentiment" // mlserverkubectl apply -f seldon-mesh-vs.yamlvirtualservice.networking.istio.io/iris-route createdhelm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo updatekubectl create namespace istio-systemhelm install istio-base istio/base -n istio-systemhelm install istiod istio/istiod -n istio-system --waithelm install istio-ingressgateway istio/gateway -n istio-systemkubectl get svc istio-ingressgateway -n istio-systemkubectl get pods -n istio-systemNAME READY STATUS RESTARTS AGE
istiod-xxxxxxx-xxxxx 1/1 Running 0 2m
istio-ingressgateway-xxxxx 1/1 Running 0 2mkubectl get svc -n seldon-meshmlserver-0 ClusterIP None <none> 9000/TCP,9500/TCP,9005/TCP 43m
seldon-mesh LoadBalancer 34.118.225.130 34.90.213.15 80:32228/TCP,9003:31265/TCP 45m
seldon-pipelinegateway ClusterIP None <none> 9010/TCP,9011/TCP 45m
seldon-scheduler LoadBalancer 34.118.225.138 35.204.34.162 9002:32099/TCP,9004:32100/TCP,9044:30342/TCP,9005:30473/TCP,9055:32732/TCP,9008:32716/TCP 45m
triton-0 ClusterIP None <none> 9000/TCP,9500/TCP,9005/TCP

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: iris-route
namespace: seldon-mesh
spec:
gateways:
- istio-system/seldon-gateway
hosts:
- "*"
http:
- name: iris-http
match:
- uri:
prefix: /v2
route:
- destination:
host: seldon-mesh.seldon-mesh.svc.cluster.local

ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"
Create a YAML file to specify the initial configuration.
Note: This configuration sets up a Kafka cluster with version 3.9.0. Ensure that you review the supported versions of Kafka and update the version in the kafka.yaml file as needed. For more configuration examples, see the strimzi-kafka-operator repository.
Use your preferred text editor to create and save the file as kafka.yaml with the following content:
Check the name of the Kafka services in the namespace:
Restart the Pod:
Ensure that you are performing these steps in the directory where you have downloaded the samples.
Get the IP address of the Seldon Core 2 instance running with Istio:
Start by implementing the first model: a simple counter.
This model produces two output tensors. The first contains the incremented number, while the second is an empty tensor labeled either continue or stop. This second tensor acts as a trigger, directing the data flow through either the feedback loop or the output path. For more information on triggering tensors, see the intro to pipelines page.
Next, define the second model — an identity model:
The identity model simply passes the input tensors through to the output while introducing a delay. This delay is crucial for preventing infinite loops in the pipeline, which can occur due to the join interval behavior in Kafka Streams. For further details, see Kafka documentation.
This counter application pipeline consists of three models: the counter model, an identity model for the feedback loop, and another identity model for the output. The structure of the pipeline is illustrated as follows:
To deploy the pipeline, you need to load each model into the cluster. The model-settings.json configuration for the counter model is as follows:
For the identity feedback loop model, reuse the model-settings.json file and configure it to include a 1-millisecond delay:
The one-millisecond delay is crucial to prevent infinite loops in the pipeline. It aligns with the join window applied to all input types for the counter model, as well as the join window configured for the identity model, which is specified in the pipeline definition.
Similarly, for the identity output model, reuse the same model-settings.json file without introducing any delay.
The manifest files for the three models are the following:
To deploy the models, use the following command:
After the models are deployed, proceed to deploy the pipeline. The pipeline manifest file is defined as follows:
Note that the joinWindowMs parameter is set to 1 millisecond for both the identity loop and identity output models. This setting is essential to prevent messages from different iterations from being joined (e.g., a message from iteration t being joined with messages from iterations t-1, t-2, ..., 1). Additionally, we limit the number of step revisits to 100, the maximum number of times the pipeline can revisit a step during execution. While our pipeline behaves deterministically and is guaranteed to terminate, this parameter is especially useful in cyclic pipelines where a terminal state might not be reached (e.g., agentic workflows where control flow is determined by an LLM). It helps safeguard against infinite loops.
To deploy the pipeline, use the following command:
To send a request to the pipeline, use the following command:
This request initiates the pipeline with an input value of 0. The pipeline increments this value step by step until it reaches 10, at which point it stops. The response includes the final counter value, 10, along with a message indicating that the pipeline has terminated.
To clean up the models and the pipeline, use the following commands:
Install Helm, the package manager for Kubernetes.
Install Seldon Core 2
Install cert-manager in the namespace cert-manager.
To set up Jaeger tracing for Seldon Core 2 on Kubernetes and visualize inference traces of the Seldon Core 2 components, you need to do the following:
Create a dedicated namespace to install the Jaeger Operator and tracing resources:
The Jaeger Operator manages Jaeger instances in the Kubernetes cluster. Use the Helm chart for Jaeger v2.
Add the Jaeger Helm repository:
Create a minimal tracing-values.yaml:
Install or upgrade the Jaeger Operator in the tracing namespace:
Validate that the Jaeger Operator Pod is running:
Output is similar to:
Install a simple Jaeger custom resource in the namespace seldon-mesh, where Seldon Core 2 is running.
Create a manifest file named jaeger-simplest.yaml with these contents:
Apply the manifest:
Verify that the Jaeger all-in-one pod is running:
Output is similar to:
This simplest Jaeger CR does the following:
All-in-one pod: Deploys a single pod running the collector, agent, query service, and UI, using in-memory storage.
Core 2 integration: receives spans from Seldon Core 2 components and exposes a UI for viewing traces.
To enable tracing, configure the OpenTelemetry exporter endpoint in the SeldonRuntime resource so that traces are sent to the Jaeger collector service created by the simplest Jaeger Custom Resource. The Seldon Runtime helm chart is located here.
Find the seldonruntime Custom Resource that needs to be updated using: kubectl get seldonruntimes -n seldon-mesh
Patch your Custom Resource to include tracingConfig under spec.config using:
Output is similar to:
Check the updated .yaml file, using: kubectl get seldonruntime seldon -n seldon-mesh -o yaml
Output is similar to:
Restart the following Core 2 component Pods so they pick up the new tracing configuration from the seldon-tracing ConfigMap in the seldon-mesh namespace.
seldon-dataflow-engine
seldon-pipeline-gateway
seldon-model-gateway
seldon-scheduler
Servers
After the restart, these components read the updated tracing configuration and start emitting traces to Jaeger.
To visualize traces, send requests to your models or pipelines deployed in Seldon Core 2. Each inference request should produce a trace that shows the path through the Core 2 components such as gateways, dataflow engine, server agents in the Jaeger UI.
Port-forward the Jaeger query service to your local machine:
Open the Jaeger UI in your browser:
You can now explore traces emitted by Seldon Core 2 components.
An example Jaeger trace is shown below:
While this is a convenient way of implementing an evaluation graph with microservices, it has a few problems. The orchestrator becomes a bottleneck and a single point of failure. It also hides all the data transformations needed to translate one service's response into another service's request, so data tracing and lineage become difficult. All in all, while the Seldon platform is all about processing data, the under-the-hood implementation was still focused on the order of operations rather than on the data itself.
The realization of this disparity led to a new approach to inference graph evaluation in v2, based on the data flow paradigm. Data flow is a well-known concept in software engineering, dating back to the 1960s. In contrast to services, which model programs as control flow and focus on the order of operations, data flow models software systems as a series of connections that modify incoming data, focusing on the data flowing through the system. The particular flavor of the data flow paradigm used by v2 is known as flow-based programming (FBP). FBP defines software applications as a set of processes that exchange data via connections external to those processes. Connections are made via named ports, which promotes data coupling between the components of the system.
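To make the contrast concrete, here is a minimal, purely illustrative Python sketch (not Core 2 code): a control-flow orchestrator that knows the graph and calls each component in order, versus an FBP-style wiring in which each component only reads from and writes to named connections (queues here, Kafka topics in Core 2).

```python
import threading
from queue import Queue

# Two toy "components" that transform data.
def preprocess(x):
    return x * 2

def predict(x):
    return x + 1

# Control flow: an orchestrator knows the topology and calls each
# component in order, shuttling data between calls itself.
def orchestrate(x):
    return predict(preprocess(x))

# Data flow (FBP-style): each component is a process connected to named
# ports (here, queues) that are external to the component itself.
def run_step(fn, inport: Queue, outport: Queue):
    while True:
        item = inport.get()
        if item is None:          # sentinel to stop
            outport.put(None)
            break
        outport.put(fn(item))

if __name__ == "__main__":
    a, b, c = Queue(), Queue(), Queue()   # connections between processes
    threading.Thread(target=run_step, args=(preprocess, a, b)).start()
    threading.Thread(target=run_step, args=(predict, b, c)).start()

    a.put(3)                # data enters the graph ...
    a.put(None)
    print(c.get())          # ... and the result flows out: 7
    print(orchestrate(3))   # same result via control flow: 7
```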
Data flow design makes data in software the top priority. That is one of the key messages of the so-called "data-centric AI" idea, which is becoming increasingly popular within the ML community. Data is a key component of a successful ML project. Data needs to be discovered, described, cleaned, understood, monitored, and verified. Consequently, there is a growing demand for data-centric platforms and solutions. Making Seldon Core data-centric was one of the key goals of the Seldon Core 2 design.
In the context of Seldon Core, applying the FBP design approach means that the evaluation is implemented the same way the inference graph is defined. Instead of routing everything through a centralized orchestrator, the evaluation happens in the same graph-like manner:
As far as implementation goes, Seldon Core 2 runs on Kafka. An inference request is put onto a pipeline input topic, which triggers an evaluation. Each part of the inference graph is a service running in its own container, fronted by a model gateway. The model gateway listens to the corresponding input Kafka topic, reads data from it, calls the service, and puts the received response onto an output Kafka topic. There is also a pipeline gateway that allows interacting with Seldon Core in a synchronous manner.
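As a rough sketch of the model gateway pattern described above (an illustration, not the actual Core 2 implementation), the loop below consumes Open Inference Protocol requests from a hypothetical input topic, calls a model over HTTP, and produces the response to an output topic. The bootstrap address, topic names, and model URL are placeholders; real Core 2 topic names depend on your configured topic prefix and namespace.

```python
import json
import requests
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "seldon-kafka-bootstrap.seldon-mesh:9092"       # placeholder broker address
IN_TOPIC = "example.model.inputs"                            # placeholder topic names
OUT_TOPIC = "example.model.outputs"
MODEL_URL = "http://mlserver:9000/v2/models/example/infer"   # placeholder model endpoint

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "example-model-gateway",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": BOOTSTRAP})
consumer.subscribe([IN_TOPIC])

while True:
    msg = consumer.poll(1.0)            # read a request from the input topic
    if msg is None or msg.error():
        continue
    request = json.loads(msg.value())   # Open Inference Protocol request body
    response = requests.post(MODEL_URL, json=request).json()
    # put the model's response onto the output topic for the next step
    producer.produce(OUT_TOPIC, json.dumps(response).encode("utf-8"))
    producer.flush()
```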
This approach gives Seldon Core 2 several important features. Firstly, it natively supports both synchronous and asynchronous modes of operation. Asynchronicity is achieved via streaming: input data can be sent to an input topic in Kafka, and after the evaluation the output topic will contain the inference result. For those looking to use it in the v1 style, a synchronous service API is also provided.
Secondly, there is no single point of failure. Even if one or more nodes in the graph go down, the data will still be sitting on the streams waiting to be processed, and the evaluation resumes whenever the failed node comes back up.
Thirdly, data flow means intermediate data can be accessed at any arbitrary step of the graph, inspected and collected as necessary. Data lineage is possible throughout, which opens up opportunities for advanced monitoring and explainability use cases. This is a key feature for effective error surfacing in production environments as it allows:
Adding context from different parts of the graph to better understand a particular output
Reducing false positive rates of alerts as different slices of the data can be investigated
Enabling reproduction of results, as fine-grained lineage of the computation and the associated data transformations is tracked by design
Finally, the inference graph can now be extended by adding new nodes at arbitrary places, all without affecting pipeline execution. This kind of flexibility was not possible with v1. It also allows multiple pipelines to share common nodes, thereby optimizing resource usage.
More details and information on data-centric AI and data flow paradigm can be found in these resources:
Stanford MLSys seminar "What can Data-Centric AI Learn from Data and ML Engineering?"
A paper that explores data flow in ML deployment context
Introduction to flow based programming from its creator J.P. Morrison:
Pathways: Asynchronous Distributed Dataflow for ML, from Google, on the design and implementation of a data flow based orchestration layer for accelerators
Better understanding of data requires tracking its
Guide to integrating managed Kafka services (Confluent Cloud, Amazon MSK, Azure Event Hub) with Seldon Core 2, including security configurations and authentication setup.
Seldon recommends a managed Kafka service for production installations. You can integrate and secure your managed Kafka service with Seldon Core 2.
Some of the managed Kafka services that have been tested are:
Confluent Cloud (security: SASL/PLAIN)
Confluent Cloud (security: SASL/OAUTHBEARER)
Learn how to implement request-per-second (RPS) based autoscaling in Seldon Core 2 using Kubernetes HPA and Prometheus metrics.
Given that Seldon Core 2 is predominantly for serving ML in Kubernetes, it is possible to leverage the HorizontalPodAutoscaler (HPA) to automatically scale Kubernetes resources up and down. This requires exposing metrics so that they can be used by HPA. In this tutorial, we explain how to expose a requests-per-second (RPS) metric using Prometheus and the Prometheus Adapter, so that it can be used to autoscale Models or Servers using HPA.
The following workflow will require:
Having a Seldon Core 2 install that publishes metrics to Prometheus (the default). In the following, we assume that Prometheus is already installed and configured in the seldon-monitoring namespace.
Learn how to leverage Core 2's native autoscaling functionality for Servers
Core 2 runs with long-lived server replicas, each able to host multiple models (through Multi-Model Serving, or MMS). The server replicas can be autoscaled natively by Core 2 in response to dynamic changes in the requested number of model replicas, allowing users to seamlessly optimize the infrastructure cost associated with their deployments.
This document outlines the autoscaling policies and mechanisms that are available for autoscaling server replicas. These policies are designed to ensure that the server replicas are scaled up or down in response to changes in the number of replicas requested for each model. In other words if a given model is scaled up, the system will scale up the server replicas in order to host the new model replicas. Similarly, if a given model is scaled down, the system may scale down the number of replicas of the server hosting the model, depending on other models that are loaded on the same server replica.
kubectl get svc -n seldon-mesh

kubectl delete pod <seldon-dataflow-engine> -n seldon-mesh

kubectl create namespace seldon-mesh || echo "namespace seldon-mesh exists"

helm repo add strimzi https://strimzi.io/charts/
helm repo update

helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator --namespace seldon-mesh

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: seldon
namespace: seldon-mesh
annotations:
strimzi.io/node-pools: enabled
strimzi.io/kraft: enabled
spec:
kafka:
replicas: 3
version: 3.9.0
listeners:
- name: plain
port: 9092
tls: false
type: internal
- name: tls
port: 9093
tls: true
type: internal
config:
processMode: kraft
auto.create.topics.enable: true
default.replication.factor: 1
inter.broker.protocol.version: 3.7
min.insync.replicas: 1
offsets.topic.replication.factor: 1
transaction.state.log.min.isr: 1
transaction.state.log.replication.factor: 1
entityOperator: null

kubectl apply -f kafka.yaml -n seldon-mesh

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
name: kafka
namespace: seldon-mesh
labels:
strimzi.io/cluster: seldon
spec:
replicas: 3
roles:
- broker
- controller
resources:
requests:
cpu: '500m'
memory: '2Gi'
limits:
memory: '2Gi'
template:
pod:
tmpDirSizeLimit: 1Gi
storage:
type: jbod
volumes:
- id: 0
type: ephemeral
sizeLimit: 500Mi
kraftMetadata: shared
- id: 1
type: persistent-claim
size: 10Gi
deleteClaim: false

kubectl apply -f kafka-nodepool.yaml -n seldon-mesh

kubectl get pods -n seldon-mesh

```bash
NAME READY STATUS RESTARTS AGE
hodometer-5489f768bf-9xnmd 1/1 Running 0 25m
mlserver-0 3/3 Running 0 24m
seldon-dataflow-engine-75f9bf6d8f-2blgt 1/1 Running 5 (23m ago) 25m
seldon-envoy-7c764cc88-xg24l 1/1 Running 0 25m
seldon-kafka-0 1/1 Running 0 21m
seldon-kafka-1 1/1 Running 0 21m
seldon-kafka-2 1/1 Running 0 21m
seldon-modelgateway-54d457794-x4nzq 1/1 Running 0 25m
seldon-pipelinegateway-6957c5f9dc-6blx6 1/1 Running 0 25m
seldon-scheduler-0 1/1 Running 0 25m
seldon-v2-controller-manager-7b5df98677-4jbpp 1/1 Running 0 25m
strimzi-cluster-operator-66b5ff8bbb-qnr4l 1/1 Running 0 23m
triton-0 3/3 Running 0 24m
```

kubectl logs <seldon-dataflow-engine> -n seldon-mesh

WARN [main] org.apache.kafka.clients.ClientUtils : Couldn't resolve server seldon-kafka-bootstrap.seldon-mesh:9092 from bootstrap.servers as DNS resolution failed for seldon-kafka-bootstrap.seldon-mesh

kubectl get configmaps -n seldon-mesh

NAME DATA AGE
kube-root-ca.crt 1 38m
seldon-agent 1 30m
seldon-kafka 1 30m
seldon-kafka-0 6 26m
seldon-kafka-1 6 26m
seldon-kafka-2 6 26m
seldon-manager-config 1 30m
seldon-tracing 4 30m
strimzi-cluster-operator 1 28m

kubectl get configmap seldon-kafka -n seldon-mesh -o yaml

apiVersion: v1
data:
kafka.json: '{"bootstrap.servers":"seldon-kafka-bootstrap.seldon-mesh:9092","consumer":{"auto.offset.reset":"earliest","message.max.bytes":"1000000000","session.timeout.ms":"6000","topic.metadata.propagation.max.ms":"300000"},"producer":{"linger.ms":"0","message.max.bytes":"1000000000"},"topicPrefix":"seldon"}'
kind: ConfigMap
metadata:
creationTimestamp: "2024-12-05T07:12:57Z"
name: seldon-kafka
namespace: seldon-mesh
ownerReferences:
- apiVersion: mlops.seldon.io/v1alpha1
blockOwnerDeletion: true
controller: true
kind: SeldonRuntime
name: seldon
uid: 9e724536-2487-487b-9250-8bcd57fc52bb
resourceVersion: "778"
uid: 5c041e69-f36b-4f14-8f0d-c8790003cb3e

helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
--namespace seldon-mesh \
-f values-runtime-kafka-compression.yaml \
--install

helm upgrade --install seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
--namespace seldon-mesh \
--set controller.clusterwide=true \
--set kafka.topicPrefix=myorg \
--set kafka.consumerGroupIdPrefix=myorg

curl -k http://<INGRESS_IP>:80/v2/models/counter-pipeline/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: counter-pipeline.pipeline" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "counter-pipeline.inputs",
"datatype": "INT32",
"shape": [1],
"data": [0]
}
]
}' | jq -M .

{
"model_name": "",
"outputs": [
{
"data": [
10
],
"name": "output",
"shape": [
1,
1
],
"datatype": "INT32",
"parameters": {
"content_type": "np"
}
},
{
"data": [
""
],
"name": "stop",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
}
}
]
}

seldon pipeline infer counter-pipeline --inference-host <INGRESS_IP>:80 \
  '{"inputs":[{"name":"counter-pipeline.inputs","shape":[1],"datatype":"INT32","data":[0]}]}' | jq -M .

{
"model_name": "",
"outputs": [
{
"data": [
10
],
"name": "output",
"shape": [
1,
1
],
"datatype": "INT32",
"parameters": {
"content_type": "np"
}
},
{
"data": [
""
],
"name": "stop",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
}
}
]
}

ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"

from mlserver.model import MLModel
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.logging import logger
class Counter(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        x = NumpyCodec.decode_input(payload.inputs[0]) + 1
        message = "continue" if x.item() < 10 else "stop"
        return InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                NumpyCodec.encode_output(
                    name="output",
                    payload=x
                ),
                StringCodec.encode_output(
                    name=message,
                    payload=[""]
                ),
            ]
        )

import asyncio
from mlserver.logging import logger
from mlserver import MLModel, ModelSettings
from mlserver.types import (
InferenceRequest, InferenceResponse, ResponseOutput
)
class IdentityModel(MLModel):
    def __init__(self, settings: ModelSettings):
        super().__init__(settings)
        self.params = settings.parameters
        self.extra = self.params.extra if self.params is not None else None
        self.delay = self.extra.get("delay", 0)

    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if self.delay:
            await asyncio.sleep(self.delay)
        return InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                ResponseOutput(
                    name=request_input.name,
                    shape=request_input.shape,
                    datatype=request_input.datatype,
                    parameters=request_input.parameters,
                    data=request_input.data
                ) for request_input in payload.inputs
            ]
        )

{
"name": "counter",
"implementation": "model.Counter",
"parameters": {
"version": "v0.1.0"
}
}

{
"name": "identity-loop",
"implementation": "model.IdentityModel",
"parameters": {
"version": "v0.1.0",
"extra": {
"delay": 0.001
}
}
}

{
"name": "identity-output",
"implementation": "model.IdentityModel",
"parameters": {
"version": "v0.1.0"
}
}

cat ./models/counter.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: counter
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/counter"
requirements:
- mlserver

cat ./models/identity-loop.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: identity-loop
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-loop"
requirements:
- mlserver

cat ./models/identity-output.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: identity-output
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-output"
requirements:
- mlserver

kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: counter
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/counter"
requirements:
- mlserver
EOF

kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: identity-loop
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-loop"
requirements:
- mlserver
EOF

kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: identity-output
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-output"
requirements:
- mlserver
EOF

kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/counter condition met
model.mlops.seldon.io/identity-loop condition met
model.mlops.seldon.io/identity-output condition met

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: counter-pipeline
spec:
allowCycles: true
maxStepRevisits: 100
steps:
- name: counter
inputsJoinType: any
inputs:
- counter-pipeline.inputs
- identity-loop.outputs
- name: identity-output
joinWindowMs: 1
inputs:
- counter.outputs
triggers:
- counter.outputs.stop
- name: identity-loop
joinWindowMs: 1
inputs:
- counter.outputs.output
triggers:
- counter.outputs.continue
output:
steps:
- identity-output.outputs

kubectl create -f counter-pipeline.yaml -n seldon-mesh

pipeline.mlops.seldon.io/counter-pipeline created

kubectl delete pipeline counter-pipeline -n seldon-mesh

pipeline.mlops.seldon.io "counter-pipeline" deleted

kubectl delete model identity-loop -n seldon-mesh
kubectl delete model identity-output -n seldon-mesh
kubectl delete model counter -n seldon-mesh

model.mlops.seldon.io "identity-loop" deleted
model.mlops.seldon.io "identity-output" deleted
model.mlops.seldon.io "counter" deleted

kubectl create namespace tracing

helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

rbac:
clusterRole: true
create: true
pspEnabled: false

helm upgrade tracing jaegertracing/jaeger-operator \
--version 2.57.0 \
-f tracing-values.yaml \
-n tracing \
--install

kubectl get pods -n tracing

NAME READY STATUS RESTARTS AGE
tracing-jaeger-operator-549b79b848-h4p4d 1/1 Running 0 96s

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: simplest
namespace: seldon-mesh

kubectl apply -f jaeger-simplest.yaml

kubectl get pods -n seldon-mesh | grep simplest

NAME READY STATUS RESTARTS AGE
simplest-8686f5d96-4ptb4 1/1 Running 0 45s

kubectl patch seldonruntime seldon -n seldon-mesh \
--type merge \
-p '{"spec":{"config":{"kafkaConfig":{"bootstrap.servers":"seldon-kafka-bootstrap.seldon-mesh:9092","consumer":{"auto.offset.reset":"earliest"},"topics":{"numPartitions":4}},"scalingConfig":{"servers":{}},"tracingConfig":{"otelExporterEndpoint":"simplest-collector.seldon-mesh:4317"}}}}'seldonruntime.mlops.seldon.io/seldon patchedspec:
config:
agentConfig:
rclone: {}
kafkaConfig:
bootstrap.servers: seldon-kafka-bootstrap.seldon-mesh:9092
consumer:
auto.offset.reset: earliest
topics:
numPartitions: 4
scalingConfig:
servers: {}
serviceConfig: {}
tracingConfig:
otelExporterEndpoint: simplest-collector.seldon-mesh:4317

kubectl port-forward svc/simplest-query -n seldon-mesh 16686:16686

http://localhost:16686

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline
spec:
allowCycles: true
maxStepRevisits: 100
...

type PipelineSpec struct {
// External inputs to this pipeline, optional
Input *PipelineInput `json:"input,omitempty"`
// The steps of this inference graph pipeline
Steps []PipelineStep `json:"steps"`
// Synchronous output from this pipeline, optional
Output *PipelineOutput `json:"output,omitempty"`
// Dataflow specs
Dataflow *DataflowSpec `json:"dataflow,omitempty"`
// Allow cyclic pipeline
AllowCycles bool `json:"allowCycles,omitempty"`
// Maximum number of times a step can be revisited
MaxStepRevisits uint32 `json:"maxStepRevisits,omitempty"`
}
type DataflowSpec struct {
// Flag to indicate whether the kafka input/output topics
// should be cleaned up when the model is deleted
// Default false
CleanTopicsOnDelete bool `json:"cleanTopicsOnDelete,omitempty"`
}
// +kubebuilder:validation:Enum=inner;outer;any
type JoinType string
const (
// data must be available from all inputs
JoinTypeInner JoinType = "inner"
// data will include any data from any inputs at end of window
JoinTypeOuter JoinType = "outer"
// first data input that arrives will be forwarded
JoinTypeAny JoinType = "any"
)
type PipelineStep struct {
// Name of the step
Name string `json:"name"`
// Previous step to receive data from
Inputs []string `json:"inputs,omitempty"`
// msecs to wait for messages from multiple inputs to arrive before joining the inputs
JoinWindowMs *uint32 `json:"joinWindowMs,omitempty"`
// Map of tensor name conversions to use e.g. output1 -> input1
TensorMap map[string]string `json:"tensorMap,omitempty"`
// Triggers required to activate step
Triggers []string `json:"triggers,omitempty"`
// +kubebuilder:default=inner
InputsJoinType *JoinType `json:"inputsJoinType,omitempty"`
TriggersJoinType *JoinType `json:"triggersJoinType,omitempty"`
// Batch size of request required before data will be sent to this step
Batch *PipelineBatch `json:"batch,omitempty"`
}
type PipelineBatch struct {
Size *uint32 `json:"size,omitempty"`
WindowMs *uint32 `json:"windowMs,omitempty"`
Rolling bool `json:"rolling,omitempty"`
}
type PipelineInput struct {
// Previous external pipeline steps to receive data from
ExternalInputs []string `json:"externalInputs,omitempty"`
// Triggers required to activate inputs
ExternalTriggers []string `json:"externalTriggers,omitempty"`
// msecs to wait for messages from multiple inputs to arrive before joining the inputs
JoinWindowMs *uint32 `json:"joinWindowMs,omitempty"`
// +kubebuilder:default=inner
JoinType *JoinType `json:"joinType,omitempty"`
// +kubebuilder:default=inner
TriggersJoinType *JoinType `json:"triggersJoinType,omitempty"`
// Map of tensor name conversions to use e.g. output1 -> input1
TensorMap map[string]string `json:"tensorMap,omitempty"`
}
type PipelineOutput struct {
// Previous step to receive data from
Steps []string `json:"steps,omitempty"`
// msecs to wait for messages from multiple inputs to arrive before joining the inputs
JoinWindowMs uint32 `json:"joinWindowMs,omitempty"`
// +kubebuilder:default=inner
StepsJoin *JoinType `json:"stepsJoin,omitempty"`
// Map of tensor name conversions to use e.g. output1 -> input1
TensorMap map[string]string `json:"tensorMap,omitempty"`
}
To enable autoscaling of server replicas, the following requirements need to be met:
Setting minReplicas and maxReplicas in the Server CR. This will define the minimum and maximum number of server replicas that can be created.
Setting the autoscaling.autoscalingServerEnabled value to true (default) during installation of the Core 2 seldon-core-v2-setup helm chart. If not installing via helm, setting the ENABLE_SERVER_AUTOSCALING environment variable to true in the seldon-scheduler podSpec (via either SeldonConfig or a SeldonRuntime podSpec override) will have the same effect. This will enable the autoscaling of server replicas.
An example of a Server CR with autoscaling enabled is shown below:
When we want to scale up the number of replicas for a model, the associated server might not have enough replicas available. In this case we need to scale up the server replicas to match the number required by our models. There is currently only one policy for scaling up server replicas: scaling via Model Replica Count. This policy scales the server replicas to match the number of model replicas that are required; in other words, if a model is scaled up, the system scales up the server replicas to host the new model replicas. During the scale-up process, the system creates new server replicas with the same configuration as the existing ones (server configuration, resources, etc.), and these new replicas are added alongside the existing ones to host the new model replicas.
There is a period of time where the new server replicas are being created and the new model replicas are being loaded onto these server replicas. During this period, the system will ensure that the existing server replicas are still serving load so that there is no downtime during the scale up process. This is achieved by using partial scheduling of the new model replicas onto the new server replicas. This ensures that the new server replicas are gradually loaded with the new model replicas and that the existing server replicas are still serving load. Check the Partial Scheduling document for more details.
Once we have scaled down the number of replicas for a model, some of the corresponding server replicas might be left unused (depending on whether those replicas are hosting other models or not). If that is the case, the extra server pods might incur unnecessary infrastructure cost (especially if they have expensive resources such as GPUs attached). Scaling down servers in sync with models is not straightforward in the case of Multi-Model Serving. Scaling down one model does not necessarily mean that we also need to scale down the corresponding Server replica as this replica might be still serving load for other models. Therefore we define heuristics that can be used to scale down servers if we think that they are not properly used, described in the policies below.
There are two possible policies we use to define the scale down of Servers:
Empty Server Replica: In the simplest case we can remove a server replica if it does not host any models. This guarantees that there is no load on a particular server replica before removing it. This policy works best in the case of single model serving where the server replicas are only hosting a single model. In this case, if the model is scaled down, the server replica will be empty and can be removed.
However in the case of MMS, only reducing the number of server replicas when one of the replicas no longer hosts any models can lead to a suboptimal packing of models onto server replicas. This is because the system will not automatically pack models onto the smaller set of replicas. This can lead to more server replicas being used than necessary. This can be mitigated by the lightly loaded server replicas policy.
Lightly Loaded Server Replicas (Experimental):
Warning: This policy is experimental and is not enabled by default. It can be enabled by setting autoscaling.serverPackingEnabled to true and autoscaling.serverPackingPercentage to a value between 0 and 100. This policy is still under development and might in some cases increase latencies, so it's worth testing ahead of time to observe behavior for a given setup.
Using the above policy with MMS enabled, different model replicas may be hosted on different server replicas, and as these models are scaled up and down the system can end up in a situation where the models are not consolidated onto an optimal number of server replicas. For illustration (using generic names), take the case of 3 models, A, B and C, and 1 server with 2 replicas, Replica 1 and Replica 2, that can host these models. Assuming that initially A and B have 1 replica each and C has 2 replicas, the assignment is:
Initial assignment:
Replica 1: A, C
Replica 2: B, C
Now, if the user unloads model C, the assignment is:
Replica 1: A
Replica 2: B
There is an argument that this might not be optimal, and with MMS the assignment could be:
Replica 1: A, B
Replica 2: removed
As the system evolves this imbalance can get larger and could cause the serving infrastructure to be less optimized. The behavior above is not actually limited to autoscaling; however, autoscaling will aggravate the issue, causing more imbalance over time. This imbalance can be mitigated by making the following observation: if the maximum number of replicas of any given model assigned (logically) to a server is less than the number of replicas of that server, then we can pack the hosted models onto a smaller set of replicas. Note that in Core 2 a server replica can host only one replica of a given model.
In other words, consider the following example: for models A and B having 2 replicas each and 3 server replicas, the following assignment is potentially not optimal:
Replica 1: A, B
Replica 2: A
Replica 3: B
In this case we could trigger removal of Replica 3 for the server, which would pack the models more appropriately:
Replica 1: A, B
Replica 2: A, B
Replica 3: removed
While this heuristic packs models onto a smaller set of replicas, which allows servers to scale down, there is still a risk that the packing could increase latencies or trigger a later scale-up. Core 2 ensures consistent behavior without reverting between states. The user can also reduce the number of packing events by setting autoscaling.serverPackingPercentage to a lower value.
Currently, Core 2 triggers the packing logic only when a model replica is removed, either from a model scaling down or a model being deleted. In the future, the logic might be triggered more frequently to improve model packing onto a smaller set of replicas.
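As a minimal sketch of the packing condition above (with hypothetical model names and replica counts): if the largest per-model replica count assigned to a server is lower than the server's current replica count, the models can in principle be packed onto fewer server replicas, since a server replica hosts at most one replica of any given model.

```python
# Hypothetical assignment: model name -> number of replicas requested,
# all assigned to the same server.
model_replicas = {"model-a": 2, "model-b": 2}
server_replicas = 3

# Each server replica can host at most one replica of a given model, so the
# minimum number of server replicas needed is the largest per-model count.
min_needed = max(model_replicas.values(), default=0)

if min_needed < server_replicas:
    print(f"Packing possible: {server_replicas} -> {min_needed} server replicas")
else:
    print("Server replicas already at the minimum required")
```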
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: sklearn-mirror
spec:
default: iris
candidates:
- name: iris
weight: 100
mirror:
name: iris2
percent: 100
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: addmul10-mirror
spec:
default: pipeline-add10
resourceType: pipeline
candidates:
- name: pipeline-add10
weight: 100
mirror:
name: pipeline-mul10
percent: 100
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver
namespace: seldon
spec:
replicas: 2
minReplicas: 1
maxReplicas: 4
serverConfig: mlserver

Amazon MSK (security: SASL/SCRAM)
Azure Event Hub (security: SASL/PLAIN)
These instructions outline the necessary configurations to integrate managed Kafka services with Seldon Core 2.
You can secure Seldon Core 2 integration with managed Kafka services using:
In production settings, always set up TLS encryption with Kafka. This ensures that neither the credentials nor the payloads are transported in plaintext.
When TLS is enabled, the client needs to know the root CA certificate used to create the server's certificate. This is used to validate the certificate sent back by the Kafka server.
Create a certificate named ca.crt that is encoded as a PEM certificate. It is important that the certificate is saved as ca.crt; otherwise, Seldon Core 2 may not be able to find it. Within the cluster, you can provide the server's root CA certificate through a secret, for example a secret named kafka-broker-tls containing the ca.crt certificate.
In production environments, Kafka clusters often require authentication, especially when using managed Kafka solutions. Therefore, when installing Seldon Core 2 components, it is crucial to provide the correct credentials for a secure connection to Kafka.
The type of authentication used with Kafka varies depending on the setup but typically includes one of the following:
Simple Authentication and Security Layer (SASL): Requires a username and password.
Mutual TLS (mTLS): Involves using SSL certificates as credentials.
OAuth 2.0: Uses the client credential flow to acquire a JWT token.
These credentials are stored as Kubernetes secrets within the cluster. When setting up Seldon Core 2, you must create the appropriate secret in the correct format and update the components-values.yaml and install-values files accordingly.
When you use SASL as the authentication mechanism for Kafka, the credentials consist of a username and password pair. The password is supplied through a secret.
Create a password for Seldon Core 2 in the namespace seldon-mesh.
Values in Seldon Core 2
In Seldon Core 2 you need to specify these values in your Helm values file:
security.kafka.sasl.mechanism - SASL security mechanism, e.g. SCRAM-SHA-512
security.kafka.sasl.client.username - Kafka username
security.kafka.sasl.client.secret - Created secret with password
The resulting set of values to include in values.yaml is similar to:
The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use a well-known Certificate Authority, such as Let's Encrypt.
When you use OAuth 2.0 as the authentication mechanism for Kafka, the credentials consist of a Client ID and Client Secret, which are used with your Identity Provider to obtain JWT tokens for authenticating with Kafka brokers.
Create a Kubernetes secret kafka-oauth.yaml file.
Store the secret in the seldon-mesh namespace to configure with Seldon Core 2.
This secret must be present in seldon-logs namespace and every namespace containing Seldon Core 2 runtime.
When you use mTLS as the authentication mechanism, Kafka uses a set of certificates to authenticate the client:
A client certificate, referred to as tls.crt.
A client key, referred to as tls.key.
Here are some examples of creating secrets for managed Kafka services such as Azure Event Hub, Confluent Cloud (SASL), and Confluent Cloud (OAuth 2.0).
Prerequisites:
You must use at least the Standard tier for your Event Hub namespace because the Basic tier does not support the Kafka protocol.
Seldon Core 2 creates two Kafka topics for each model and pipeline, plus one global topic for errors. This results in a total number of topics calculated as: 2 x (number of models + number of pipelines) + 1. This topic count is likely to exceed the limit of the Standard tier in Azure Event Hub. For more information, see quota information.
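For example, applying that formula to a hypothetical deployment of 10 models and 3 pipelines:

```python
# Core 2 creates 2 topics per model, 2 per pipeline, plus 1 global error topic.
models, pipelines = 10, 3
print(2 * (models + pipelines) + 1)  # 27 topics
```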
Creating a namespace and obtaining the connection string
These are the steps that you need to perform in Azure Portal.
Create an Azure Event Hub namespace. You need to have an Azure Event Hub namespace; follow the Azure documentation to create one. Note: You do not need to create individual Event Hubs (topics), as Seldon Core 2 automatically creates all necessary topics.
Connection string for Kafka integration. To connect to Azure Event Hub using the Kafka API, you need to obtain the Kafka endpoint and the connection string. For more information, see the Azure Event Hubs documentation.
Note: Ensure you get the Connection string at the namespace level, as it is needed to dynamically create new topics. The format of the Connection string should be:
Creating secrets for Seldon Core 2 These are the steps that you need to perform in the Kubernetes cluster that run Seldon Core 2 to store the SASL password.
Create a secret named azure-kafka-secret for Seldon Core 2 in the namespace seldon. In the following command make sure to replace <password> with a password of your choice and <namespace> with the namespace from Azure Event Hub.
Create a secret named azure-kafka-secret for Seldon Core 2 in the namespace seldon-system. In the following command make sure to replace <password> with a password of your choice and <namespace> with the namespace from Azure Event Hub.
Creating API Keys
These are the steps that you need to perform in Confluent Cloud.
Navigate to Clients > New client, choose a client type (for example, Go), and generate a new Kafka cluster API key. For more information, see the Confluent Cloud documentation.
Confluent generates a configuration file with the details.
Save the values of the Key and Secret.
Confluent Cloud managed Kafka supports OAuth 2.0 to authenticate your Kafka clients. See Confluent Cloud documentation for further details.
Configuring Identity Provider
In the Confluent Cloud Console, navigate to Account & Access > Identity providers and complete these steps:
Register your Identity Provider. See the Confluent Cloud documentation for further details.
Add a new identity pool to your newly registered Identity Provider. See the Confluent Cloud documentation for further details.
To integrate Kafka with Seldon Core 2:
Update the initial configuration.
Update the initial configuration for Seldon Core 2 in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:
To enable Kafka Encryption (TLS) you need to reference the secret that you created in the security.kafka.ssl.client.secret field of the Helm chart values. The resulting set of values to include in components-values.yaml is similar to:
Change to the directory that contains the components-values.yaml file and then install Seldon Core 2 operator in the namespace seldon-system.
After you integrated Seldon Core 2 with Kafka, you need to Install an Ingress Controller that adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
Installing and configuring Prometheus Adapter, which allows prometheus queries on relevant metrics to be published as k8s custom metrics
Configuring HPA manifests to scale Models
Each Kubernetes cluster supports only one active custom metrics provider. If your cluster already uses a custom metrics provider different from prometheus-adapter, it will need to be removed before being able to scale Core 2 models and servers via HPA. The Kubernetes community is actively exploring solutions for allowing multiple custom metrics providers to coexist.
The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom or external metrics. Those can then be accessed by HPA in order to take scaling decisions.
To install through helm:
These commands install prometheus-adapter as a helm release named hpa-metrics in the same namespace where Prometheus is installed, and point to its service URL (without the port).
The URL is not fully qualified as it references a Prometheus instance running in the same namespace. If you are using a separately-managed Prometheus instance, please update the URL accordingly.
If you are running Prometheus on a different port than the default 9090, you can also pass --set prometheus.port=[custom_port] You may inspect all the options available as helm values by running helm show values prometheus-community/prometheus-adapter
Please check that the metricsRelistInterval helm value (default to 1m) works well in your setup, and update it otherwise. This value needs to be larger than or equal to your Prometheus scrape interval. The corresponding prometheus adapter command-line argument is --metrics-relist-interval. If the relist interval is set incorrectly, it will lead to some of the custom metrics being intermittently reported as missing.
We now need to configure the adapter to look for the correct prometheus metrics and compute per-model RPS values. On install, the adapter has created a ConfigMap in the same namespace as itself, named [helm_release_name]-prometheus-adapter. In our case, it will be hpa-metrics-prometheus-adapter.
Overwrite the ConfigMap as shown in the following manifest, after applying any required customizations.
Change the name if you've chosen a different value for the prometheus-adapter helm release name. Change the namespace to match the namespace where prometheus-adapter is installed.
In this example, a single rule is defined to fetch the seldon_model_infer_total metric from Prometheus, compute its per second change rate based on data within a 2 minute sliding window, and expose this to Kubernetes as the infer_rps metric, with aggregations available at model, server, inference server pod and namespace level.
When HPA requests the infer_rps metric via the custom metrics API for a specific model, prometheus-adapter issues a Prometheus query in line with what is defined in its config.
For the configuration in our example, the query for a model named irisa0 in namespace seldon-mesh would be:
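As a hedged sketch (the exact query text depends on your metricsQuery template), it would be a 2 minute rate() over seldon_model_infer_total, filtered to that model and namespace. The snippet below shows one way to execute such a query against the Prometheus HTTP API; the Prometheus URL is a placeholder.

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"  # placeholder, e.g. a port-forwarded Prometheus

# Approximation of the per-model RPS query produced by the example rule.
query = 'sum(rate(seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[2m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
print(resp.json()["data"]["result"])
```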
You may want to modify the query in the example to match the one that you typically use in your monitoring setup for RPS metrics. The example calls rate() with a 2 minute sliding window. Values scraped at the beginning and end of the 2 minute window before query time are used to compute the RPS.
It is important to sanity-check the query by executing it against your Prometheus instance. To do so, pick an existing model CR in your Seldon Core 2 install, and send some inference requests towards it. Then, wait for a period equal to at least twice the Prometheus scrape interval (Prometheus default 1 minute), so that two values from the series are captured and a rate can be computed. Finally, you can modify the model name and namespace in the query above to match the model you've picked and execute the query.
If the query result is empty, please adjust it until it consistently returns the expected metric values. Pay special attention to the window size (2 minutes in the example): if it is smaller than twice the Prometheus scrape interval, the query may return no results. A compromise needs to be reached to set the window size large enough to reject noise but also small enough to make the result responsive to quick changes in load.
Update the metricsQuery in the prometheus-adapter ConfigMap to match any query changes you have made during tests.
A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and Pipelines is available here, and those may be used when customizing the configuration.
The rule definition can be broken down in four parts:
Discovery (the seriesQuery and seriesFilters keys) controls what Prometheus metrics are considered for exposure via the k8s custom metrics API.
As an alternative to the example above, all the Seldon Prometheus metrics of the form seldon_model.*_total could be considered, followed by excluding metrics pre-aggregated across all models (.*_aggregate_.*) as well as the cumulative infer time per model (.*_seconds_total):
For RPS, we are only interested in the model inference count (seldon_model_infer_total)
Association (the resources key) controls the Kubernetes resources that a particular metric can be attached to or aggregated over.
The resources key defines an association between certain labels from the Prometheus metric and k8s resources. For example, on line 17, "model": {group: "mlops.seldon.io", resource: "model"} lets prometheus-adapter know that, for the selected Prometheus metrics, the value of the "model" label represents the name of a k8s model.mlops.seldon.io CR.
One k8s custom metric is generated for each k8s resource associated with a prometheus metric. In this way, it becomes possible to request the k8s custom metric values for models.mlops.seldon.io/iris or for servers.mlops.seldon.io/mlserver.
The labels that do not refer to a namespace resource generate "namespaced" custom metrics (the label values refer to resources which are part of a namespace) -- this distinction becomes important when needing to fetch the metrics via kubectl, and in understanding how certain Prometheus query template placeholders are replaced.
Naming (the name key) configures the naming of the k8s custom metric.
In the example ConfigMap, this is configured to take the Prometheus metric named seldon_model_infer_total and expose custom metric endpoints named infer_rps, which when called return the result of a query over the Prometheus metric. Instead of a literal match, one could also use regex group capture expressions, which can then be referenced in the custom metric name:
Querying (the metricsQuery key) defines how a request for a specific k8s custom metric gets converted into a Prometheus query.
The query can make use of the following placeholders:
For a complete reference for how prometheus-adapter can be configured via the ConfigMap, please consult the docs here.
Once you have applied any necessary customizations, replace the default prometheus-adapter config with the new one, and restart the deployment (this restart is required so that prometheus-adapter picks up the new config):
To test that the prometheus-adapter config works and everything is set up correctly, you can issue raw kubectl requests against the custom metrics API.
Listing the available metrics:
For namespaced metrics, the general template for fetching is:
For example:
Fetching model RPS metric for a specific (namespace, model) pair (seldon-mesh, irisa0):
Fetching model RPS metric aggregated at the (namespace, server) level (seldon-mesh, mlserver):
Fetching model RPS metric aggregated at the (namespace, pod) level (seldon-mesh, mlserver-0):
Fetching the same metric aggregated at namespace level (seldon-mesh):
Once metrics are exposed properly, users can use HPA to trigger autoscaling of Kubernetes resources, including custom resources such as Seldon Core's Models and Servers. Implementing autoscaling with HPA for Models, or for Models and Servers together is explained in the following pages.
This example runs you through a series of batch inference requests made to both models and pipelines running on Seldon Core locally.
Deprecated: The MLServer CLI infer feature is experimental and will be removed in future work.
If you haven't already, you'll need to install Seldon Core 2 locally before you run through this example.
First, let's jump in to the samples folder where we'll find some sample models and pipelines
we can use:
Let's take a look at a sample model before we deploy it:
The above manifest will deploy a simple scikit-learn model trained on the Iris dataset.
Let's now deploy that model using the Seldon CLI:
Now that we've deployed our iris model, let's create a pipeline around the model.
We see that this pipeline only has one step, which is to call the iris model we deployed
earlier. We can create the pipeline by running:
To demonstrate batch inference requests to different types of models, we'll also deploy a simple TensorFlow model:
The tensorflow model takes two arrays as inputs and returns two arrays as outputs. The first output is the addition of the two inputs and the second output is the value of (first input - second input).
Let's deploy the model:
Just as we did for the scikit-learn model, we'll deploy a simple pipeline for our tensorflow model:
Inspect the pipeline manifest:
and deploy it:
Once we've deployed a model or pipeline to Seldon Core, we can list them and check their status by running:
and
Your models and pipelines should be showing a state of ModelAvailable and PipelineReady respectively.
Before we run a large batch job of predictions through our models and pipelines, let's quickly
check that they work with a single standalone inference request. We can do this using the seldon model infer command.
The prediction request body needs to be an Open Inference Protocol (V2) compatible payload and also match the expected inputs for the model you've deployed. In this case,
the iris model expects data of shape [1, 4] and of type FP32.
You'll notice that the prediction results for this request come back on outputs[0].data.
You'll notice that the inputs for our tensorflow model look different from the ones we sent to the
iris model. This time, we're sending two arrays of shape [1,16]. When sending an inference request,
we can optionally choose which outputs we want back by including an {"outputs":...} object.
In the samples folder there is a batch request input file: batch-inputs/iris-input.txt. It contains
100 input payloads for our iris model. Let's take a look at the first line in that file:
To run a batch inference job we'll use the MLServer CLI. If you don't already have it installed, you can install it using:
The inference job can be executed by running the following command:
The mlserver batch component will take your input file batch-inputs/iris-input.txt, distribute
those payloads across 5 different workers (--workers 5), collect the responses and write them
to a file /tmp/iris-output.txt. For a full set of options, check out the MLServer documentation.
We can check the inference responses by looking at the contents of the output file:
We can run the same batch job for our iris pipeline and store the outputs in a different file:
We can check the inference responses by looking at the contents of the output file:
The samples folder contains an example batch input for the tensorflow model, just as it did for
the scikit-learn model. You can find it at batch-inputs/tfsimple-input.txt. Let's take a look
at the first inference request in the file:
As before, we can run the inference batch job using the mlserver infer command:
We can check the inference responses by looking at the contents of the output file:
You should get the following response:
We can check the inference responses by looking at the contents of the output file:
Now that we've run our batch examples, let's remove the models and pipelines we created:
And finally let's spin down our local instance of Seldon Core:
Learn how to monitor Seldon Core usage metrics, including request rates, latency, and resource utilization for models and pipelines.
There are various interesting system metrics about how Seldon Core 2 is used. These metrics can be recorded anonymously and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.
When provided, these metrics are used to understand the adoption of Seldon Core 2 and how you interact with it. For example, knowing how many clusters Seldon Core 2 is running on, if it is used in Kubernetes or for local development, and how many users are benefitting from features such as multi-model serving.
Hodometer is not an integral part of Seldon Core 2, but rather an independent component which connects to the public APIs of the Seldon Core 2 scheduler. If deployed in Kubernetes, it requests some basic information from the Kubernetes API.
Recorded metrics are sent to Seldon and, optionally, to any you define.
Hodometer was explicitly designed with privacy of user information and transparency of implementation in mind.
It does not record any sensitive or identifying information. For example, it has no knowledge of IP addresses, model names, or user information. All information sent to Seldon is anonymised with a completely random cluster identifier.
Hodometer supports configurable metrics levels, so you have full control over what metrics are provided to Seldon, if any.
For transparency, the implementation is fully open-source and designed to be easy to read, with metrics defined in code. See the table of metrics below for an equivalent list.
Metrics are collected as periodic snapshots a few times per day. They are lightweight to collect, coming mostly from the Seldon Core v2 scheduler, and are heavily aggregated. As such, they should have minimal impact on CPU, memory, and network consumption.
Hodometer does not store anything it records, so does not have any persistent storage. As a result, it should not be considered a replacement for tools like Prometheus.
Hodometer supports 3 different metrics levels:
Alternatively, usage metrics can be completely disabled. To do so, simply remove any existing deployment of Hodometer or disable it in the installation for your environment, discussed below.
The following environment variables control the behaviour of Hodometer, regardless of the environment it is installed in.
Hodometer is installed as a separate deployment, by default in the same namespace as the rest of the Seldon components.
Helm
If you install Seldon Core v2 with Helm, there are values corresponding to the key environment variables discussed above. These Helm values and their equivalents are provided below:
If you do not want usage metrics to be recorded, you can disable Hodometer via the hodometer.disable Helm
value when installing the runtime Helm chart. The following command disables collection of usage metrics in
fresh installations and also serves to remove Hodometer from an existing installation:
Hodometer can be instructed to publish metrics not only to Seldon, but also to any extra endpoints you specify.
This is controlled by the EXTRA_PUBLISH_URLS environment variable, which expects a comma-separated list of
HTTP-compatible URLs.
You might choose to use this for your own usage monitoring. For example, you could capture these metrics and expose them to Prometheus or another monitoring system using your own service.
Metrics are recorded in MixPanel-compatible format, which employs a highly flexible JSON schema.
For an example of how to define your own metrics listener, see the example listener in the hodometer sub-project.
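As an illustrative sketch only (not the actual hodometer receiver), a listener can be as simple as an HTTP server that accepts the JSON events Hodometer POSTs to each URL listed in EXTRA_PUBLISH_URLS:

```python
# Minimal sketch of a metrics listener for Hodometer's extra publish URLs.
# Assumes Hodometer POSTs MixPanel-style JSON events to the configured endpoint.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            events = json.loads(body)
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        # Here you could translate events into Prometheus metrics, logs, etc.
        print("received usage metrics:", events)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Hodometer would then be configured with EXTRA_PUBLISH_URLS=http://<this-host>:8000
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```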
To run this notebook you need the inference data. This can be acquired in two ways:
Run make train, or
run gsutil cp -R gs://seldon-models/scv2/examples/income/infer-data .
Show predictions from the reference set. These should show no drift and no outliers.
Show predictions from the drift data. These should show drift and probably no outliers.
Show predictions from the outlier data. These should show outliers and probably no drift.
Learn how to install and configure Seldon Core using Helm charts, including component setup and customization options.
Seldon Core 2 provides a highly configurable deployment framework that allows you to fine-tune various components using Helm configuration options. These options offer control over deployment behavior, resource management, logging, autoscaling, and model lifecycle policies to optimize the performance and scalability of machine learning deployments.
This section details the key Helm configuration parameters for Envoy, Autoscaling, Server Prestop, and Model Control Plane, ensuring that you can customize deployment workflows and enhance operational efficiency.
Envoy: Manage pre-stop behaviors and configure access logging to track request-level interactions.
Autoscaling (Experimental): Fine-tune dynamic scaling policies for efficient resource allocation based on real-time inference workloads.
Servers: Define grace periods for controlled shutdowns and optimize model control plane parameters for efficient model loading, unloading, and error handling.
Logging: Define log levels for the different components of the system.
Notes:
The Kafka client library log level is derived from the log level passed to the component, which may differ from the (syslog-style) level expected by librdkafka. In this case, the value is mapped to the closest matching level.
kubectl create secret generic kafka-broker-tls -n seldon-mesh --from-file ./ca.crt

security:
  kafka:
    ssl:
      secret:
      brokerValidationSecret: kafka-broker-tls

helm upgrade seldon-core-v2-components seldon-charts/seldon-core-v2-setup \
  --version 2.8.0 \
  -f components-values.yaml \
  --namespace seldon-system \
  --install

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/models.mlops.seldon.io/irisa0/infer_rps

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/servers.mlops.seldon.io/mlserver/infer_rps

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install --set prometheus.url='http://seldon-monitoring-prometheus' hpa-metrics prometheus-community/prometheus-adapter -n seldon-monitoring

apiVersion: v1
kind: ConfigMap
metadata:
  name: hpa-metrics-prometheus-adapter
  namespace: seldon-monitoring
data:
  config.yaml: |-
    "rules":
    -
      "seriesQuery": |
        {__name__="seldon_model_infer_total",namespace!=""}
      "resources":
        "overrides":
          "model": {group: "mlops.seldon.io", resource: "model"}
          "server": {group: "mlops.seldon.io", resource: "server"}
          "pod": {resource: "pod"}
          "namespace": {resource: "namespace"}
      "name":
        "matches": "seldon_model_infer_total"
        "as": "infer_rps"
      "metricsQuery": |
        sum by (<<.GroupBy>>) (
          rate (
            <<.Series>>{<<.LabelMatchers>>}[2m]
          )
        )

sum by (model) (
  rate (
    seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[2m]
  )
)

```yaml
"seriesQuery": |
  {__name__=~"^seldon_model.*_total",namespace!=""}
"seriesFilters":
  - "isNot": "^seldon_.*_seconds_total"
  - "isNot": "^seldon_.*_aggregate_.*"
...
```

"name":
  "matches": "^seldon_model_(.*)_total"
  "as": "${1}_rps"

- .Series is replaced by the discovered prometheus metric name (e.g. `seldon_model_infer_total`)
- .LabelMatchers, when requesting a namespaced metric for resource `X` with name `x` in namespace `n`, is replaced by `X=~"x",namespace="n"`. For example, `model=~"iris0", namespace="seldon-mesh"`. When requesting the namespace resource itself, only the `namespace="n"` is kept.
- .GroupBy is replaced by the resource type of the requested metric (e.g. `model`, `server`, `pod` or `namespace`).

# Replace default prometheus adapter config
kubectl replace -f prometheus-adapter.config.yaml
# Restart prometheus-adapter pods
kubectl rollout restart deployment hpa-metrics-prometheus-adapter -n seldon-monitoring

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq .

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/[NAMESPACE]/[API_RESOURCE_NAME]/[CR_NAME]/[METRIC_NAME]"

import numpy as np
import json
import requests

with open('./infer-data/test.npy', 'rb') as f:
    x_ref = np.load(f)
    x_h1 = np.load(f)
    y_ref = np.load(f)
    x_outlier = np.load(f)

reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"

def infer(resourceName: str, batchSz: int, requestType: str):
    if requestType == "outlier":
        rows = x_outlier[0:0+batchSz]
    elif requestType == "drift":
        rows = x_h1[0:0+batchSz]
    else:
        rows = x_ref[0:0+batchSz]
    reqJson["inputs"][0]["data"] = rows.flatten().tolist()
    reqJson["inputs"][0]["shape"] = [batchSz, rows.shape[1]]
    headers = {"Content-Type": "application/json", "seldon-model": resourceName}
    response_raw = requests.post(url, json=reqJson, headers=headers)
    print(response_raw)
    print(response_raw.json())

| Environment variable | Type | Default | Description |
|---|---|---|---|
| SCHEDULER_PORT | integer | 9004 | Port for Seldon Core v2 scheduler |
| LOG_LEVEL | string | info | Level of detail for application logs |
| Metric | Level | Type | Description |
|---|---|---|---|
| is_kubernetes | cluster | Boolean | Whether or not the installation is in Kubernetes |
| kubernetes_version | cluster | Version number | Kubernetes server version, if inside Kubernetes |
| node_count | cluster | Integer | Number of nodes in the cluster, if inside Kubernetes |
| model_count | resource | Integer | Number of Model resources |
| pipeline_count | resource | Integer | Number of Pipeline resources |
| experiment_count | resource | Integer | Number of Experiment resources |
| server_count | resource | Integer | Number of Server resources |
| server_replica_count | resource | Integer | Total number of Server resource replicas |
| multimodel_enabled_count | feature | Integer | Number of Server resources with multi-model serving enabled |
| overcommit_enabled_count | feature | Integer | Number of Server resources with overcommitting enabled |
| gpu_enabled_count | feature | Integer | Number of Server resources with GPUs attached |
| inference_server_name | feature | String | Name of inference server, e.g. MLServer or Triton |
| server_cpu_cores_sum | feature | Float | Total of CPU limits across all Server resource replicas, in cores |
| server_memory_gb_sum | feature | Float | Total of memory limits across all Server resource replicas, in GiB |

| Level | Description |
|---|---|
| Cluster | Basic information about the Seldon Core v2 installation |
| Resource | High-level information about which Seldon Core v2 resources are used |
| Feature | More detailed information about how resources are used and whether or not certain feature flags are enabled |
| Environment variable | Type | Example / default | Description |
|---|---|---|---|
| METRICS_LEVEL | string | feature | Level of detail for recorded metrics; one of feature, resource, or cluster |
| EXTRA_PUBLISH_URLS | comma-separated list of URLs | http://<my-endpoint-1>:8000,http://<my-endpoint-2>:8000 | Additional endpoints to publish metrics to |
| SCHEDULER_HOST | string | seldon-scheduler | Hostname for Seldon Core v2 scheduler |

| Helm value | Environment variable |
|---|---|
| hodometer.metricsLevel | METRICS_LEVEL |
| hodometer.extraPublishUrls | EXTRA_PUBLISH_URLS |
| hodometer.logLevel | LOG_LEVEL |
| Metric | Level | Type | Description |
|---|---|---|---|
| cluster_id | cluster | UUID | A random identifier for this cluster for de-duplication |
| seldon_core_version | cluster | Version number | E.g. 1.2.3 |
| is_global_installation | cluster | Boolean | Whether installation is global or namespaced |
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/pods/mlserver-0/infer_rps

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/metrics/infer_rps

cd samples/

cat models/sklearn-iris-gs.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
memory: 100Ki

seldon model load -f models/sklearn-iris-gs.yaml

cat pipelines/iris.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: iris-pipeline
spec:
steps:
- name: iris
output:
steps:
- iris

seldon pipeline load -f pipelines/iris.yaml

cat models/tfsimple1.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
seldon model load -f models/tfsimple1.yaml

cat pipelines/tfsimple.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
seldon pipeline load -f pipelines/tfsimple.yaml

seldon model list

seldon pipeline list

seldon model infer iris '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq

{
"model_name": "iris_1",
"model_version": "1",
"id": "a67233c2-2f8c-4fbc-a87e-4e4d3d034c9f",
"parameters": {
"content_type": null,
"headers": null
},
"outputs": [
{
"name": "predict",
"shape": [
1
],
"datatype": "INT64",
"parameters": null,
"data": [
2
]
}
]
}
seldon pipeline infer iris-pipeline '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq

{
"model_name": "",
"outputs": [
{
"data": [
2
],
"name": "predict",
"shape": [
1
],
"datatype": "INT64"
}
]
}
seldon model infer tfsimple1 '{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq

{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
]
}

seldon pipeline infer tfsimple '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq

{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat batch-inputs/iris-input.txt | head -n 1 | jq

{
"inputs": [
{
"name": "predict",
"data": [
0.38606369295833043,
0.006894049558299753,
0.6104082981607108,
0.3958954239450676
],
"datatype": "FP64",
"shape": [
1,
4
]
}
]
}
pip install mlserver

mlserver infer -u localhost:9000 -m iris -i batch-inputs/iris-input.txt -o /tmp/iris-output.txt --workers 5

2023-01-22 18:24:17,272 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-22 18:24:17,273 [mlserver] INFO - server url: localhost:9000
2023-01-22 18:24:17,273 [mlserver] INFO - model name: iris
2023-01-22 18:24:17,273 [mlserver] INFO - request headers: {}
2023-01-22 18:24:17,273 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-01-22 18:24:17,273 [mlserver] INFO - output file path: /tmp/iris-output.txt
2023-01-22 18:24:17,273 [mlserver] INFO - workers: 5
2023-01-22 18:24:17,273 [mlserver] INFO - retries: 3
2023-01-22 18:24:17,273 [mlserver] INFO - batch interval: 0.0
2023-01-22 18:24:17,274 [mlserver] INFO - batch jitter: 0.0
2023-01-22 18:24:17,274 [mlserver] INFO - connection timeout: 60
2023-01-22 18:24:17,274 [mlserver] INFO - micro-batch size: 1
2023-01-22 18:24:17,420 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-22 18:24:17,421 [mlserver] INFO - Total processed instances: 100
2023-01-22 18:24:17,421 [mlserver] INFO - Time taken: 0.15 seconds
cat /tmp/iris-output.txt | head -n 1 | jq

mlserver infer -u localhost:9000 -m iris-pipeline.pipeline -i batch-inputs/iris-input.txt -o /tmp/iris-pipeline-output.txt --workers 5

2023-01-22 18:25:18,651 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-22 18:25:18,653 [mlserver] INFO - server url: localhost:9000
2023-01-22 18:25:18,653 [mlserver] INFO - model name: iris-pipeline.pipeline
2023-01-22 18:25:18,653 [mlserver] INFO - request headers: {}
2023-01-22 18:25:18,653 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-01-22 18:25:18,653 [mlserver] INFO - output file path: /tmp/iris-pipeline-output.txt
2023-01-22 18:25:18,653 [mlserver] INFO - workers: 5
2023-01-22 18:25:18,653 [mlserver] INFO - retries: 3
2023-01-22 18:25:18,653 [mlserver] INFO - batch interval: 0.0
2023-01-22 18:25:18,653 [mlserver] INFO - batch jitter: 0.0
2023-01-22 18:25:18,653 [mlserver] INFO - connection timeout: 60
2023-01-22 18:25:18,653 [mlserver] INFO - micro-batch size: 1
2023-01-22 18:25:18,963 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-22 18:25:18,963 [mlserver] INFO - Total processed instances: 100
2023-01-22 18:25:18,963 [mlserver] INFO - Time taken: 0.31 seconds

cat /tmp/iris-pipeline-output.txt | head -n 1 | jq

cat batch-inputs/tfsimple-input.txt | head -n 1 | jq

{
"inputs": [
{
"name": "INPUT0",
"data": [
75,
39,
9,
44,
32,
97,
99,
40,
13,
27,
25,
36,
18,
77,
62,
60
],
"datatype": "INT32",
"shape": [
1,
16
]
},
{
"name": "INPUT1",
"data": [
39,
7,
14,
58,
13,
88,
98,
66,
97,
57,
49,
3,
49,
63,
37,
12
],
"datatype": "INT32",
"shape": [
1,
16
]
}
]
}

mlserver infer -u localhost:9000 -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-output.txt --workers 10
2023-01-23 14:56:10,870 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-23 14:56:10,872 [mlserver] INFO - server url: localhost:9000
2023-01-23 14:56:10,872 [mlserver] INFO - model name: tfsimple1
2023-01-23 14:56:10,872 [mlserver] INFO - request headers: {}
2023-01-23 14:56:10,872 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-01-23 14:56:10,872 [mlserver] INFO - output file path: /tmp/tfsimple-output.txt
2023-01-23 14:56:10,872 [mlserver] INFO - workers: 10
2023-01-23 14:56:10,872 [mlserver] INFO - retries: 3
2023-01-23 14:56:10,872 [mlserver] INFO - batch interval: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - batch jitter: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - connection timeout: 60
2023-01-23 14:56:10,872 [mlserver] INFO - micro-batch size: 1
2023-01-23 14:56:11,077 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-23 14:56:11,077 [mlserver] INFO - Total processed instances: 100
2023-01-23 14:56:11,078 [mlserver] INFO - Time taken: 0.21 seconds

cat /tmp/tfsimple-output.txt | head -n 1 | jq

{
"model_name": "tfsimple1_1",
"model_version": "1",
"id": "54e6c237-8356-4c3c-96b5-2dca4596dbe9",
"parameters": {
"batch_index": 0,
"inference_id": "54e6c237-8356-4c3c-96b5-2dca4596dbe9"
},
"outputs": [
{
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
114,
46,
23,
102,
45,
185,
197,
106,
110,
84,
74,
39,
67,
140,
99,
72
]
},
{
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
36,
32,
-5,
-14,
19,
9,
1,
-26,
-84,
-30,
-24,
33,
-31,
14,
25,
48
]
}
]
}

mlserver infer -u localhost:9000 -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-pipeline-output.txt --workers 10

2023-01-23 14:56:10,870 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-23 14:56:10,872 [mlserver] INFO - server url: localhost:9000
2023-01-23 14:56:10,872 [mlserver] INFO - model name: tfsimple1
2023-01-23 14:56:10,872 [mlserver] INFO - request headers: {}
2023-01-23 14:56:10,872 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-01-23 14:56:10,872 [mlserver] INFO - output file path: /tmp/tfsimple-pipeline-output.txt
2023-01-23 14:56:10,872 [mlserver] INFO - workers: 10
2023-01-23 14:56:10,872 [mlserver] INFO - retries: 3
2023-01-23 14:56:10,872 [mlserver] INFO - batch interval: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - batch jitter: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - connection timeout: 60
2023-01-23 14:56:10,872 [mlserver] INFO - micro-batch size: 1
2023-01-23 14:56:11,077 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-23 14:56:11,077 [mlserver] INFO - Total processed instances: 100
2023-01-23 14:56:11,078 [mlserver] INFO - Time taken: 0.25 seconds
cat /tmp/tfsimple-pipeline-output.txt | head -n 1 | jq

seldon model unload iris
seldon model unload tfsimple1
seldon pipeline unload iris-pipeline
seldon pipeline unload tfsimple

cd ../ && make undeploy-local

helm install seldon-v2-runtime k8s/helm-charts/seldon-core-v2-runtime \
  --namespace seldon-mesh \
  --set hodometer.disable=true

cat ../../models/income-preprocess.yaml
echo "---"
cat ../../models/income.yaml
echo "---"
cat ../../models/income-drift.yaml
echo "---"
cat ../../models/income-outlier.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-preprocess
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/preprocessor"
requirements:
- sklearn
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
requirements:
- sklearn
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-drift
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/drift-detector"
requirements:
- mlserver
- alibi-detect
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-outlier
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/outlier-detector"
requirements:
- mlserver
- alibi-detect
kubectl apply -f ../../models/income-preprocess.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/income.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/income-drift.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/income-outlier.yaml -n ${NAMESPACE}

model.mlops.seldon.io/income-preprocess created
model.mlops.seldon.io/income created
model.mlops.seldon.io/income-drift created
model.mlops.seldon.io/income-outlier created

kubectl wait --for condition=ready --timeout=300s model income-preprocess -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model income -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model income-drift -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model income-outlier -n ${NAMESPACE}

model.mlops.seldon.io/income-preprocess condition met
model.mlops.seldon.io/income condition met
model.mlops.seldon.io/income-drift condition met
model.mlops.seldon.io/income-outlier condition met

seldon model load -f ../../models/income-preprocess.yaml
seldon model load -f ../../models/income.yaml
seldon model load -f ../../models/income-drift.yaml
seldon model load -f ../../models/income-outlier.yaml

{}
{}
{}
{}

seldon model status income-preprocess -w ModelAvailable | jq .
seldon model status income -w ModelAvailable | jq .
seldon model status income-drift -w ModelAvailable | jq .
seldon model status income-outlier -w ModelAvailable | jq .

{}
{}
{}
{}

cat ../../pipelines/income.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: income-production
spec:
steps:
- name: income
- name: income-preprocess
- name: income-outlier
inputs:
- income-preprocess
- name: income-drift
batch:
size: 20
output:
steps:
- income
- income-outlier.outputs.is_outlier
kubectl apply -f ../../pipelines/income.yaml -n ${NAMESPACE}

pipeline.mlops.seldon.io/income created

kubectl wait --for condition=ready --timeout=300s pipelines income -n ${NAMESPACE}

pipeline.mlops.seldon.io/choice condition met

seldon pipeline load -f ../../pipelines/income.yaml

seldon pipeline status income-production -w PipelineReady | jq -M .

{
"pipelineName": "income-production",
"versions": [
{
"pipeline": {
"name": "income-production",
"uid": "cifej8iufmbc73e5int0",
"version": 1,
"steps": [
{
"name": "income"
},
{
"name": "income-drift",
"batch": {
"size": 20
}
},
{
"name": "income-outlier",
"inputs": [
"income-preprocess.outputs"
]
},
{
"name": "income-preprocess"
}
],
"output": {
"steps": [
"income.outputs",
"income-outlier.outputs.is_outlier"
]
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 1,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-06-30T14:41:38.343754921Z",
"modelsReady": true
}
}
]
}
batchSz=20
print(y_ref[0:batchSz])
infer("income-production.pipeline",batchSz,"normal")[0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
<Response [200]>
{'model_name': '', 'outputs': [{'data': [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect income-production.income-drift.outputs.is_drift

seldon.default.model.income-drift.outputs cifej9gfh5ss738i5br0 {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
batchSz=20
print(y_ref[0:batchSz])
infer("income-production.pipeline",batchSz,"drift")[0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
<Response [200]>
{'model_name': '', 'outputs': [{'data': [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect income-production.income-drift.outputs.is_drift

seldon.default.model.income-drift.outputs cifejaofh5ss738i5brg {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["1"]}}
batchSz=20
print(y_ref[0:batchSz])
infer("income-production.pipeline",batchSz,"outlier")[0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
<Response [200]>
{'model_name': '', 'outputs': [{'data': [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect income-production.income-drift.outputs.is_drift

seldon.default.model.income-drift.outputs cifejb8fh5ss738i5bs0 {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
cat ../../models/income-explainer.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
explainer:
type: anchor_tabular
modelRef: income
kubectl apply -f ../../models/income-explainer.yaml -n ${NAMESPACE}

pipeline.mlops.seldon.io/income-explainer created

kubectl wait --for condition=ready --timeout=300s pipelines income-explainer -n ${NAMESPACE}

pipeline.mlops.seldon.io/income-explainer condition met

seldon model load -f ../../models/income-explainer.yaml

{}

seldon model status income-explainer -w ModelAvailable | jq .

{}

batchSz=1
print(y_ref[0:batchSz])
infer("income-explainer",batchSz,"normal")[0]
<Response [200]>
{'model_name': 'income-explainer_1', 'model_version': '1', 'id': 'cdd68ba5-c569-4930-886f-fbdc26e24866', 'parameters': {}, 'outputs': [{'name': 'explanation', 'shape': [1, 1], 'datatype': 'BYTES', 'parameters': {'content_type': 'str'}, 'data': ['{"meta": {"name": "AnchorTabular", "type": ["blackbox"], "explanations": ["local"], "params": {"seed": 1, "disc_perc": [25, 50, 75], "threshold": 0.95, "delta": 0.1, "tau": 0.15, "batch_size": 100, "coverage_samples": 10000, "beam_size": 1, "stop_on_first": false, "max_anchor_size": null, "min_samples_start": 100, "n_covered_ex": 10, "binary_cache_size": 10000, "cache_margin": 1000, "verbose": false, "verbose_every": 1, "kwargs": {}}, "version": "0.9.1"}, "data": {"anchor": ["Marital Status = Never-Married", "Relationship = Own-child", "Capital Gain <= 0.00"], "precision": 0.9942028985507246, "coverage": 0.0657, "raw": {"feature": [3, 5, 8], "mean": [0.7914951989026063, 0.9400749063670412, 0.9942028985507246], "precision": [0.7914951989026063, 0.9400749063670412, 0.9942028985507246], "coverage": [0.3043, 0.069, 0.0657], "examples": [{"covered_true": [[30, 0, 1, 1, 0, 1, 1, 0, 0, 0, 50, 2], [49, 4, 2, 1, 6, 0, 4, 1, 0, 0, 60, 9], [39, 2, 5, 1, 5, 0, 4, 1, 0, 0, 40, 9], [33, 4, 2, 1, 5, 0, 4, 1, 0, 0, 40, 9], [63, 4, 1, 1, 8, 1, 4, 0, 0, 0, 40, 9], [23, 4, 1, 1, 7, 1, 4, 1, 0, 0, 66, 8], [45, 4, 1, 1, 8, 0, 1, 1, 0, 0, 40, 1], [54, 4, 1, 1, 8, 4, 4, 1, 0, 0, 45, 9], [32, 6, 1, 1, 8, 4, 2, 0, 0, 0, 30, 9], [40, 5, 1, 1, 2, 0, 4, 1, 0, 0, 40, 9]], "covered_false": [[57, 4, 5, 1, 5, 0, 4, 1, 0, 1977, 45, 9], [53, 0, 5, 1, 0, 1, 4, 0, 8614, 0, 35, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [53, 4, 5, 1, 8, 0, 4, 1, 0, 1977, 55, 9], [35, 4, 1, 1, 8, 0, 4, 1, 7688, 0, 50, 9], [32, 4, 1, 1, 5, 1, 4, 1, 0, 0, 40, 9], [42, 4, 1, 1, 5, 0, 4, 1, 99999, 0, 40, 9], [32, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 50, 9], [53, 7, 5, 1, 8, 0, 4, 1, 0, 0, 42, 9], [52, 1, 1, 1, 8, 0, 4, 1, 0, 0, 45, 9]], "uncovered_true": [], "uncovered_false": []}, {"covered_true": [[52, 7, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [27, 4, 1, 1, 8, 3, 4, 1, 0, 0, 40, 9], [28, 4, 1, 1, 6, 3, 4, 1, 0, 0, 60, 9], [46, 6, 5, 1, 2, 3, 4, 1, 0, 0, 50, 9], [53, 2, 5, 1, 5, 3, 2, 0, 0, 1669, 35, 9], [27, 4, 5, 1, 8, 3, 4, 0, 0, 0, 40, 9], [25, 4, 1, 1, 8, 3, 4, 0, 0, 0, 40, 9], [29, 6, 5, 1, 2, 3, 4, 1, 0, 0, 30, 9], [64, 0, 1, 1, 0, 3, 4, 1, 0, 0, 50, 9], [63, 0, 5, 1, 0, 3, 4, 1, 0, 0, 30, 9]], "covered_false": [[50, 5, 1, 1, 8, 3, 4, 1, 15024, 0, 60, 9], [45, 6, 1, 1, 6, 3, 4, 1, 14084, 0, 45, 9], [37, 4, 1, 1, 8, 3, 4, 1, 15024, 0, 40, 9], [33, 4, 1, 1, 8, 3, 4, 1, 15024, 0, 60, 9], [41, 6, 5, 1, 8, 3, 4, 1, 7298, 0, 70, 9], [42, 6, 1, 1, 2, 3, 4, 1, 15024, 0, 60, 9]], "uncovered_true": [], "uncovered_false": []}, {"covered_true": [[41, 4, 1, 1, 1, 3, 4, 1, 0, 0, 40, 9], [55, 2, 5, 1, 8, 3, 4, 1, 0, 0, 50, 9], [35, 4, 5, 1, 5, 3, 4, 0, 0, 0, 32, 9], [31, 4, 1, 1, 2, 3, 4, 1, 0, 0, 40, 9], [47, 4, 1, 1, 1, 3, 4, 1, 0, 0, 40, 9], [33, 4, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [58, 0, 1, 1, 0, 3, 4, 0, 0, 0, 50, 9], [44, 6, 1, 1, 2, 3, 4, 1, 0, 0, 90, 9], [30, 4, 1, 1, 6, 3, 4, 1, 0, 0, 40, 9], [25, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9]], "covered_false": [], "uncovered_true": [], "uncovered_false": []}], "all_precision": 0, "num_preds": 1000000, "success": true, "names": ["Marital Status = Never-Married", "Relationship = Own-child", "Capital Gain <= 0.00"], "prediction": [0], "instance": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], "instances": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 
9.0]]}}}']}]}
kubectl delete -f ../../pipelines/income-production.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income-preprocess.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income-drift.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income-outlier.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income-explainer.yaml -n ${NAMESPACE}

seldon pipeline unload income-production
seldon model unload income-preprocess
seldon model unload income
seldon model unload income-drift
seldon model unload income-outlier
seldon model unload income-explainer

---
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
name: mlserver
spec:
podSpec:
terminationGracePeriodSeconds: 120
serviceAccountName: seldon-server
containers:
- image: rclone:latest
imagePullPolicy: IfNotPresent
name: rclone
command:
- rclone
args:
- rcd
- --rc-no-auth
- --config=/rclone/rclone.conf
- --rc-addr=0.0.0.0:5572
- --max-buffer-memory=$(MAX_BUFFER_MEMORY)
env:
- name: MAX_BUFFER_MEMORY
value: "64M"
ports:
- containerPort: 5572
name: rclone
protocol: TCP
lifecycle:
preStop:
httpGet:
port: 9007
path: terminate
resources:
limits:
memory: '256M'
requests:
cpu: "200m"
memory: '100M'
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 5
successThreshold: 1
tcpSocket:
port: 5572
timeoutSeconds: 1
livenessProbe:
failureThreshold: 1
initialDelaySeconds: 5
periodSeconds: 5
successThreshold: 1
exec:
command:
- rclone
- rc
- rc/noop
timeoutSeconds: 1
volumeMounts:
- mountPath: /mnt/agent
name: mlserver-models
- image: agent:latest
imagePullPolicy: IfNotPresent
command:
- /bin/agent
args:
- --tracing-config-path=/mnt/tracing/tracing.json
name: agent
env:
- name: SELDON_SERVER_CAPABILITIES
value: "mlserver,alibi-detect,alibi-explain,huggingface,lightgbm,mlflow,python,sklearn,spark-mlib,xgboost"
- name: SELDON_OVERCOMMIT_PERCENTAGE
value: "10"
- name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD
value: "30"
- name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD
value: "600"
- name: SELDON_SCALING_STATS_PERIOD_SECONDS
value: "20"
- name: SELDON_SERVER_HTTP_PORT
value: "9000"
- name: SELDON_SERVER_GRPC_PORT
value: "9500"
- name: SELDON_REVERSE_PROXY_HTTP_PORT
value: "9001"
- name: SELDON_REVERSE_PROXY_GRPC_PORT
value: "9501"
- name: SELDON_SCHEDULER_HOST
value: "seldon-scheduler"
- name: SELDON_SCHEDULER_PORT
value: "9005"
- name: SELDON_SCHEDULER_TLS_PORT
value: "9055"
- name: SELDON_METRICS_PORT
value: "9006"
- name: SELDON_DRAINER_PORT
value: "9007"
- name: SELDON_READINESS_PORT
value: "9008"
- name: AGENT_TLS_SECRET_NAME
value: ""
- name: AGENT_TLS_FOLDER_PATH
value: ""
- name: SELDON_SERVER_TYPE
value: "mlserver"
- name: SELDON_ENVOY_HOST
value: "seldon-mesh"
- name: SELDON_ENVOY_PORT
value: "80"
- name: SELDON_LOG_LEVEL
value: "warn"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: MEMORY_REQUEST
valueFrom:
resourceFieldRef:
containerName: mlserver
resource: requests.memory
- name: SELDON_USE_DEPLOYMENTS_FOR_SERVERS
value: "false"
ports:
- containerPort: 9501
name: grpc
protocol: TCP
- containerPort: 9001
name: http
protocol: TCP
- containerPort: 9006
name: metrics
protocol: TCP
- containerPort: 9008
name: readiness-port
lifecycle:
preStop:
httpGet:
port: 9007
path: terminate
readinessProbe:
httpGet:
path: /ready
port: 9008
failureThreshold: 1
periodSeconds: 5
startupProbe:
httpGet:
path: /ready
port: 9008
failureThreshold: 60
periodSeconds: 15
resources:
requests:
cpu: "500m"
memory: '500M'
volumeMounts:
- mountPath: /mnt/agent
name: mlserver-models
- name: config-volume
mountPath: /mnt/config
- name: tracing-config-volume
mountPath: /mnt/tracing
- image: mlserver:latest
imagePullPolicy: IfNotPresent
env:
- name: MLSERVER_HTTP_PORT
value: "9000"
- name: MLSERVER_GRPC_PORT
value: "9500"
- name: MLSERVER_MODELS_DIR
value: "/mnt/agent/models"
- name: MLSERVER_PARALLEL_WORKERS
value: "1"
- name: MLSERVER_LOAD_MODELS_AT_STARTUP
value: "false"
- name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
value: "1048576000" # 100MB (100 * 1024 * 1024)
resources:
requests:
cpu: 1
memory: '1G'
lifecycle:
preStop:
httpGet:
port: 9007
path: terminate
livenessProbe:
httpGet:
path: /v2/health/live
port: server-http
readinessProbe:
httpGet:
path: /v2/health/live
port: server-http
initialDelaySeconds: 5
periodSeconds: 5
startupProbe:
httpGet:
path: /v2/health/live
port: server-http
failureThreshold: 10
periodSeconds: 10
name: mlserver
ports:
- containerPort: 9500
name: server-grpc
protocol: TCP
- containerPort: 9000
name: server-http
protocol: TCP
- containerPort: 8082
name: server-metrics
volumeMounts:
- mountPath: /mnt/agent
name: mlserver-models
readOnly: true
- mountPath: /mnt/certs
name: downstream-ca-certs
readOnly: true
securityContext:
fsGroup: 2000
runAsUser: 1000
runAsNonRoot: true
volumes:
- name: config-volume
configMap:
name: seldon-agent
- name: tracing-config-volume
configMap:
name: seldon-tracing
- name: downstream-ca-certs
secret:
secretName: seldon-downstream-server
optional: true
volumeClaimTemplates:
- name: mlserver-models
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi
security.kafka.ssl.client.brokerValidationSecret - Certificate Authority of Kafka Brokers
Values in Seldon Core 2
In Seldon Core 2 you need to specify these values:
security.kafka.sasl.mechanism - set to OAUTHBEARER
security.kafka.sasl.client.secret - Created secret with client credentials
security.kafka.ssl.client.brokerValidationSecret - Certificate Authority of Kafka brokers
The resulting set of values in components-values.yaml is similar to:
The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use well known Certificate Authority such as Let's Encrypt.
ca.crt. These certificates are expected to be encoded as PEM certificates and are provided through a secret, which can be created in the namespace seldon:
This secret must be present in seldon-logs namespace and every namespace containing Seldon Core 2 runtime.
Ensure that the fields used within the secret follow the same naming convention: tls.crt, tls.key and ca.crt. Otherwise, Seldon Core 2 may not be able to find the correct set of certificates.
Reference these certificates within the corresponding Helm values for the Seldon Core 2 installation.
Values for Seldon Core 2: in Seldon Core 2 you need to specify these values:
security.kafka.ssl.client.secret - Secret name containing client certificates
security.kafka.ssl.client.brokerValidationSecret - Certificate Authority of Kafka Brokers
The resulting set of values in values.yaml is similar to:
The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use well known Certificate Authority such as Let's Encrypt.
bootstrap.servers

Creating secrets for Seldon Core 2: these are the steps that you need to perform in the Kubernetes cluster that runs Seldon Core 2 to store the SASL password.
Create a secret named confluent-kafka-sasl for Seldon Core 2 in the namespace seldon. In the following command, make sure to replace <password> with the value of the Secret that you generated in Confluent Cloud.
Cluster ID: Cluster Overview → Cluster Settings → General → Identification
Identity Pool ID: Accounts & access → Identity providers → .
Obtain these details from your identity provider, such as Keycloak or Azure AD.
Client ID
Client secret
Token Endpoint URL
If you are using Azure AD you may need to set scope: api://<client id>/.default.
Creating Kubernetes secret
Create Kubernetes secrets to store the required client credentials. For example, create a kafka-secret.yaml file by replacing the values of <client id>, <client secret>, <token endpoint url>, <cluster id>,<identity pool id> with the values that you obtained from Confluent Cloud and your identity provider.
Provide the secret named confluent-kafka-oauth in the seldon namespace to configure with Seldon Core 2.
This secret must be present in seldon-logs namespace and every namespace containing Seldon Core 2 runtime.
Replace <username> with the value of the Key that you generated in Confluent Cloud, and <confluent-endpoints> with the value of bootstrap.server that you generated in Confluent Cloud.
Update the initial configuration for Seldon Core 2 Operator in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:
Replace <confluent-endpoints> with the value of bootstrap.server that you generated in Confluent Cloud. Update the initial configuration for the Seldon Core 2 Operator in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:
| Helm value | Chart | Description | Default |
|---|---|---|---|
| agent.modelInferenceLagThreshold | components | Queue lag threshold to trigger scaling up of a model replica. | 30 |
| agent.modelInactiveSecondsThreshold | components | Period with no activity after which to trigger scaling down of a model replica. | 600 |
| autoscaling.serverPackingEnabled | components | Whether packing of models onto fewer servers is enabled. | false |
| autoscaling.serverPackingPercentage | components | Percentage of events where packing is allowed. Higher values represent more aggressive packing. Only used when serverPackingEnabled is set. Range is 0.0 to 1.0. | 0.0 |
| agent.maxUnloadElapsedTimeMinutes | components | Max time allowed for one model unload command for a model on a particular server replica. Lower values allow errors to be exposed faster. | 15 |
| agent.maxUnloadRetryCount | components | Max number of retries for an unsuccessful unload command for a model on a particular server replica. Lower values allow control plane commands to fail faster. | 5 |
| agent.unloadGracePeriodSeconds | components | A period guarding against race conditions between Envoy applying the cluster change to remove a route and the model replica unload command proceeding. | 2 |
| dataflow.logLevelKafka | components | check klogging level | |
| scheduler.logLevel | components | check logrus log level | |
| modelgateway.logLevel | components | check logrus log level | |
| pipelinegateway.logLevel | components | check logrus log level | |
| hodometer.logLevel | components | check logrus log level | |
| serverConfig.rclone.logLevel | components | check rclone log-level | |
| serverConfig.agent.logLevel | components | check logrus log level | |
| envoy.preStopSleepPeriodSeconds | components | Sleep after calling the prestop command. | 30 |
| envoy.terminationGracePeriodSeconds | components | Grace period to wait for prestop to finish for Envoy pods. | 120 |
| envoy.enableAccesslog | components | Whether to enable logging of requests. | true |
| envoy.accesslogPath | components | Path on disk to store the logfile. Only used when enableAccesslog is set. | /tmp/envoy-accesslog.txt |
| envoy.includeSuccessfulRequests | components | Whether to include successful requests. If set to false, only failed requests are logged. Only used when enableAccesslog is set. | false |
| autoscaling.autoscalingModelEnabled | components | Enable native autoscaling for Models. This is orthogonal to external autoscaling services, e.g. HPA. | false |
| autoscaling.autoscalingServerEnabled | components | Enable native autoscaling for Servers. This is orthogonal to external autoscaling services, e.g. HPA. | true |
| agent.scalingStatsPeriodSeconds | components | Sampling rate for metrics used for autoscaling. | 20 |
| serverConfig.terminationGracePeriodSeconds | components | Grace period to wait for the prestop process to finish for this particular Server pod. | 120 |
| agent.overcommitPercentage | components | Overcommit percentage (of memory) allowed. Range is 0 to 100. | 10 |
| agent.maxLoadElapsedTimeMinutes | components | Max time allowed for one model load command for a model on a particular server replica. Lower values allow errors to be exposed faster. | 120 |
| agent.maxLoadRetryCount | components | Max number of retries for an unsuccessful load command for a model on a particular server replica. Lower values allow control plane commands to fail faster. | 5 |
| logging.logLevel | components | Component-wide setting for logging level, used if individual component levels are not set. Options: debug, info, error. | info |
| controller.logLevel | components | check zap log level here | |
| dataflow.logLevel | components | check klogging level | |
Once metrics for custom autoscaling are configured (see the previous section on exposing custom metrics), Kubernetes resources, including Models, can be autoscaled using HPA by applying an HPA manifest that targets the chosen scaling metric.
Consider a model named irisa0 with a manifest like the one sketched below. Note that we don't set minReplicas/maxReplicas, in order to disable Seldon's inference-lag-based autoscaling so that it doesn't interact with HPA (separate minReplicas/maxReplicas values will be set on the HPA side).
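A minimal sketch of such a manifest is shown below; the storageUri, memory, and namespace are illustrative placeholders reused from the iris example earlier on this page, and the important detail is that spec.replicas is set explicitly:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: irisa0
  namespace: seldon-mesh
spec:
  storageUri: "gs://seldon-models/mlserver/iris"
  requirements:
  - sklearn
  memory: 100Ki
  replicas: 1   # explicitly defined so that HPA can manage this field
```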
Core 2 supports full horizontal scaling for the dataflow engine, model gateway, and pipeline gateway. Each service automatically distributes pipelines or models across replicas using consistent hashing, so you don’t need to manually assign workloads.
This guide explains how scaling works, what configuration controls it, and what happens when replicas or pipelines/models change.
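To illustrate the general idea (a simplified sketch only, not the actual Core 2 implementation), the following shows how a consistent-hash ring keeps most pipeline-to-replica assignments stable when a replica is added:

```python
# Simplified consistent-hash ring: assigns pipelines to service replicas.
# Illustrative only -- not the Core 2 dataflow-engine implementation.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, replicas: list[str], vnodes: int = 100):
        # Place several virtual nodes per replica on the ring for better balance.
        self._ring = sorted((_hash(f"{r}#{i}"), r) for r in replicas for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def assign(self, pipeline: str) -> str:
        # A pipeline maps to the first virtual node clockwise from its hash.
        idx = bisect.bisect(self._keys, _hash(pipeline)) % len(self._ring)
        return self._ring[idx][1]

pipelines = [f"pipeline-{i}" for i in range(10)]
before = {p: HashRing(["replica-0", "replica-1"]).assign(p) for p in pipelines}
after = {p: HashRing(["replica-0", "replica-1", "replica-2"]).assign(p) for p in pipelines}
moved = sum(before[p] != after[p] for p in pipelines)
print(f"{moved}/{len(pipelines)} pipelines moved after adding a replica")
```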
Learn how to perform model inference in Seldon Core using REST and gRPC protocols, including request/response formats and client examples.
This section will discuss how to make inference calls against your Seldon models or pipelines.
You can make synchronous inference requests via REST or gRPC, or asynchronous requests via Kafka topics. The content of your request should be an Open Inference Protocol payload:
REST payloads will generally be in the JSON v2 protocol format.
gRPC and Kafka payloads must be in the Protobuf v2 protocol format.
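For example, a REST request in the Open Inference Protocol format might look like the following; the host, port, and model name are placeholders for your own setup:

```bash
# Illustrative REST inference request through the seldon-mesh endpoint
curl -X POST http://<seldon-mesh-host>:80/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "Seldon-Model: iris" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
```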
The “model ready” health API indicates whether a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. If a version is not provided, the server may choose a version based on its own policies. The model readiness endpoints report that an individual model is loaded and ready to serve; they are intended only to give you visibility into the model's state and are not intended to be used as a Kubernetes readiness probe for the MLServer container. Using a model-specific health endpoint as the container readiness probe can cause a deadlock with the current implementation: the Seldon agent does not begin model download until the Pod's IP is visible in endpoints; the IP of the Pod is only published after the Pod is Ready, that is, after all internal readiness checks have passed; and the MLServer container only becomes Ready once the model is loaded. The result is that the agent never downloads the model and the Pod never becomes Ready. For container-level readiness checks we recommend the server-level readiness endpoints, which indicate that the MLServer process is up and accepting health checks and do not deadlock the agent/model loading flow.
The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.
Possible responses: OK, Not Found.
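As a sketch, fetching metadata for a model named iris through the mesh endpoint might look like this (host and port are placeholders):

```bash
# Illustrative model metadata request (HTTP GET on the V2 metadata endpoint)
curl http://<seldon-mesh-host>:80/v2/models/iris \
  -H "Seldon-Model: iris"
```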
An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.
apiVersion: v1
kind: Secret
metadata:
name: confluent-kafka-oauth
type: Opaque
stringData:
method: OIDC
client_id: <client id>
client_secret: <client secret>
token_endpoint_url: <token endpoint url>
extensions: logicalCluster=<cluster id>,identityPoolId=<identity pool id>
scope: ""kubectl create secret generic kafka-sasl-secret --from-literal password=<kafka-password> -n seldon-meshsecurity:
kafka:
protocol: SASL_SSL
sasl:
mechanism: SCRAM-SHA-512
client:
username: <kafka-username> # TODO: Replace with your Kafka username
secret: kafka-sasl-secret # NOTE: Secret name from previous step
ssl:
client:
secret: # NOTE: Leave empty
brokerValidationSecret: kafka-broker-tls # NOTE: Optional

apiVersion: v1
kind: Secret
metadata:
name: kafka-oauth
type: Opaque
stringData:
method: OIDC
client_id: <client id>
client_secret: <client secret>
token_endpoint_url: <token endpoint url>
extensions: ""
scope: ""kubectl apply -f kafka-oauth.yaml -n seldon-meshEndpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXXkubectl create secret generic azure-kafka-secret --from-literal=<password>="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXX" -n seldonkubectl create secret generic azure-kafka-secret --from-literal=<password>="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXX" -n seldon-systemcontroller:
clusterwide: true
dataflow:
resources:
cpu: 500m
envoy:
service:
type: ClusterIP
kafka:
bootstrap: <namespace>.servicebus.windows.net:9093
topics:
replicationFactor: 3
numPartitions: 4
security:
kafka:
protocol: SASL_SSL
sasl:
mechanism: "PLAIN"
client:
username: $ConnectionString
secret: azure-kafka-secret
ssl:
client:
secret:
brokerValidationSecret:
opentelemetry:
enable: false
scheduler:
service:
type: ClusterIP
serverConfig:
mlserver:
resources:
cpu: 1
memory: 2Gi
triton:
resources:
cpu: 1
memory: 2Gi
serviceGRPCPrefix: "http2-"security:
kafka:
protocol: SASL_SSL
sasl:
mechanism: OAUTHBEARER
client:
secret: kafka-oauth # NOTE: Secret name from earlier step
ssl:
client:
secret: # NOTE: Leave empty
brokerValidationSecret: kafka-broker-tls # NOTE: Optional

kubectl create secret generic kafka-client-tls -n seldon \
--from-file ./tls.crt \
--from-file ./tls.key \
--from-file ./ca.crt

security:
kafka:
protocol: SSL
ssl:
client:
secret: kafka-client-tls # NOTE: Secret name from earlier step
brokerValidationSecret: kafka-broker-tls # NOTE: Optional

kubectl create secret generic confluent-kafka-sasl --from-literal password="<password>" -n seldon

controller:
clusterwide: true
dataflow:
resources:
cpu: 500m
envoy:
service:
type: ClusterIP
kafka:
bootstrap: <confluent-endpoints>
topics:
replicationFactor: 3
numPartitions: 4
consumer:
messageMaxBytes: 8388608
producer:
messageMaxBytes: 8388608
security:
kafka:
protocol: SASL_SSL
sasl:
mechanism: "PLAIN"
client:
username: <username>
secret: confluent-kafka-sasl
ssl:
client:
secret:
brokerValidationSecret:
opentelemetry:
enable: false
scheduler:
service:
type: ClusterIP
serverConfig:
mlserver:
resources:
cpu: 1
memory: 2Gi
triton:
resources:
cpu: 1
memory: 2Gi
serviceGRPCPrefix: "http2-"controller:
clusterwide: true
dataflow:
resources:
cpu: 500m
envoy:
service:
type: ClusterIP
kafka:
bootstrap: <confluent-endpoints>
topics:
replicationFactor: 3
numPartitions: 4
consumer:
messageMaxBytes: 8388608
producer:
messageMaxBytes: 8388608
security:
kafka:
protocol: SASL_SSL
sasl:
mechanism: OAUTHBEARER
client:
secret: confluent-kafka-oauth
ssl:
client:
secret:
brokerValidationSecret:
opentelemetry:
enable: false
scheduler:
service:
type: ClusterIP
serverConfig:
mlserver:
resources:
cpu: 1
memory: 2Gi
triton:
resources:
cpu: 1
memory: 2Gi
serviceGRPCPrefix: "http2-"spec.replicas. This is the key modified by HPA to increase the number of replicas, and if not present in the manifest it will result in HPA not working until the Model CR is modified to have spec.replicas defined.Let's scale this model when it is deployed on a server named mlserver, with a target RPS per replica of 3 RPS (higher RPS would trigger scale-up, lower would trigger scale-down):
The Object metric allows for two target value types: AverageValue and Value. Of the two, only AverageValue is supported for the current Seldon Core 2 setup. The Value target type is typically used for metrics describing the utilization of a resource and would not be suitable for RPS-based scaling.
The example HPA manifests use metrics of type "Object" that fetch the data used in scaling decisions by querying k8s metrics associated with a particular k8s object. The endpoints that HPA uses for fetching those metrics are the same ones that were tested in the previous section using kubectl get --raw .... Because you have configured the Prometheus Adapter to expose those k8s metrics based on queries to Prometheus, a mapping exists between the information contained in the HPA Object metric definition and the actual query that is executed against Prometheus. This section aims to give more details on how this mapping works.
In our example, metric.name: infer_rps gets mapped to the seldon_model_infer_total metric on the Prometheus side, based on the configuration in the name section of the Prometheus Adapter ConfigMap. The Prometheus metric name is then used to fill in the <<.Series>> template in the query (metricsQuery in the same ConfigMap).
Then, the information provided in the describedObject is used within the Prometheus query to select the right aggregations of the metric. For the RPS metric used to scale the Model (and the Server because of the 1-1 mapping), it makes sense to compute the aggregate RPS across all the replicas of a given model, so the describedObject references a specific Model CR.
However, in the general case, the describedObject does not need to be a Model. Any k8s object listed in the resources section of the Prometheus Adapter ConfigMap may be used. The Prometheus label associated with the object kind fills in the <<.GroupBy>> template, while the name gets used as part of the <<.LabelMatchers>>. For example:
If the described object is { kind: Namespace, name: seldon-mesh }, then the Prometheus query template configured in our example would be transformed into:
If the described object is not a namespace (for example, { kind: Pod, name: mlserver-0 }), then the query will be passed the label describing the object, alongside an additional label identifying the namespace where the HPA manifest resides:
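For reference, a minimal sketch of what such a rule might look like in the Prometheus Adapter ConfigMap is shown below. The label-to-resource overrides and the 2-minute rate window are assumptions and should be adjusted to match your own adapter configuration:
rules:
- seriesQuery: 'seldon_model_infer_total{namespace!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      model: {group: "mlops.seldon.io", resource: "model"}
  name:
    matches: "seldon_model_infer_total"
    as: "infer_rps"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'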
The target section establishes the thresholds used in scaling decisions. For RPS, the AverageValue target type refers to the per-replica RPS threshold above which the number of the scaleTargetRef (Model or Server) replicas should be increased. The target number of replicas is computed by HPA according to the following formula:
$\text{targetReplicas} = \lceil \text{custom\_metric\_value} \,/\, \text{target.averageValue} \rceil$
As an example, if averageValue=50 and infer_rps=150, the targetReplicas would be 3.
Importantly, computing the target number of replicas does not require knowing the number of active pods currently associated with the Server or Model. This is what allows both the Model and the Server to be targeted by two separate HPA manifests. Otherwise, both HPA CRs would attempt to take ownership of the same set of pods, and transition into a failure state.
This is also why the Value target type is not currently supported. In this case, HPA first computes a utilizationRatio:
$\text{utilizationRatio} = \text{custom\_metric\_value} \,/\, \text{threshold\_value}$
As an example, if threshold_value=100 and custom_metric_value=200, the utilizationRatio would be 2. HPA deduces from this that the number of active pods associated with the scaleTargetRef object should be doubled, and expects that once that target is achieved, the custom_metric_value will become equal to the threshold_value (utilizationRatio=1). However, by using the number of active pods, the HPA CRs for both the Model and the Server also try to take exclusive ownership of the same set of pods, and fail.
Each HPA CR has its own timer on which it samples the specified custom metrics. This timer starts when the CR is created, with sampling of the metric being done at regular intervals (by default, 15 seconds). When showing the HPA CR information via kubectl get, a column of the output will display the current metric value per replica and the target average value in the format [per replica metric value][target]. This information is updated in accordance with the sampling rate of each HPA resource.
Once Model autoscaling is set up (either through HPA, or by Seldon Core), users will need to configure Server autoscaling. You can use Seldon Core's native autoscaling functionality for Servers here.
Otherwise, if you want to scale Servers using HPA as well - this only works in a setup where all Models and Servers have a 1-1 mapping - you will also need to set up HPA manifests for Servers. This is explained in more detail here.
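As a sketch, an HPA manifest targeting a Server named mlserver might look as follows; it mirrors the Model HPA shown later on this page, and the metric, namespace, and replica bounds are assumptions that must be kept in sync with the Model's HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: 3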
Filtering metrics by additional labels on the prometheus metric - The prometheus metric from which the model RPS is computed has the following labels managed by Seldon Core 2:
If you want the scaling metric to be computed based on a subset of the Prometheus time series with particular label values (labels either managed by Seldon Core 2 or added automatically within your infrastructure), you can add this as a selector to the HPA metric config. This is shown in the following example, which scales based only on the RPS of REST requests as opposed to REST + gRPC:
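metrics:
- type: Object
  object:
    describedObject:
      apiVersion: mlops.seldon.io/v1alpha1
      kind: Model
      name: irisa0
    metric:
      name: infer_rps
      selector:
        matchLabels:
          method_type: rest
    target:
      type: AverageValue
      averageValue: "3"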
Customize scale-up / scale-down rate & properties by using scaling policies as described in the HPA scaling policies docs
For more resources, please consult the HPA docs and the HPA walkthrough
When deploying HPA-based scaling for Seldon Core 2 models and servers as part of a production deployment, it is important to understand the exact interactions between HPA-triggered actions and Seldon Core 2 scheduling, as well as potential pitfalls in choosing particular HPA configurations.
Using the default scaling policy, HPA is relatively aggressive on scale-up (responding quickly to increases in load), with a maximum replicas increase of either 4 every 15 seconds or 100% of existing replicas within the same period (whichever is highest). In contrast, scaling-down is more gradual, with HPA only scaling down to the maximum number of recommended replicas in the most recent 5 minute rolling window, in order to avoid flapping. Those parameters can be customized via scaling policies.
When using custom metrics such as RPS, the actual number of replicas added during scale-up or removed during scale-down depends not only on the maximums imposed by the policy, but also on the configured target (averageValue RPS per replica) and on how quickly the inferencing load varies in your cluster. All three need to be considered jointly in order to both use resources efficiently and meet SLAs.
Naturally, the first thing to consider is an estimated peak inference load (including some margins) for each of the models in the cluster. If the minimum number of model replicas needed to serve that load without breaching latency SLAs is known, it should be set as spec.maxReplicas, with the HPA target.averageValue set to peak_infer_RPS/maxReplicas.
If maxReplicas is not already known, an open-loop load test with a slowly ramping up request rate should be done on the target model (one replica, no scaling). This would allow you to determine the RPS (inference request throughput) when latency SLAs are breached or (depending on the desired operation point) when latency starts increasing. You would then set the HPA target.averageValue taking some margin below this saturation RPS, and compute spec.maxReplicas as peak_infer_RPS/target.averageValue. The margin taken below the saturation point is very important, because scaling-up cannot be instant (it requires spinning up new pods, downloading model artifacts, etc.). In the period until the new replicas become available, any load increases will still need to be absorbed by the existing replicas.
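For example, if a single replica saturates at around 6 RPS and the estimated peak load is 40 RPS, you might set target.averageValue to 5 (keeping a margin below saturation) and spec.maxReplicas to ceil(40 / 5) = 8; these numbers are purely illustrative.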
If there are multiple models which typically experience peak load in a correlated manner, you need to ensure that sufficient cluster resources are available for k8s to concurrently schedule the maximum number of server pods, with each pod holding one model replica. This can be ensured by using either Cluster Autoscaler or, when running workloads in the cloud, any provider-specific cluster autoscaling services.
It is important for the cluster to have sufficient resources for creating the total number of desired server replicas set by the HPA CRs across all the models at a given time.
Not having sufficient cluster resources to serve the number of replicas configured by HPA at a given moment, in particular under aggressive scale-up HPA policies, may result in breaches of SLAs. This is discussed in more detail in the following section.
A similar approach should be taken for setting minReplicas, in relation to the estimated RPS in the low-load regime. However, it's useful to balance lower resource usage against the immediate availability of replicas when the inference rate increases from that lowest load point. If low-load regimes only occur for short periods of time, especially when combined with a high rate of increase in RPS when moving out of the low-load regime, it may be worth setting the minReplicas floor higher in order to ensure SLAs are met at all times.
The following elements are important to take into account when setting the HPA policies for models:
The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.
Say you configure a scale-up stabilization window of one minute. This means that, of all the replica counts recommended by HPA in the last 60-second window (4 samples of the custom metric at the default sampling rate), only the smallest will be applied.
Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.
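For illustration, a one-minute scale-up stabilization window combined with a default-style scale-up policy could be expressed in the HPA spec roughly as follows (the values are examples only):
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 15
      - type: Percent
        value: 100
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300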
The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.
It is useful to consider whether the replica scale-up rate configured via the policy is able to keep-up with this RPS increase rate.
Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment as you are draining the "blue" deployment and transitioning to the "green" one
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonConfig
metadata:
name: default
spec:
components:
- name: seldon-dataflow-engine
replicas: 1
podSpec:
containers:
- env:
- name: SELDON_UPSTREAM_HOST
value: seldon-scheduler
- name: SELDON_UPSTREAM_PORT
value: "9008"
- name: OTEL_JAVAAGENT_ENABLED
valueFrom:
configMapKeyRef:
key: OTEL_JAVAAGENT_ENABLED
name: seldon-tracing
- name: OTEL_EXPORTER_OTLP_ENDPOINT
valueFrom:
configMapKeyRef:
key: OTEL_EXPORTER_OTLP_ENDPOINT
name: seldon-tracing
- name: OTEL_EXPORTER_OTLP_PROTOCOL
valueFrom:
configMapKeyRef:
key: OTEL_EXPORTER_OTLP_PROTOCOL
name: seldon-tracing
- name: SELDON_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: seldonio/seldon-dataflow-engine:latest
imagePullPolicy: Always
name: dataflow-engine
resources:
limits:
memory: 3G
requests:
cpu: 100m
memory: 1G
ports:
- containerPort: 8000
name: health
startupProbe:
failureThreshold: 10
httpGet:
path: /startup
port: health
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
failureThreshold: 3
httpGet:
path: /ready
port: health
periodSeconds: 5
livenessProbe:
failureThreshold: 3
httpGet:
path: /live
port: health
periodSeconds: 5
volumeMounts:
- mountPath: /mnt/schema-registry
name: kafka-schema-volume
readOnly: true
serviceAccountName: seldon-scheduler
terminationGracePeriodSeconds: 5
volumes:
- secret:
secretName: confluent-schema
optional: true
name: kafka-schema-volume
- name: seldon-envoy
replicas: 1
annotations:
"prometheus.io/path": "/stats/prometheus"
"prometheus.io/port": "9003"
"prometheus.io/scrape": "true"
podSpec:
containers:
- image: seldonio/seldon-envoy:latest
imagePullPolicy: Always
name: envoy
ports:
- containerPort: 9000
name: http
- containerPort: 9003
name: envoy-stats
resources:
limits:
memory: 128Mi
requests:
cpu: 100m
memory: 128Mi
readinessProbe:
httpGet:
path: /ready
port: envoy-stats
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
terminationGracePeriodSeconds: 5
- name: hodometer
replicas: 1
podSpec:
containers:
- env:
- name: PUBLISH_URL
value: http://hodometer.seldon.io
- name: SCHEDULER_HOST
value: seldon-scheduler
- name: SCHEDULER_PLAINTXT_PORT
value: "9004"
- name: SCHEDULER_TLS_PORT
value: "9044"
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: seldonio/seldon-hodometer:latest
imagePullPolicy: Always
name: hodometer
resources:
limits:
memory: 32Mi
requests:
cpu: 1m
memory: 32Mi
serviceAccountName: hodometer
terminationGracePeriodSeconds: 5
- name: seldon-modelgateway
replicas: 1
podSpec:
containers:
- args:
- --scheduler-host=seldon-scheduler
- --scheduler-plaintxt-port=$(SELDON_SCHEDULER_PLAINTXT_PORT)
- --scheduler-tls-port=$(SELDON_SCHEDULER_TLS_PORT)
- --envoy-host=seldon-mesh
- --envoy-port=80
- --kafka-config-path=/mnt/kafka/kafka.json
- --tracing-config-path=/mnt/tracing/tracing.json
- --log-level=$(LOG_LEVEL)
- --health-probe-port=$(HEALTH_PROBE_PORT)
command:
- /bin/modelgateway
env:
- name: SELDON_SCHEDULER_PLAINTXT_PORT
value: "9004"
- name: SELDON_SCHEDULER_TLS_PORT
value: "9044"
- name: MODELGATEWAY_MAX_NUM_CONSUMERS
value: "100"
- name: LOG_LEVEL
value: "warn"
- name: HEALTH_PROBE_PORT
value: "9999"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: seldonio/seldon-modelgateway:latest
imagePullPolicy: Always
name: modelgateway
resources:
limits:
memory: 1G
requests:
cpu: 100m
memory: 1G
ports:
- containerPort: 9999
name: health
protocol: TCP
startupProbe:
httpGet:
path: /startup
port: health
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 10
readinessProbe:
httpGet:
path: /ready
port: health
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /live
port: health
periodSeconds: 5
failureThreshold: 3
volumeMounts:
- mountPath: /mnt/kafka
name: kafka-config-volume
- mountPath: /mnt/tracing
name: tracing-config-volume
- mountPath: /mnt/schema-registry
name: kafka-schema-volume
readOnly: true
serviceAccountName: seldon-scheduler
terminationGracePeriodSeconds: 5
volumes:
- configMap:
name: seldon-kafka
name: kafka-config-volume
- configMap:
name: seldon-tracing
name: tracing-config-volume
- secret:
secretName: confluent-schema
optional: true
name: kafka-schema-volume
- name: seldon-pipelinegateway
replicas: 1
podSpec:
containers:
- args:
- --http-port=9010
- --grpc-port=9011
- --metrics-port=9006
- --scheduler-host=seldon-scheduler
- --scheduler-plaintxt-port=$(SELDON_SCHEDULER_PLAINTXT_PORT)
- --scheduler-tls-port=$(SELDON_SCHEDULER_TLS_PORT)
- --envoy-host=seldon-mesh
- --envoy-port=80
- --kafka-config-path=/mnt/kafka/kafka.json
- --tracing-config-path=/mnt/tracing/tracing.json
- --log-level=$(LOG_LEVEL)
- --health-probe-port=$(HEALTH_PROBE_PORT)
command:
- /bin/pipelinegateway
env:
- name: SELDON_SCHEDULER_PLAINTXT_PORT
value: "9004"
- name: SELDON_SCHEDULER_TLS_PORT
value: "9044"
- name: LOG_LEVEL
value: "warn"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: HEALTH_PROBE_PORT
value: "9999"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: seldonio/seldon-pipelinegateway
imagePullPolicy: Always
name: pipelinegateway
ports:
- containerPort: 9010
name: http
protocol: TCP
- containerPort: 9011
name: grpc
protocol: TCP
- containerPort: 9006
name: metrics
protocol: TCP
- containerPort: 9999
name: health
protocol: TCP
resources:
limits:
memory: 1G
requests:
cpu: 100m
memory: 1G
startupProbe:
httpGet:
path: /startup
port: health
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 10
readinessProbe:
httpGet:
path: /ready
port: health
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /live
port: health
periodSeconds: 5
failureThreshold: 3
volumeMounts:
- mountPath: /mnt/kafka
name: kafka-config-volume
- mountPath: /mnt/tracing
name: tracing-config-volume
- mountPath: /mnt/schema-registry
name: kafka-schema-volume
readOnly: true
serviceAccountName: seldon-scheduler
terminationGracePeriodSeconds: 5
volumes:
- configMap:
name: seldon-kafka
name: kafka-config-volume
- configMap:
name: seldon-tracing
name: tracing-config-volume
- secret:
secretName: confluent-schema
optional: true
name: kafka-schema-volume
- name: seldon-scheduler
replicas: 1
podSpec:
containers:
- args:
- --pipeline-gateway-host=seldon-pipelinegateway
- --tracing-config-path=/mnt/tracing/tracing.json
- --db-path=/mnt/scheduler/db
- --allow-plaintxt=$(ALLOW_PLAINTXT)
- --kafka-config-path=/mnt/kafka/kafka.json
- --scaling-config-path=/mnt/scaling/scaling.yaml
- --scheduler-ready-timeout-seconds=$(SCHEDULER_READY_TIMEOUT_SECONDS)
- --server-packing-enabled=$(SERVER_PACKING_ENABLED)
- --server-packing-percentage=$(SERVER_PACKING_PERCENTAGE)
- --envoy-accesslog-path=$(ENVOY_ACCESSLOG_PATH)
- --enable-envoy-accesslog=$(ENABLE_ENVOY_ACCESSLOG)
- --include-successful-requests-envoy-accesslog=$(INCLUDE_SUCCESSFUL_REQUESTS_ENVOY_ACCESSLOG)
- --enable-model-autoscaling=$(ENABLE_MODEL_AUTOSCALING)
- --enable-server-autoscaling=$(ENABLE_SERVER_AUTOSCALING)
- --log-level=$(LOG_LEVEL)
- --health-probe-port=$(HEALTH_PROBE_PORT)
- --enable-pprof=$(ENABLE_PPROF)
- --pprof-port=$(PPROF_PORT)
- --pprof-block-rate=$(PPROF_BLOCK_RATE)
- --pprof-mutex-rate=$(PPROF_MUTEX_RATE)
- --retry-creating-failed-pipelines-tick=$(RETRY_CREATING_FAILED_PIPELINES_TICK)
- --retry-deleting-failed-pipelines-tick=$(RETRY_DELETING_FAILED_PIPELINES_TICK)
- --max-retry-failed-pipelines=$(MAX_RETRY_FAILED_PIPELINES)
command:
- /bin/scheduler
env:
- name: ALLOW_PLAINTXT
value: "true"
- name: SCHEDULER_READY_TIMEOUT_SECONDS
value: "600"
- name: SERVER_PACKING_ENABLED
value: "false"
- name: SERVER_PACKING_PERCENTAGE
value: "0.0"
- name: ENVOY_ACCESSLOG_PATH
value: /tmp/envoy-accesslog.txt
- name: ENABLE_ENVOY_ACCESSLOG
value: "true"
- name: INCLUDE_SUCCESSFUL_REQUESTS_ENVOY_ACCESSLOG
value: "false"
- name: ENABLE_MODEL_AUTOSCALING
value: "false"
- name: ENABLE_SERVER_AUTOSCALING
value: "true"
- name: LOG_LEVEL
value: "warn"
- name: MODELGATEWAY_MAX_NUM_CONSUMERS
value: "100"
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: HEALTH_PROBE_PORT
value: "9999"
- name: ENABLE_PPROF
value: "false"
- name: PPROF_PORT
value: "6060"
- name: PPROF_BLOCK_RATE
value: "0"
- name: PPROF_MUTEX_RATE
value: "0"
- name: RETRY_CREATING_FAILED_PIPELINES_TICK
value: "60s"
- name: RETRY_DELETING_FAILED_PIPELINES_TICK
value: "60s"
- name: MAX_RETRY_FAILED_PIPELINES
value: "10"
image: seldonio/seldon-scheduler:latest
imagePullPolicy: Always
name: scheduler
ports:
- containerPort: 9002
name: xds
- containerPort: 9004
name: scheduler
- containerPort: 9044
name: scheduler-mtls
- containerPort: 9005
name: agent
- containerPort: 9055
name: agent-mtls
- containerPort: 9008
name: dataflow
- containerPort: 9999
name: health
protocol: TCP
readinessProbe:
httpGet:
path: /ready
port: health
periodSeconds: 5
failureThreshold: 3
initialDelaySeconds: 10
livenessProbe:
httpGet:
path: /live
port: health
periodSeconds: 5
failureThreshold: 3
initialDelaySeconds: 10
resources:
limits:
memory: 1G
requests:
cpu: 100m
memory: 1G
volumeMounts:
- mountPath: /mnt/kafka
name: kafka-config-volume
- mountPath: /mnt/scaling
name: scaling-config-volume
- mountPath: /mnt/tracing
name: tracing-config-volume
- mountPath: /mnt/scheduler
name: scheduler-state
serviceAccountName: seldon-scheduler
terminationGracePeriodSeconds: 5
volumes:
- configMap:
name: seldon-scaling
name: scaling-config-volume
- configMap:
name: seldon-kafka
name: kafka-config-volume
- configMap:
name: seldon-tracing
name: tracing-config-volume
volumeClaimTemplates:
- name: scheduler-state
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1G
kubectl apply -f kafka-secret.yaml -n seldon
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: join
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple3
seldon_model_infer_total{
code="200",
container="agent",
endpoint="metrics",
instance="10.244.0.39:9006",
job="seldon-mesh/agent",
method_type="rest",
model="irisa0",
model_internal="irisa0_1",
namespace="seldon-mesh",
pod="mlserver-0",
server="mlserver",
server_replica="0"
}
metrics:
- type: Object
object:
describedObject:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
metric:
name: infer_rps
selector:
matchLabels:
method_type: rest
target:
type: AverageValue
averageValue: "3"apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: irisa0
namespace: seldon-mesh
spec:
memory: 3M
replicas: 1
requirements:
- sklearn
storageUri: gs://seldon-models/testing/iris1
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: irisa0-model-hpa
namespace: seldon-mesh
spec:
scaleTargetRef:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
minReplicas: 1
maxReplicas: 3
metrics:
- type: Object
object:
metric:
name: infer_rps
describedObject:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
target:
type: AverageValue
averageValue: 3
sum by (namespace) (
rate (
seldon_model_infer_total{namespace="seldon-mesh"}[2m]
)
)
sum by (pod) (
rate (
seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[2m]
)
)
{
"name": "text",
"version": "text",
"extensions": [
"text"
]
}
GET /v2/models/{model_name}/versions/{model_version} HTTP/1.1
Host:
Accept: */*
{
"name": "text",
"versions": [
"text"
],
"platform": "text",
"inputs": [
{
"name": "text",
"datatype": "text",
"shape": [
1
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
],
"outputs": [
{
"name": "text",
"datatype": "text",
"shape": [
1
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
GET /v2/models/{model_name} HTTP/1.1
Host:
Accept: */*
{
"name": "text",
"versions": [
"text"
],
"platform": "text",
"inputs": [
{
"name": "text",
"datatype": "text",
"shape": [
1
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
],
"outputs": [
{
"name": "text",
"datatype": "text",
"shape": [
1
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
GET /v2 HTTP/1.1
Host:
Accept: */*
| Component | What it scales with (max maxShardCountMultiplier = #partitions) | Max replicas used |
| --- | --- | --- |
| Dataflow engine | #pipelines × maxShardCountMultiplier (capped by configured replicas) | min(replicas, pipelines × partitions) |
| Model gateway | #models × maxShardCountMultiplier (capped by replicas and maxNumConsumers) | min(replicas, min(models, maxNumConsumers) × partitions) |
| Pipeline gateway | #pipelines × maxShardCountMultiplier (capped by replicas and maxNumConsumers) | min(replicas, min(pipelines, maxNumConsumers) × partitions) |
Each pipeline/model is loaded only on a subset of replicas, and automatically rebalanced when:
You scale replicas up/down
You deploy or delete pipelines / models
The configuration parameter determining the maximum number of component replicas on which a pipeline/model can be loaded is maxShardCountMultiplier. It can be set in SeldonConfig under config.scalingConfig.pipelines.maxShardCountMultiplier. For installs via Helm, this parameter defaults to {{ .Values.kafka.topics.numPartitions }}. In fact, the number of Kafka partitions per topic is the maximum value that maxShardCountMultiplier should be set to: increasing it beyond the number of Kafka partitions brings no additional performance benefit, and may actually lead to dropped requests because the extra replicas receive requests while managing no Kafka partitions within their consumer groups.
This parameter may be changed during cluster operation, with the new value propagated to all components over a ~1 minute interval. Changing it causes pipelines/models to be rebalanced across the dataflow-engine, model-gateway, and pipeline-gateway replicas and may lead to downtime, depending on the configured Kafka partition assignment strategy. If your Kafka version supports cooperative rebalancing of consumer groups, setting the partition assignment strategy to cooperative-sticky ensures that rebalancing happens with minimal disruption. dataflow-engine uses a cooperative rebalancing strategy by default.
You do not need to manually assign work — it’s handled automatically.
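As a sketch, setting this parameter in a SeldonConfig might look as follows, assuming the scalingConfig block sits under spec.config alongside the other configuration sections and that topics have 4 Kafka partitions:
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonConfig
metadata:
  name: default
spec:
  config:
    scalingConfig:
      pipelines:
        maxShardCountMultiplier: 4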
Dataflow engine is responsible for executing pipeline logic and moving data between pipeline stages. Core 2 now supports running multiple pipelines in parallel across multiple dataflow engine replicas.
You control scaling using:
| Parameter | Where it is set | What it controls |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of dataflow engine instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per pipeline (max possible: #Kafka partitions) |
Dataflow engine replicas are dynamically adjusted based on number of pipelines deployed and Kafka partitions. The final number of dataflow engine replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \text{pipelines} \times \text{partitions})$
Example:

| Pipelines | Kafka partitions | spec.replicas | Final replica count |
| --- | --- | --- | --- |
| 3 | 4 | 9 | min(9, 3 x 4 = 12) → 9 replicas |
| 2 | 4 | 9 | min(9, 2 x 4 = 8) → 8 replicas |
| 1 | 4 | 9 | min(9, 1 x 4 = 4) → 4 replicas |
Note: Unused replicas are automatically scaled down. As more pipelines are added, dataflow engine automatically scales up, capped by the maximum number of replicas.
Core 2 uses consistent hashing to distribute pipelines evenly across dataflow replicas. This ensures a balanced workload, but it does not guarantee a perfect one-to-one mapping.
Even if the number of replicas equals pipelines × partitions, small imbalances between the number of pipelines handled by each replica may exist. In practice, the distribution is statistically uniform.
Each pipeline is replicated across multiple dataflow engines (up to number of Kafka partitions).
When instances are added or removed, pipelines are automatically rebalanced.
Note: This process is handled internally by Core 2, so no manual intervention is needed.
Loading/unloading of the pipeline from the dataflow engine is performed when the pipeline CR is loaded/unloaded.
The scheduler confirms whether the loading/unloading was performed successfully through the Pipeline status under the CR.
Rebalancing happens in the background — you don’t need to intervene.
Note: For pipelines, Pipeline ready status must be satisfied in order for the pipeline to be marked ready.
The model gateway is responsible for routing inference requests to models when used inside pipelines. Like the dataflow engine, it scales dynamically based on how many models are deployed.
| Parameter | Where it is set | What it controls |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of model gateway instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per model (#Kafka partitions) |
| maxNumConsumers | SeldonConfig - model gateway env var (default: 100) | Caps how many distinct consumer groups can exist |
Model gateway replicas are dynamically adjusted based on number of models deployed, Kafka partitions, and maxNumConsumers. The final number of model gateway replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{models}, \text{maxNumConsumers}) \times \text{partitions})$
Example:

| Models | Kafka partitions | spec.replicas | maxNumConsumers | Final replica count |
| --- | --- | --- | --- | --- |
| 5 | 4 | 20 | 100 | min(20, min(5, 100) x 4 = 20) = 20 → 20 replicas |
| 1 | 4 | 20 | 100 | min(20, min(1, 100) x 4 = 4) = 4 → 4 replicas |
Note: If you remove models, the model gateway automatically scales down, and if you add models, the model gateway automatically scales up, capped by the maximum number of replicas.
The model gateway doesn't load every model on every replica, only on a subset of replicas. The same principle as for the dataflow engine applies to the model gateway (sharding through consistent hashing).
Loading/unloading of the model from the model gateway is performed when the model CR is loaded/unloaded.
The scheduler confirms whether the loading/unloading was performed successfully through the ModelGw status under the CR.
Rebalancing happens in the background — you don’t need to intervene.
Note: ModelGw status does not represent a condition for the model to be available. If the loading was successful on the dedicated servers, the model itself is ready for inference.
The ModelGw status becomes relevant for pipelines, or when the end user wants to perform inference via the async path (i.e., writing requests to the model's input topic and reading responses from the model's output topic in Kafka).
In the context of pipelines, the ModelReady status becomes a conjunction of whether the model is available on servers and whether it has been loaded successfully on the model gateway.
The pipeline gateway is responsible for writing requests to the pipeline's input topic and waiting for the response on the output topic. Like the dataflow engine and model gateway, the pipeline gateway can scale horizontally.
| Parameter | Where it is set | What it controls |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of pipeline gateway instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per pipeline (#Kafka partitions) |
| maxNumConsumers | SeldonConfig - pipeline gateway env var (default: 100) | Caps how many distinct consumer groups can exist |
Pipeline gateway replicas are dynamically adjusted based on number of pipelines deployed, Kafka partitions, and maxNumConsumers. The final number of pipeline gateway replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{pipelines}, \text{maxNumConsumers}) \times \text{partitions})$
Example:

| Pipelines | Kafka partitions | spec.replicas | maxNumConsumers | Final replica count |
| --- | --- | --- | --- | --- |
| 8 | 4 | 10 | 100 | min(10, min(8, 100) x 4 = 32) = 10 → 10 replicas |
| 2 | 4 | 10 | 100 | min(10, min(2, 100) x 4 = 8) = 8 → 8 replicas |
| 1 | 4 | 10 | 100 | min(10, min(1, 100) x 4 = 4) = 4 → 4 replicas |
Note: Similarly to the dataflow engine, the pipeline gateway scales up and down as pipelines are added and removed.
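To make the arithmetic in the examples above concrete, here is a small illustrative Python helper (not part of Core 2) that reproduces the final replica counts from the tables:
def final_replicas(spec_replicas, shards, partitions, max_num_consumers=None):
    """Effective replica count for the dataflow engine, model gateway, or
    pipeline gateway. `shards` is the number of pipelines (dataflow engine,
    pipeline gateway) or models (model gateway)."""
    if max_num_consumers is not None:
        shards = min(shards, max_num_consumers)
    return min(spec_replicas, shards * partitions)

# Dataflow engine examples: 9, 8 and 4 replicas
print(final_replicas(9, 3, 4), final_replicas(9, 2, 4), final_replicas(9, 1, 4))
# Model gateway examples: 20 and 4 replicas
print(final_replicas(20, 5, 4, 100), final_replicas(20, 1, 4, 100))
# Pipeline gateway examples: 10, 8 and 4 replicas
print(final_replicas(10, 8, 4, 100), final_replicas(10, 2, 4, 100), final_replicas(10, 1, 4, 100))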
Pipeline gateway doesn’t load every pipeline on every replica but only on a subset of replicas. The same principle as for dataflow engine and model gateway applies for pipeline gateway (sharding through consistent hashing).
Loading/unloading of the pipeline from the pipeline gateway is performed when the pipeline CR is loaded/unloaded.
The scheduler confirms whether the loading/unloading was performed successfully through the PipelineGw status under the CR.
Analogous with the previous services, rebalancing happens in the background — you don’t need to intervene.
Note: For pipelines, PipelineGw ready status must be satisfied in order for the pipeline to be marked ready.
For making synchronous requests, the process will generally be:
Find the appropriate service endpoint (IP address and port) for accessing the installation of Seldon Core 2.
Determine the appropriate headers/metadata for the request.
Make requests via REST or gRPC.
In the default Docker Compose setup, container ports are accessible from the host machine. This means you can use localhost or 0.0.0.0 as the hostname.
The default port for sending inference requests to the Seldon system is 9000. This is controlled by the ENVOY_DATA_PORT environment variable for Compose.
Putting this together, you can send inference requests to 0.0.0.0:9000.
In Kubernetes, Seldon creates a single Service called seldon-mesh in the namespace it is installed into. By default, this namespace is also called seldon-mesh.
If this Service is exposed via a load balancer, the appropriate address and port can be found via:
kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
If you are not using a LoadBalancer for the seldon-mesh Service, you can still send inference requests.
For development and testing purposes, you can port-forward the Service locally using the below. Inference requests can then be sent to localhost:8080.
kubectl port-forward svc/seldon-mesh -n seldon-mesh 8080:80
If you are using a service mesh like Istio or Ambassador, you will need to use the IP address of the service mesh ingress and determine the appropriate port.
Let us imagine making inference requests to a model called iris.
This iris model has the following schema, which can be set in a model-settings.json file for MLServer:
Examples are given below for some common tools for making requests.
An example seldon request might look like this:
The default inference mode is REST, but you can also send gRPC requests like this:
An example curl request might look like this:
An example grpcurl request might look like this:
The above request was run from the project root folder allowing reference to the Protobuf manifests defined in the apis/ folder.
You can use the Python package to send inference requests.
A short, self-contained example is:
Seldon needs to determine where to route requests to, as models and pipelines might have the same name. There are two ways of doing this: header-based routing (preferred) and path-based routing.
Seldon can route requests to the correct endpoint via headers in HTTP calls, both for REST (HTTP/1.1) and gRPC (HTTP/2).
Use the Seldon-Model header as follows:
For models, use the model name as the value. For example, to send requests to a model named foo use the header Seldon-Model: foo.
For pipelines, use the pipeline name followed by .pipeline as the value. For example, to send requests to a pipeline named foo use the header Seldon-Model: foo.pipeline.
The seldon CLI is aware of these rules and can be used to easily send requests to your deployed resources. See the Seldon CLI documentation for more information.
The inference v2 protocol is only aware of models, thus has no concept of pipelines. Seldon works around this limitation by introducing virtual endpoints for pipelines. Virtual means that Seldon understands them, but other v2 protocol-compatible components like inference servers do not.
Use the following rules for paths to route to models and pipelines:
For models, use the path prefix /v2/models/{model name}. This is normal usage of the inference v2 protocol.
For pipelines, you can use the path prefix /v2/pipelines/{pipeline name}. Otherwise calling pipelines looks just like the inference v2 protocol for models. Do not use any suffix for the pipeline name as you would for routing headers.
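For example, a REST request to a hypothetical pipeline named mypipeline could be sent as follows (host and payload as in the earlier iris examples):
curl http://0.0.0.0:9000/v2/pipelines/mypipeline/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'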
Extending our examples from above, the requests may look like the below when using header-based routing.
No changes are required as the seldon CLI already understands how to set the appropriate gRPC and REST headers.
Note the rpc-header flag in the penultimate line:
Note the headers dictionary in the client.infer() call:
If you are using an ingress controller to make inference requests with Seldon, you will need to configure the routing rules correctly.
There are many ways to do this, but custom path prefixes will not work with gRPC. This is because gRPC determines the path based on the Protobuf definition. Some gRPC implementations permit manipulating paths when sending requests, but this is by no means universal.
If you want to expose your inference endpoints via gRPC and REST in a consistent way, you should use virtual hosts, subdomains, or headers.
The downside of using only paths is that you cannot differentiate between different installations of Seldon Core 2 or between traffic to Seldon and any other inference endpoints you may have exposed via the same ingress.
You might want to use a mixture of these methods; the choice is yours.
Virtual hosts are a way of differentiating between logical services accessed via the same physical machine(s).
Virtual hosts are defined by the Host header for HTTP/1 and the :authority pseudo-header for HTTP/2. These represent the same thing, and the HTTP/2 specification defines how to translate these when converting between protocol versions.
Many tools and libraries treat these headers as special and have particular ways of handling them. Some common ones are given below:
The seldon CLI has an --authority flag which applies to both REST and gRPC inference calls.
curl accepts Host as a normal header.
grpcurl has an -authority flag.
In Go, the standard library's http.Request struct has a Host field and ignores attempts to set this value via headers.
In Python, the requests library accepts the host as a normal header.
Be sure to check the documentation for how to set this with your preferred tools and languages.
Subdomain names constitute a part of the overall host name. As such, specifying a subdomain name for requests will involve setting the appropriate host in the URI.
For example, you may expose inference services in the namespaces seldon-1 and seldon-2 as in the following snippets:
Many popular ingresses support subdomain-based routing, including Istio and Nginx. Please refer to the documentation for your ingress of choice for further information.
Many ingress controllers and service meshes support routing on headers. You can use whatever headers you prefer, so long as they do not conflict with any Seldon relies upon.
Many tools and libraries support adding custom headers to requests. Some common ones are given below:
The seldon CLI accepts headers using the --header flag, which can be specified multiple times.
curl accepts headers using the -H or --header flag.
It is possible to route on paths by using well-known path prefixes defined by the inference v2 protocol. For gRPC, the full path (or "method") for an inference call is:
/inference.GRPCInferenceService/ModelInfer
This corresponds to the package (inference), service (GRPCInferenceService), and RPC name (ModelInfer) in the Protobuf definition of the inference v2 protocol.
You could use an exact match or a regex like .*inference.* to match this path, for example.
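As a sketch, assuming an Istio VirtualService in front of the seldon-mesh Service (the gateway name, hosts, and namespace below are assumptions), such path-based routing might look like:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: seldon-inference
  namespace: seldon-mesh
spec:
  hosts:
  - "*"
  gateways:
  - seldon-gateway
  http:
  - match:
    - uri:
        prefix: /v2/
    - uri:
        regex: .*inference.GRPCInferenceService.*
    route:
    - destination:
        host: seldon-mesh
        port:
          number: 80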
The Seldon architecture uses Kafka and therefore asynchronous requests can be sent by pushing inference v2 protocol payloads to the appropriate topic. Topics have the following form:
seldon.<namespace>.<model|pipeline>.<name>.<inputs|outputs>
For a local install if you have a model iris, you would be able to send a prediction request by pushing to the topic: seldon.default.model.iris.inputs. The response will appear on seldon.default.model.iris.outputs.
For a Kubernetes install in seldon-mesh if you have a model iris, you would be able to send a prediction request by pushing to the topic: seldon.seldon-mesh.model.iris.inputs. The response will appear on seldon.seldon-mesh.model.iris.outputs.
For a local install if you have a pipeline mypipeline, you would be able to send a prediction request by pushing to the topic: seldon.default.pipeline.mypipeline.inputs. The response will appear on seldon.default.pipeline.mypipeline.outputs.
For a Kubernetes install in seldon-mesh if you have a pipeline mypipeline, you would be able to send a prediction request by pushing to the topic: seldon.seldon-mesh.pipeline.mypipeline.inputs. The response will appear on seldon.seldon-mesh.pipeline.mypipeline.outputs.
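As an illustrative sketch (not part of Seldon), the following Python snippet uses the confluent-kafka client to push a v2 payload onto a model's input topic for a local install; the bootstrap address is an assumption for your environment:
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed local broker

# Inference v2 protocol payload for the iris model
payload = {
    "inputs": [
        {"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}
    ]
}

# Local install: model input topic is seldon.default.model.iris.inputs
producer.produce(
    "seldon.default.model.iris.inputs",
    key="request-1",
    value=json.dumps(payload),
)
producer.flush()
# The response will appear on seldon.default.model.iris.outputs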
It may be useful to send metadata alongside your inference.
If using Kafka directly as described above, you can attach Kafka metadata to your request, which will be passed around the graph. When making synchronous requests to your pipeline with REST or gRPC you can also do this.
For REST requests add HTTP headers prefixed with X-
For gRPC requests add metadata with keys starting with X-
You can also do this with the Seldon CLI by setting headers with the --header argument (and also showing response headers with the --show-headers argument)
For pipeline inference, the response also contains an x-pipeline-version header, indicating which version of the pipeline the inference ran against.
For both model and pipeline requests the response will contain an x-request-id response header. For pipeline requests this can be used to inspect the pipeline steps via the CLI, e.g.:
The --offset parameter specifies how many messages (from the latest) you want to search to find your request. If not specified the last request will be shown.
x-request-id will also appear in tracing spans.
If x-request-id is passed in by the caller then this will be used. It is the caller's responsibility to ensure it is unique.
The IDs generated are XIDs.
Examples of various model artifact types from various frameworks running under Seldon Core 2.
SKlearn
Tensorflow
XGBoost
ONNX
Lightgbm
MLFlow
PyTorch
Python requirements in model-zoo-requirements.txt
The training code for this model can be found at scripts/models/iris in SCv2 repo.
The training code for this model can be found at ./scripts/models/income-xgb
This model is a pretrained model as defined in ./scripts/models/Makefile target mnist-onnx
The training code for this model can be found at ./scripts/models/income-lgb
The training code for this model can be found at ./scripts/models/wine-mlflow
This example model is downloaded and trained in ./scripts/models/Makefile target mnist-pytorch
We use a simple sklearn iris classification model
Load the model
kubectl apply -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
seldon model load -f ./models/sklearn-iris-gs.yaml
{}
Wait for the model to be ready
kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
seldon model status iris -w ModelAvailable | jq -M .
{}
Do a REST inference call
Do a gRPC inference call
Unload the model
We run a simple tensorflow model. Note the requirements section specifying tensorflow.
Load the model.
Wait for the model to be ready.
Get model metadata
Do a REST inference call.
Do a gRPC inference call
Unload the model
We will use two SKlearn Iris classification models to illustrate an experiment.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.
Start the experiment.
Wait for the experiment to be ready.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Run one more request
Use sticky session key passed by last infer request to ensure same route is taken each time. We will test REST and gRPC.
gRPC
Stop the experiment
Show the requests all go to original model now.
Unload both models.
seldon model infer iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'seldon model infer iris \
--inference-mode grpc \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'curl -v http://0.0.0.0:9000/v2/models/iris/infer \
-H "Content-Type: application/json" \
-d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'grpcurl \
-d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
-plaintext \
-import-path apis \
-proto apis/mlops/v2_dataplane/v2_dataplane.proto \
0.0.0.0:9000 inference.GRPCInferenceService/ModelInfer
curl -v http://0.0.0.0:9000/v2/models/iris/infer \
-H "Content-Type: application/json" \
-d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' \
-H "Seldon-Model: iris":emphasize-lines: 6
grpcurl \
-d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
-plaintext \
-import-path apis \
-proto apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:iris \
0.0.0.0:9000 inference.GRPCInferenceService/ModelInfer
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient(
url="localhost:8080",
verbose=False,
)
inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
inputs[0].set_data_from_numpy(
np.array([[1, 2, 3, 4]]).astype("float64"),
binary_data=False,
)
result = client.infer(
"iris",
inputs,
headers={"Seldon-Model": "iris"},
)
print("result is:", result.as_numpy("predict")){
"name": "iris",
"implementation": "mlserver_sklearn.SKLearnModel",
"inputs": [
{
"name": "predict",
"datatype": "FP32",
"shape": [-1, 4]
}
],
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [-1, 1]
}
],
"parameters": {
"version": "1"
}
}
seldon.<namespace>.<model|pipeline>.<name>.<inputs|outputs>
seldon pipeline infer --show-headers --header X-foo=bar tfsimples \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'seldon pipeline inspect tfsimples --request-id carjjolvqj3j2pfbut10 --offset 10GET /v2/health/live HTTP/1.1
Host:
Accept: */*
GET /v2/health/ready HTTP/1.1
Host:
Accept: */*
POST /v2/models/{model_name}/versions/{model_version}/infer HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 371
{
"id": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"inputs": [
{
"name": "text",
"shape": [
1
],
"datatype": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"data": null
}
],
"outputs": [
{
"name": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
]
}
POST /v2/models/{model_name}/infer HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 371
{
"id": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"inputs": [
{
"name": "text",
"shape": [
1
],
"datatype": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"data": null
}
],
"outputs": [
{
"name": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
]
}
cat ./models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
config:
kafkaConfig:
producer:
compression.type: gzip
For pipelines, you can also use the path prefix /v2/models/{pipeline name}.pipeline. Again, this form looks just like the inference v2 protocol for models.
grpcurl accepts headers using the -H flag, which can be specified multiple times.
kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
kubectl port-forward svc/seldon-mesh -n seldon-mesh 8080:80
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient(
url="localhost:8080",
verbose=False,
)
inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
inputs[0].set_data_from_numpy(
np.array([[1, 2, 3, 4]]).astype("float64"),
binary_data=False,
)
result = client.infer("iris", inputs)
print("result is:", result.as_numpy("predict"))curl https://seldon-1.example.com/v2/models/iris/infer ...
seldon model infer --inference-host https://seldon-2.example.com/v2/models/iris/infer ...
/inference.GRPCInferenceService/ModelInfer
GET /v2/models/{model_name}/ready HTTP/1.1
Host:
Accept: */*
{
"model_name": "text",
"model_version": "text",
"id": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"outputs": [
{
"name": "text",
"shape": [
1
],
"datatype": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"data": null
}
]
}
{
"model_name": "text",
"model_version": "text",
"id": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"outputs": [
{
"name": "text",
"shape": [
1
],
"datatype": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"data": null
}
]
}
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "983bd95f-4b4d-4ff1-95b2-df9d6d089164",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon model infer iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
{
"model_name": "iris_1",
"model_version": "1",
"id": "983bd95f-4b4d-4ff1-95b2-df9d6d089164",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon model infer iris --inference-mode grpc \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "np"
}
},
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl delete -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
seldon model unload iris
cat ./models/tfsimple1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl apply -f ./models/tfsimple1.yaml -n ${NAMESPACE}
model.mlops.seldon.io/tfsimple1 created
seldon model load -f ./models/tfsimple1.yaml
{}
kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}
model.mlops.seldon.io/tfsimple1 condition met
seldon model status tfsimple1 -w ModelAvailable | jq -M .
{}
curl --location 'http://${SELDON_INFER_HOST}/v2/models/tfsimple1'
{
"name": "tfsimple1_1",
"versions": [
"1"
],
"platform": "tensorflow_graphdef",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [
-1,
16
]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [
-1,
16
]
}
],
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
-1,
16
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
-1,
16
]
}
]
}
seldon model metadata tfsimple1
{
"name": "tfsimple1_1",
"versions": [
"1"
],
"platform": "tensorflow_graphdef",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [
-1,
16
]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [
-1,
16
]
}
],
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
-1,
16
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
-1,
16
]
}
]
}
curl --location 'http://${SELDON_INFER_HOST}/v2/models/tfsimple1/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
}
seldon model infer tfsimple1 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
}
seldon model infer tfsimple1 --inference-mode grpc \
'{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"modelName": "tfsimple1_1",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
}
]
}
kubectl delete -f ./models/tfsimple1.yaml -n ${NAMESPACE}
model.mlops.seldon.io "tfsimple1" deleted
seldon model unload tfsimple1
cat ./models/sklearn1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
cat ./models/sklearn2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris2
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
kubectl apply -f ./models/sklearn1.yaml -n ${NAMESPACE}
kubectl apply -f ./models/sklearn2.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sklearn1 created
model.mlops.seldon.io/sklearn2 created
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml
{}
{}
kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model iris2 -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
model.mlops.seldon.io/iris2 condition met
seldon model status iris | jq -M .
seldon model status iris2 | jq -M .
{
"modelName": "iris",
"versions": [
{
"version": 1,
"serverName": "mlserver",
"kubernetesMeta": {},
"modelReplicaState": {
"0": {
"state": "Available",
"lastChangeTimestamp": "2023-06-29T14:01:41.362720538Z"
}
},
"state": {
"state": "ModelAvailable",
"availableReplicas": 1,
"lastChangeTimestamp": "2023-06-29T14:01:41.362720538Z"
},
"modelDefn": {
"meta": {
"name": "iris",
"kubernetesMeta": {}
},
"modelSpec": {
"uri": "gs://seldon-models/mlserver/iris",
"requirements": [
"sklearn"
]
},
"deploymentSpec": {
"replicas": 1
}
}
}
]
}
{
"modelName": "iris2",
"versions": [
{
"version": 1,
"serverName": "mlserver",
"kubernetesMeta": {},
"modelReplicaState": {
"0": {
"state": "Available",
"lastChangeTimestamp": "2023-06-29T14:01:41.362845079Z"
}
},
"state": {
"state": "ModelAvailable",
"availableReplicas": 1,
"lastChangeTimestamp": "2023-06-29T14:01:41.362845079Z"
},
"modelDefn": {
"meta": {
"name": "iris2",
"kubernetesMeta": {}
},
"modelSpec": {
"uri": "gs://seldon-models/mlserver/iris",
"requirements": [
"sklearn"
]
},
"deploymentSpec": {
"replicas": 1
}
}
}
]
}
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model.yaml
seldon experiment status experiment-sample -w | jq -M .
{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::57 :iris_1::43]
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'seldon model infer iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "fa425bdf-737c-41fe-894d-58868f70fe5d",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'seldon model infer iris -s -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris --inference-mode grpc -s -i 50\
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'Success: map[:iris_1::50]
seldon experiment stop experiment-samplecurl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::100]
kubectl delete -f ./models/sklearn1.yaml -n ${NAMESPACE}
kubectl delete -f ./models/sklearn2.yaml -n ${NAMESPACE}seldon model unload iris
seldon model unload iris2apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
cat ./models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
kubectl apply -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "09263298-ca66-49c5-acb9-0ca75b06f825",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"data": [
2
]
}
]
}
kubectl delete -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
seldon model load -f ./models/sklearn-iris-gs.yaml
{}
seldon model status iris -w ModelAvailable | jq -M .
{}
seldon model infer iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "09263298-ca66-49c5-acb9-0ca75b06f825",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"data": [
2
]
}
]
}
seldon model unload iris
{}
import requests
import json
from typing import Dict, List
import numpy as np
import os
import tensorflow as tf
from alibi_detect.utils.perturbation import apply_mask
from alibi_detect.datasets import fetch_cifar10c
import matplotlib.pyplot as plt
tf.keras.backend.clear_session()
2023-03-09 19:43:43.637892: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-09 19:43:43.637906: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
train, test = tf.keras.datasets.cifar10.load_data()
X_train, y_train = train
X_test, y_test = test
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
classes = (
"plane",
"car",
"bird",
"cat",
"deer",
"dog",
"frog",
"horse",
"ship",
"truck",
)
(50000, 32, 32, 3) (50000, 1) (10000, 32, 32, 3) (10000, 1)
reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
def infer(resourceName: str, idx: int):
rows = X_train[idx:idx+1]
show(rows[0])
reqJson["inputs"][0]["data"] = rows.flatten().tolist()
reqJson["inputs"][0]["shape"] = [1, 32, 32, 3]
headers = {"Content-Type": "application/json", "seldon-model":resourceName}
response_raw = requests.post(url, json=reqJson, headers=headers)
probs = np.array(response_raw.json()["outputs"][0]["data"])
print(classes[probs.argmax(axis=0)])
def show(X):
plt.imshow(X.reshape(32, 32, 3))
plt.axis("off")
plt.show()
cat ./models/cifar10-no-config.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10
spec:
storageUri: "gs://seldon-models/scv2/samples/tensorflow/cifar10"
requirements:
- tensorflow
kubectl apply -f ./models/cifar10-no-config.yaml -n ${NAMESPACE}
model.mlops.seldon.io/cifar10 created
kubectl wait --for condition=ready --timeout=300s model cifar10 -n ${NAMESPACE}
model.mlops.seldon.io/cifar10 condition met
seldon model load -f ./models/cifar10-no-config.yaml
{}
seldon model status cifar10 -w ModelAvailable | jq -M .
{}
infer("cifar10",4)
car
kubectl delete -f ./models/cifar10-no-config.yaml -n ${NAMESPACE}
seldon model unload cifar10
{}
cat ./models/income-xgb.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-xgb
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/income-xgb"
requirements:
- xgboost
kubectl apply -f ./models/income-xgb.yaml -n ${NAMESPACE}
model.mlops.seldon.io/income-xgb created
kubectl wait --for condition=ready --timeout=300s model income-xgb -n ${NAMESPACE}
model.mlops.seldon.io/income-xgb condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/income-xgb/infer' \
--header 'Content-Type: application/json' \
--data '{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'{
"model_name": "income-lgb_1",
"model_version": "1",
"id": "4437a71e-9af1-4e3b-aa4b-cb95d2cd86b9",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "FP64",
"data": [
0.06279460120044741
]
}
]
}
kubectl delete -f ./models/income-xgb.yaml -n ${NAMESPACE}
seldon model load -f ./models/income-xgb.yaml
{}
seldon model status income-xgb -w ModelAvailable | jq -M .
{}
seldon model infer income-xgb \
'{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'{
"model_name": "income-xgb_1",
"model_version": "1",
"id": "e30c3b44-fa14-4e5f-88f5-d6f4d287da20",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "FP32",
"data": [
-1.8380107879638672
]
}
]
}
seldon model unload income-xgb
{}
import matplotlib.pyplot as plt
import json
import requests
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torchvision import transforms
from torch.utils.data import DataLoader
import numpy as np
training_data = MNIST(
root=".",
download=True,
train=False,
transform = transforms.Compose([
transforms.ToTensor()
])
)
reqJson = json.loads('{"inputs":[{"name":"Input3","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
dl = DataLoader(training_data, batch_size=1, shuffle=False)
dlIter = iter(dl)
def infer_mnist():
x, y = next(dlIter)
data = x.cpu().numpy()
reqJson["inputs"][0]["data"] = data.flatten().tolist()
reqJson["inputs"][0]["shape"] = [1, 1, 28, 28]
headers = {"Content-Type": "application/json", "seldon-model":"mnist-onnx"}
response_raw = requests.post(url, json=reqJson, headers=headers)
show_mnist(x)
probs = np.array(response_raw.json()["outputs"][0]["data"])
print(probs.argmax(axis=0))
def show_mnist(X):
plt.imshow(X.reshape(28, 28))
plt.axis("off")
plt.show()
cat ./models/mnist-onnx.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mnist-onnx
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mnist-onnx"
requirements:
- onnx
kubectl apply -f ./models/mnist-onnx.yaml -n ${NAMESPACE}
model.mlops.seldon.io/mnist-onnx created
kubectl wait --for condition=ready --timeout=300s model mnist-onnx -n ${NAMESPACE}
model.mlops.seldon.io/mnist-onnx condition met
seldon model load -f ./models/mnist-onnx.yaml
{}
seldon model status mnist-onnx -w ModelAvailable | jq -M .
{}
infer_mnist()
7
kubectl delete -f ./models/mnist-onnx.yaml -n ${NAMESPACE}
seldon model unload mnist-onnx
{}
cat ./models/income-lgb.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-lgb
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/income-lgb"
requirements:
- lightgbm
kubectl apply -f ./models/income-lgb.yaml -n ${NAMESPACE}
model.mlops.seldon.io/income-lgb created
kubectl wait --for condition=ready --timeout=300s model income-lgb -n ${NAMESPACE}
model.mlops.seldon.io/income-lgb condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/income-lgb/infer' \
--header 'Content-Type: application/json' \
--data '{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'{
"model_name": "income-lgb_1",
"model_version": "1",
"id": "4437a71e-9af1-4e3b-aa4b-cb95d2cd86b9",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "FP64",
"data": [
0.06279460120044741
]
}
]
}
kubectl delete -f ./models/income-lgb.yaml -n ${NAMESPACE}
seldon model load -f ./models/income-lgb.yaml
{}
seldon model status income-lgb -w ModelAvailable | jq -M .
{}
seldon model infer income-lgb \
'{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'{
"model_name": "income-lgb_1",
"model_version": "1",
"id": "4437a71e-9af1-4e3b-aa4b-cb95d2cd86b9",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "FP64",
"data": [
0.06279460120044741
]
}
]
}
seldon model unload income-lgb
{}
cat ./models/wine-mlflow.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: wine
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/wine-mlflow"
requirements:
- mlflow
kubectl apply -f ./models/wine-mlflow.yaml -n ${NAMESPACE}
model.mlops.seldon.io/wine created
kubectl wait --for condition=ready --timeout=300s model wine -n ${NAMESPACE}
model.mlops.seldon.io/wine condition met
seldon model load -f ./models/wine-mlflow.yaml
{}
seldon model status wine -w ModelAvailable | jq -M .
{}
import requests
url = "http://0.0.0.0:9000/v2/models/model/infer"
inference_request = {
"inputs": [
{
"name": "fixed acidity",
"shape": [1],
"datatype": "FP32",
"data": [7.4],
},
{
"name": "volatile acidity",
"shape": [1],
"datatype": "FP32",
"data": [0.7000],
},
{
"name": "citric acid",
"shape": [1],
"datatype": "FP32",
"data": [0],
},
{
"name": "residual sugar",
"shape": [1],
"datatype": "FP32",
"data": [1.9],
},
{
"name": "chlorides",
"shape": [1],
"datatype": "FP32",
"data": [0.076],
},
{
"name": "free sulfur dioxide",
"shape": [1],
"datatype": "FP32",
"data": [11],
},
{
"name": "total sulfur dioxide",
"shape": [1],
"datatype": "FP32",
"data": [34],
},
{
"name": "density",
"shape": [1],
"datatype": "FP32",
"data": [0.9978],
},
{
"name": "pH",
"shape": [1],
"datatype": "FP32",
"data": [3.51],
},
{
"name": "sulphates",
"shape": [1],
"datatype": "FP32",
"data": [0.56],
},
{
"name": "alcohol",
"shape": [1],
"datatype": "FP32",
"data": [9.4],
},
]
}
headers = {"Content-Type": "application/json", "seldon-model":"wine"}
response_raw = requests.post(url, json=inference_request, headers=headers)
print(response_raw.json())
{'model_name': 'wine_1', 'model_version': '1', 'id': '0d7e44f8-b46c-4438-b8af-a749e6aa6039', 'parameters': {}, 'outputs': [{'name': 'output-1', 'shape': [1, 1], 'datatype': 'FP64', 'data': [5.576883936610762]}]}
kubectl delete model wine -n ${NAMESPACE}
seldon model unload wine
{}
import numpy as np
import matplotlib.pyplot as plt
import json
import requests
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torchvision import transforms
from torch.utils.data import DataLoader
training_data = MNIST(
root=".",
download=True,
train=False,
transform = transforms.Compose([
transforms.ToTensor()
])
)
reqJson = json.loads('{"inputs":[{"name":"x__0","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
dl = DataLoader(training_data, batch_size=1, shuffle=False)
dlIter = iter(dl)
def infer_mnist():
x, y = next(dlIter)
data = x.cpu().numpy()
reqJson["inputs"][0]["data"] = data.flatten().tolist()
reqJson["inputs"][0]["shape"] = [1, 1, 28, 28]
headers = {"Content-Type": "application/json", "seldon-model":"mnist-pytorch"}
response_raw = requests.post(url, json=reqJson, headers=headers)
show_mnist(x)
probs = np.array(response_raw.json()["outputs"][0]["data"])
print(probs.argmax(axis=0))
def show_mnist(X):
plt.imshow(X.reshape(28, 28))
plt.axis("off")
plt.show()
cat ./models/mnist-pytorch.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mnist-pytorch
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mnist-pytorch"
requirements:
- pytorch
kubectl apply -f ./models/mnist-pytorch.yaml -n ${NAMESPACE}
model.mlops.seldon.io/mnist-pytorch created
kubectl wait --for condition=ready --timeout=300s model mnist-pytorch -n ${NAMESPACE}
model.mlops.seldon.io/mnist-pytorch condition met
seldon model load -f ./models/mnist-pytorch.yaml
{}
seldon model status mnist-pytorch -w ModelAvailable | jq -M .{}
infer_mnist()7
kubectl delete -f ./models/mnist-pytorch.yaml -n ${NAMESPACE}
model.mlops.seldon.io "mnist-pytorch" deleted
seldon model unload mnist-pytorch
{}
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.codecs import PandasCodec
from mlserver.errors import MLServerError
import pandas as pd
from fastapi import status
from mlserver.logging import logger
QUERY_KEY = "query"
class ModelParametersMissing(MLServerError):
def __init__(self, model_name: str, reason: str):
super().__init__(
f"Parameters missing for model {model_name} {reason}", status.HTTP_400_BAD_REQUEST
)
class PandasQueryRuntime(MLModel):
async def load(self) -> bool:
logger.info("Loading with settings %s", self.settings)
if self.settings.parameters is None or \
self.settings.parameters.extra is None:
raise ModelParametersMissing(self.name, "no settings.parameters.extra found")
self.query = self.settings.parameters.extra[QUERY_KEY]
if self.query is None:
raise ModelParametersMissing(self.name, "no settings.parameters.extra.query found")
self.ready = True
return self.ready
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
input_df: pd.DataFrame = PandasCodec.decode_request(payload)
# run query on input_df and save in output_df
output_df = input_df.query(self.query)
if output_df.empty:
output_df = pd.DataFrame({'status':["no rows satisfied " + self.query]})
else:
output_df["status"] = "row satisfied " + self.query
return PandasCodec.encode_response(self.name, output_df, self.version)
cat ../../models/choice1.yaml
echo "---"
cat ../../models/choice2.yaml
echo "---"
cat ../../models/add10.yaml
echo "---"
cat ../../models/mul10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-one
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 1"
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-two
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 2"
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
kubectl apply -f ../../models/choice1.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/choice2.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/add10.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/mul10.yaml -n ${NAMESPACE}model.mlops.seldon.io/choice1 created
model.mlops.seldon.io/choice2 created
model.mlops.seldon.io/add10 created
model.mlops.seldon.io/mul10 createdseldon model load -f ../../models/choice1.yaml
seldon model load -f ../../models/choice2.yaml
seldon model load -f ../../models/add10.yaml
seldon model load -f ../../models/mul10.yaml{}
{}
{}
{}kubectl wait --for condition=ready --timeout=300s model choice-is-one -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model choice-is-two -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model add10 -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model mul10 -n ${NAMESPACE}model.mlops.seldon.io/choice-is-one condition met
model.mlops.seldon.io/choice-is-two condition met
model.mlops.seldon.io/add10 condition met
model.mlops.seldon.io/mul10 condition metseldon model status choice-is-one -w ModelAvailable
seldon model status choice-is-two -w ModelAvailable
seldon model status add10 -w ModelAvailable
seldon model status mul10 -w ModelAvailable{}
{}
{}
{}
cat ../../pipelines/choice.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: choice
spec:
steps:
- name: choice-is-one
- name: mul10
inputs:
- choice.inputs.INPUT
triggers:
- choice-is-one.outputs.choice
- name: choice-is-two
- name: add10
inputs:
- choice.inputs.INPUT
triggers:
- choice-is-two.outputs.choice
output:
steps:
- mul10
- add10
stepsJoin: any
kubectl apply -f pipelines/choice.yaml
pipeline.mlops.seldon.io/choice created
kubectl wait --for condition=ready --timeout=300s pipelines choice -n ${NAMESPACE}
pipeline.mlops.seldon.io/choice condition met
seldon pipeline load -f ../../pipelines/choice.yaml
seldon pipeline status choice -w PipelineReady | jq -M .
{
"pipelineName": "choice",
"versions": [
{
"pipeline": {
"name": "choice",
"uid": "cifel9aufmbc73e5intg",
"version": 1,
"steps": [
{
"name": "add10",
"inputs": [
"choice.inputs.INPUT"
],
"triggers": [
"choice-is-two.outputs.choice"
]
},
{
"name": "choice-is-one"
},
{
"name": "choice-is-two"
},
{
"name": "mul10",
"inputs": [
"choice.inputs.INPUT"
],
"triggers": [
"choice-is-one.outputs.choice"
]
}
],
"output": {
"steps": [
"mul10.outputs",
"add10.outputs"
],
"stepsJoin": "ANY"
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 1,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-06-30T14:45:57.284684328Z",
"modelsReady": true
}
}
]
}seldon pipeline infer choice --inference-mode grpc \
'{"model_name":"choice","inputs":[{"name":"choice","contents":{"int_contents":[1]},"datatype":"INT32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[5,6,7,8]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
50,
60,
70,
80
]
}
}
]
}
seldon pipeline infer choice --inference-mode grpc \
'{"model_name":"choice","inputs":[{"name":"choice","contents":{"int_contents":[2]},"datatype":"INT32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[5,6,7,8]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
15,
16,
17,
18
]
}
}
]
}
kubectl delete -f ../../models/choice1.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/choice2.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/add10.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/mul10.yaml -n ${NAMESPACE}
kubectl delete -f ../../pipelines/choice.yaml -n ${NAMESPACE}
seldon model unload choice-is-one
seldon model unload choice-is-two
seldon model unload add10
seldon model unload mul10
seldon pipeline unload choice
{
"topicPrefix": "seldon",
"bootstrap.servers":"kafka:9093",
"consumer":{
"session.timeout.ms":6000,
"auto.offset.reset":"earliest",
"topic.metadata.propagation.max.ms": 300000,
"message.max.bytes":1000000000
},
"producer":{
"linger.ms":0,
"message.max.bytes":1000000000
},
"streams":{
}
}
The outlier detector is created from the CIFAR10 VAE Outlier example.
The drift detector is created from the CIFAR10 KS Drift example.
To run the training locally, run the training notebook.
Use the seldon CLI to look at the outputs from the CIFAR10 model. It will decode the Triton binary outputs for us.
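For example, once the CIFAR10 pipeline below is running, the decoded model outputs can be read straight from the pipeline's Kafka topics with the CLI (the same command appears later in this example; the exact values will differ per request):
seldon pipeline inspect cifar10-production.cifar10.outputs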

We will use two SKlearn Iris classification models to illustrate experiments.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.
Start the experiment.
Wait for the experiment to be ready.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Show the sticky session header x-seldon-route that is returned.
Use the sticky session key passed by the last inference request to ensure the same route is taken each time.
Stop the experiment.
Unload both models.
Use the sticky session key passed by the last inference request to ensure the same route is taken each time, as shown in the sketch below.
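Outside of the CLI's -s flag, the same stickiness can be achieved over plain REST by echoing the returned route header back on the next request. A minimal sketch, assuming the experiment is active and a previous response returned X-Seldon-Route:[:iris_1:]:
curl 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
  --header 'Content-Type: application/json' \
  --header 'x-seldon-route: :iris_1:' \
  --data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'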
We will use two SKlearn Iris classification models to illustrate a model with a mirror.
Load both models.
Wait for both models to be ready.
Create an experiment in which traffic sent to iris is also mirrored to iris2.
Start the experiment.
Wait for the experiment to be ready.
We get responses from iris, but all requests are also mirrored to iris2.
We can check the agent's local Prometheus metrics port to validate that the mirrored requests reached iris2.
Stop the experiment.
Unload both models.
Let's check that the mul10 model was called.
Let's do an HTTP call and check the two models again.
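The check itself is a scrape of the agent's metrics endpoint; a minimal sketch, assuming the agent metrics are exposed locally on port 9007 as in the commands later in this example:
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep mul10_1
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep add10_1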
In this example we create a Pipeline to chain two Hugging Face models to provide speech-to-sentiment functionality, and add an explainer to understand the result.
This example also illustrates how explainers can target pipelines to allow complex explanation flows.
This example requires the ffmpeg package to be installed locally. Run make install-requirements for the Python dependencies.
Create a method to load speech from the recorder, transform it into MP3, and send it as base64 data. On return of the result, extract and show the text and sentiment.
We will load two Hugging Face models for speech-to-text and text-to-sentiment.
To allow Alibi-Explain to more easily explain the sentiment, we will need:
input and output transforms that take the Dict values consumed and produced by the Hugging Face sentiment model and turn them into values that Alibi-Explain can easily work with, namely the core values we want to explain and the outputs from the sentiment model.
A separate Pipeline to allow us to join the sentiment model with the output transform.
These transform models are MLServer custom runtimes, as shown below:
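The full transform code is kept with the example sources; what follows is only a minimal sketch of what such an output transform runtime might look like, mirroring the PandasQueryRuntime pattern shown earlier. The class name, tensor name, and label mapping are illustrative assumptions, not the exact code from the example.
import json

import numpy as np
from mlserver import MLModel
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class SentimentOutputTransform(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # The sentiment model emits JSON strings such as {"label": "POSITIVE", "score": 0.99};
        # map them to a simple 0/1 tensor that Alibi-Explain can consume.
        raw = StringCodec.decode_input(payload.inputs[0])
        labels = np.array([1 if json.loads(r)["label"] == "POSITIVE" else 0 for r in raw])
        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output("sentiment", labels)],
        )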
We can now create the final pipeline that will take speech and generate sentiment along with an explanation of why that sentiment was predicted.
We will wait for the explanation, which runs asynchronously to the functional output from the Pipeline above.
This example illustrates how to use taints and tolerations with nodeAffinity or nodeSelector to assign GPU nodes to specific models.
To serve a model on a dedicated GPU node, you should follow these steps:
You can add the taint when you are creating the node or after the node has been provisioned. You can apply the same taint to multiple nodes, not just a single node. A common approach is to define the taint at the node pool level.
When you apply a NoSchedule taint to a node after it is created, existing Pods that do not have a matching toleration remain on the node without being evicted. To ensure that such Pods are removed, use the NoExecute taint effect instead.
In this example, the node includes several labels that are used later for node affinity settings. You may choose to specify some labels yourself, while others are usually added by the cloud provider or a GPU operator installed in the cluster.
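A minimal sketch of these two steps, using an illustrative node name, taint key, and pool label (adjust to your own cluster and provider):
kubectl taint nodes gpu-node-1 seldon-gpu-srv=true:NoSchedule
kubectl label nodes gpu-node-1 pool=infer-gpu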
To ensure a specific inference server Pod runs only on the nodes you've configured, you can use nodeSelector or nodeAffinity together with a toleration by modifying one of the following:
Seldon Server custom resource: Apply changes to each individual inference server.
ServerConfig custom resource: Apply settings across multiple inference servers at once.
Configuring Seldon Server custom resource
While nodeSelector requires an exact match of node labels for server Pods to select a node, nodeAffinity offers more fine-grained control. It enables a conditional approach by using logical operators in the node selection process. For more information, see the Kubernetes documentation on assigning Pods to nodes.
In this example, a nodeSelector and a toleration are set for the Seldon Server custom resource.
In this example, a nodeAffinity and a toleration are set for the Seldon Server custom resource.
You can configure more advanced Pod selection using nodeAffinity, as in this example:
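As a sketch only, assuming the Server custom resource accepts a podSpec override for scheduling fields (verify the exact schema against your Core 2 version; the names and label values here are illustrative), such a Server might look like:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu
spec:
  serverConfig: mlserver
  podSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: pool
              operator: In
              values:
              - infer-gpu
    tolerations:
    - key: seldon-gpu-srv
      operator: Equal
      value: "true"
      effect: NoSchedule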
Configuring ServerConfig custom resource
This configuration automatically affects all servers using that ServerConfig, unless you specify server-specific overrides, which take precedence.
When you have a set of inference servers running exclusively on GPU nodes, you can assign a model to one of those servers in two ways:
Custom model requirements (recommended)
Explicit server pinning
Here's the distinction between the two methods of assigning models to servers.
When you specify a requirement that matches a server capability in the Model custom resource, the model is loaded on any inference server with a matching capability.
Ensure that the additional capability that matches the requirement label is added to the Server custom resource.
Instead of adding a capability using extraCapabilities on a Server custom resource, you may also add to the list of capabilities in the associated ServerConfig custom resource. This applies to all servers referencing the configuration.
With these specifications, the model is loaded on replicas of inference servers created by the referenced Server custom resource.
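A minimal sketch of that matching, using an illustrative capability name (gpu-int8) and an illustrative model name and storage URI:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu
spec:
  serverConfig: mlserver
  extraCapabilities:
  - gpu-int8
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-gpu-model
spec:
  storageUri: "gs://my-bucket/my-gpu-model"
  requirements:
  - gpu-int8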
Requires mlserver to be installed.
Deprecated: The MLServer CLI infer feature is experimental and will be removed in future work.
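If mlserver is not already available locally, it can typically be installed from PyPI:
pip install mlserver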
kubectl apply -f ../../models/cifar10.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/cifar10-outlier-detect.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/cifar10-drift-detect.yaml -n ${NAMESPACE}
model.mlops.seldon.io/cifar10 created
model.mlops.seldon.io/cifar10-outlier created
model.mlops.seldon.io/cifar10-drift created
kubectl wait --for condition=ready --timeout=300s model cifar10 -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model cifar10-outlier -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model cifar10-drift -n ${NAMESPACE}
model.mlops.seldon.io/cifar10 condition met
model.mlops.seldon.io/cifar10-outlier condition met
model.mlops.seldon.io/cifar10-drift condition met
seldon model load -f ../../models/cifar10.yaml
seldon model load -f ../../models/cifar10-outlier-detect.yaml
seldon model load -f ../../models/cifar10-drift-detect.yaml{}
{}
{}
seldon model status cifar10 -w ModelAvailable | jq .
seldon model status cifar10-outlier -w ModelAvailable | jq .
seldon model status cifar10-drift -w ModelAvailable | jq .{}
{}
{}
kubectl apply -f ../../pipelines/cifar10.yaml -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s pipelines cifar10-production -n ${NAMESPACE}
pipeline.mlops.seldon.io/cifar10-production condition met
seldon pipeline load -f ../../pipelines/cifar10.yaml
seldon pipeline status cifar10-production -w PipelineReady | jq -M .
{
"pipelineName": "cifar10-production",
"versions": [
{
"pipeline": {
"name": "cifar10-production",
"uid": "cifeii2ufmbc73e5insg",
"version": 1,
"steps": [
{
"name": "cifar10"
},
{
"name": "cifar10-drift",
"batch": {
"size": 20
}
},
{
"name": "cifar10-outlier"
}
],
"output": {
"steps": [
"cifar10.outputs",
"cifar10-outlier.outputs.is_outlier"
]
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 1,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-06-30T14:40:09.047429817Z",
"modelsReady": true
}
}
]
}
kubectl delete -f ../../pipelines/cifar10.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/cifar10.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/cifar10-outlier-detect.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/cifar10-drift-detect.yaml -n ${NAMESPACE}
seldon pipeline unload cifar10-production
seldon model unload cifar10
seldon model unload cifar10-outlier
seldon model unload cifar10-drift
import requests
import json
from typing import Dict, List
import numpy as np
import os
import tensorflow as tf
from alibi_detect.utils.perturbation import apply_mask
from alibi_detect.datasets import fetch_cifar10c
import matplotlib.pyplot as plt
tf.keras.backend.clear_session()2023-06-30 15:39:28.732453: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-06-30 15:39:28.732465: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
train, test = tf.keras.datasets.cifar10.load_data()
X_train, y_train = train
X_test, y_test = test
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
classes = (
"plane",
"car",
"bird",
"cat",
"deer",
"dog",
"frog",
"horse",
"ship",
"truck",
)
(50000, 32, 32, 3) (50000, 1) (10000, 32, 32, 3) (10000, 1)
outliers = []
for idx in range(0,X_train.shape[0]):
X_mask, mask = apply_mask(X_train[idx].reshape(1, 32, 32, 3),
mask_size=(14,14),
n_masks=1,
channels=[0,1,2],
mask_type='normal',
noise_distr=(0,1),
clip_rng=(0,1))
outliers.append(X_mask)
X_outliers = np.vstack(outliers)
X_outliers.shape(50000, 32, 32, 3)
corruption = ['brightness']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255
reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
def infer(resourceName: str, batchSz: int, requestType: str):
if requestType == "outlier":
rows = X_outliers[0:0+batchSz]
elif requestType == "drift":
rows = X_corr[0:0+batchSz]
else:
rows = X_train[0:0+batchSz]
for i in range(batchSz):
show(rows[i])
reqJson["inputs"][0]["data"] = rows.flatten().tolist()
reqJson["inputs"][0]["shape"] = [batchSz, 32, 32, 3]
headers = {"Content-Type": "application/json", "seldon-model":resourceName}
response_raw = requests.post(url, json=reqJson, headers=headers)
print(response_raw)
print(response_raw.json())
def show(X):
plt.imshow(X.reshape(32, 32, 3))
plt.axis("off")
plt.show()
cat ../../models/cifar10.yaml
echo "---"
cat ../../models/cifar10-outlier-detect.yaml
echo "---"
cat ../../models/cifar10-drift-detect.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10
spec:
storageUri: "gs://seldon-models/triton/tf_cifar10"
requirements:
- tensorflow
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10-outlier
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/outlier-detector"
requirements:
- mlserver
- alibi-detect
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10-drift
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/drift-detector"
requirements:
- mlserver
- alibi-detect
cat ../../pipelines/cifar10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: cifar10-production
spec:
steps:
- name: cifar10
- name: cifar10-outlier
- name: cifar10-drift
batch:
size: 20
output:
steps:
- cifar10
- cifar10-outlier.outputs.is_outlier
infer("cifar10-production.pipeline",20, "normal")<Response [200]>
{'model_name': '', 'outputs': [{'data': [1.45001495e-08, 1.2525752e-09, 1.6298458e-07, 0.11529388, 1.7431412e-07, 6.1856604e-06, 0.8846994, 6.0739285e-09, 7.437921e-08, 4.7317337e-09, 1.26449e-06, 4.8814868e-09, 1.5153439e-09, 8.490656e-09, 5.5131194e-10, 1.1617216e-09, 5.7729294e-10, 2.8839776e-07, 0.0006149016, 0.99938357, 0.888746, 2.5331951e-06, 0.00012967695, 0.10531583, 2.4284174e-05, 6.3332986e-06, 0.0016261435, 1.13079e-05, 0.0013286703, 0.0028091935, 2.0993439e-06, 3.680449e-08, 0.0013269952, 2.1766558e-05, 0.99841356, 0.00015300694, 6.9472035e-06, 1.3277059e-05, 6.1860555e-05, 3.4072806e-07, 1.1205097e-05, 0.99997175, 1.9948227e-07, 6.9880834e-08, 3.3387135e-08, 5.2603138e-08, 3.0352305e-07, 4.3738982e-08, 5.3243946e-07, 1.5870584e-05, 0.0006525102, 0.013322109, 1.480307e-06, 0.9766325, 4.9847167e-05, 0.00058075984, 0.008405659, 5.2234273e-06, 0.00023390084, 0.000116047224, 1.6682397e-06, 5.7737526e-10, 0.9975605, 6.45564e-05, 0.002371972, 1.0392675e-07, 9.747962e-08, 1.4484569e-07, 8.762438e-07, 2.4758325e-08, 5.028761e-09, 6.856381e-11, 5.9932094e-12, 4.921233e-10, 1.471166e-07, 2.7940719e-06, 3.4563383e-09, 0.99999714, 5.9420524e-10, 9.445026e-11, 4.1854888e-05, 5.041549e-08, 8.0302314e-08, 1.2119854e-07, 6.781646e-09, 1.2616152e-08, 1.1878505e-08, 1.628573e-09, 0.9999578, 3.281738e-08, 0.08930307, 1.4065135e-07, 4.1117343e-07, 0.90898305, 8.933351e-07, 0.0015637449, 0.00013868928, 9.092981e-06, 4.8759745e-07, 4.3976044e-07, 0.00016094849, 3.5653954e-07, 0.0760521, 0.8927447, 0.0011777573, 0.00265573, 0.027189083, 4.1892267e-06, 1.329405e-05, 1.8564688e-06, 1.3373891e-06, 1.0251247e-07, 8.651912e-09, 4.458202e-06, 1.4646349e-05, 1.260957e-06, 1.046087e-08, 0.9998946, 8.332438e-05, 3.900894e-07, 6.53852e-05, 3.012202e-08, 1.0247197e-07, 1.8824371e-06, 0.0004958526, 3.533475e-05, 2.739997e-07, 0.99939275, 4.840305e-06, 3.5346695e-06, 0.0005518078, 3.1597017e-07, 0.99902296, 0.00031509742, 8.07886e-07, 1.6366084e-06, 2.795575e-06, 6.112367e-06, 9.817249e-05, 2.602709e-07, 0.0004561966, 5.360607e-06, 2.8656412e-05, 0.000116040654, 6.881144e-05, 8.844774e-06, 4.4655946e-05, 3.5564542e-05, 0.006564381, 0.9926715, 0.007300911, 1.766928e-06, 3.0520596e-07, 0.026906287, 1.3769699e-06, 0.00027539674, 5.583593e-06, 3.792553e-06, 0.0003876767, 0.9651169, 0.18114138, 2.8360228e-05, 0.00019927241, 0.007685872, 0.00014663498, 3.9361137e-05, 5.941682e-05, 7.36174e-05, 0.79936546, 0.01126067, 2.3992783e-11, 7.6336457e-16, 1.4644799e-15, 1, 2.4652159e-14, 1.1786078e-10, 1.9402116e-13, 4.2408636e-15, 1.209294e-15, 2.9042784e-15, 1.5366902e-08, 1.2476195e-09, 1.3560152e-07, 0.999997, 4.3113017e-11, 2.8163534e-08, 2.4494727e-06, 1.3122828e-10, 3.8081083e-07, 2.1628158e-11, 0.0004926238, 6.9424555e-06, 2.827196e-05, 0.92534137, 9.500486e-06, 0.00036133997, 0.072713904, 1.2831057e-07, 0.0010457055, 2.8514464e-07], 'name': 'fc10', 'shape': [20, 10], 'datatype': 'FP32'}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect cifar10-production.cifar10-drift.outputs.is_driftseldon.default.model.cifar10-drift.outputs cifeij8fh5ss738i5bp0 {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
infer("cifar10-production.pipeline",20, "drift")<Response [200]>
{'model_name': '', 'outputs': [{'data': [8.080701e-09, 2.3025173e-12, 2.2681688e-09, 1, 4.1828953e-11, 4.48467e-09, 3.216822e-08, 2.8404365e-13, 5.217064e-09, 3.3497323e-13, 0.96965235, 4.7030144e-06, 1.6964266e-07, 1.7355454e-05, 2.6667e-06, 1.9505828e-06, 1.1363079e-07, 3.3352034e-08, 0.030320557, 1.7086056e-07, 0.03725602, 6.8623276e-06, 7.5557014e-05, 0.00018132397, 2.2838503e-05, 0.000110639296, 2.3732607e-06, 2.1210687e-06, 0.9623351, 7.131072e-06, 0.999079, 4.207448e-09, 1.5788535e-08, 2.723756e-08, 2.6555508e-11, 2.1526697e-10, 2.7599315e-10, 2.0737433e-10, 0.0009210062, 3.0885383e-09, 6.665241e-07, 1.7765576e-09, 1.4911559e-07, 0.9765331, 1.9476123e-07, 2.8244015e-06, 0.023463126, 5.8030287e-09, 3.243206e-09, 1.12179785e-08, 4.4123663e-06, 4.7628927e-09, 1.1727273e-08, 0.9761534, 1.1409252e-08, 8.922882e-05, 0.023752932, 3.1563903e-08, 2.7916305e-09, 8.7746266e-10, 1.0166265e-05, 0.999703, 4.5408615e-05, 0.00022673907, 1.7365853e-07, 1.0147362e-06, 6.253448e-06, 2.9711526e-07, 7.811687e-07, 6.183683e-06, 0.86618125, 5.47548e-07, 0.00038408802, 0.013155022, 3.6916779e-06, 0.0006137024, 0.11965008, 3.6425424e-06, 6.7638084e-06, 1.2372367e-06, 1.9545263e-05, 1.1281859e-13, 1.6811868e-14, 0.9999777, 1.9805435e-11, 2.7563674e-06, 2.9651657e-09, 1.1363432e-12, 2.9902746e-13, 1.220973e-12, 2.9895918e-05, 3.4964305e-07, 1.1331837e-08, 1.7012125e-06, 3.6088227e-07, 3.035954e-08, 2.2102333e-06, 1.7414077e-08, 0.9999455, 1.9921794e-05, 0.9999999, 5.3446598e-11, 6.3188843e-10, 1.0956511e-07, 1.1538642e-10, 8.113561e-10, 4.7179572e-08, 1.4544753e-11, 5.490219e-08, 1.3347151e-10, 1.5363307e-07, 6.604881e-09, 2.424105e-10, 9.963063e-09, 3.9349533e-09, 1.5709017e-09, 7.705774e-10, 4.8085802e-08, 1.8885139e-05, 0.9999809, 7.147243e-08, 3.143131e-13, 2.1447092e-13, 0.00042652222, 6.945973e-12, 0.9995734, 6.174434e-09, 4.1128205e-11, 3.4031404e-13, 8.573159e-15, 1.2226405e-09, 2.3768018e-10, 2.822187e-07, 8.016278e-08, 4.0692296e-08, 6.8023346e-06, 2.3926754e-07, 0.9999925, 6.652648e-09, 7.743497e-09, 7.6360675e-06, 5.9386625e-09, 1.5675019e-09, 2.136716e-07, 1.3074002e-06, 3.700079e-10, 1.0984521e-09, 6.2138824e-08, 0.9609078, 0.03908287, 0.0008332255, 7.696685e-08, 2.4428939e-09, 7.186676e-05, 1.4520063e-09, 1.4521317e-08, 1.09093e-06, 1.2531165e-10, 0.9990938, 5.798501e-09, 5.785368e-05, 3.82365e-09, 7.404351e-08, 0.008338481, 8.048078e-10, 0.99157715, 1.1663455e-05, 1.4583546e-05, 8.3543476e-08, 3.274394e-08, 2.4682688e-05, 1.3951502e-09, 1.0260489e-08, 0.9998845, 1.9418138e-08, 8.667954e-07, 2.1851054e-07, 8.917964e-05, 4.4437223e-07, 1.1292918e-07, 4.5302792e-07, 5.631744e-08, 2.9086214e-08, 3.1013877e-07, 7.695681e-09, 2.1452344e-09, 1.1493902e-08, 6.1980093e-10, 0.99999917, 1.1436694e-08, 2.42685e-05, 8.557389e-08, 0.024081504, 0.0073837163, 4.8152968e-05, 5.128531e-07, 0.9684405, 9.630179e-08, 2.1060101e-05, 1.901065e-07], 'name': 'fc10', 'shape': [20, 10], 'datatype': 'FP32'}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect cifar10-production.cifar10-drift.outputs.is_driftseldon.default.model.cifar10-drift.outputs cifeimgfh5ss738i5bpg {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["1"]}}
infer("cifar10-production.pipeline",1, "outlier")<Response [200]>
{'model_name': '', 'outputs': [{'data': [6.3606867e-06, 0.0006106364, 0.0054279356, 0.6536454, 1.4738829e-05, 2.6104701e-06, 0.3397848, 1.3538776e-05, 0.0004458526, 4.807229e-05], 'name': 'fc10', 'shape': [1, 10], 'datatype': 'FP32'}, {'data': [1], 'name': 'is_outlier', 'shape': [1, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
infer("cifar10-production.pipeline",1, "ok")<Response [200]>
{'model_name': '', 'outputs': [{'data': [1.45001495e-08, 1.2525752e-09, 1.6298458e-07, 0.11529388, 1.7431412e-07, 6.1856604e-06, 0.8846994, 6.0739285e-09, 7.43792e-08, 4.7317337e-09], 'name': 'fc10', 'shape': [1, 10], 'datatype': 'FP32'}, {'data': [0], 'name': 'is_outlier', 'shape': [1, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect cifar10-production.cifar10.outputsseldon.default.model.cifar10.outputs cifeiq8fh5ss738i5bqg {"modelName":"cifar10_1", "modelVersion":"1", "outputs":[{"name":"fc10", "datatype":"FP32", "shape":["1", "10"], "contents":{"fp32Contents":[1.45001495e-8, 1.2525752e-9, 1.6298458e-7, 0.11529388, 1.7431412e-7, 0.0000061856604, 0.8846994, 6.0739285e-9, 7.43792e-8, 4.7317337e-9]}}]}
cat ./models/sklearn1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
cat ./models/sklearn2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris2
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml{}
{}
seldon model status iris -w ModelAvailable
seldon model status iris2 -w ModelAvailable{}
{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model.yaml
seldon experiment status experiment-sample -w | jq -M .
{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
from ipywebrtc import AudioRecorder, CameraStream
import torchaudio
from IPython.display import Audio
import base64
import json
import requests
import os
import time
reqJson = json.loads('{"inputs":[{"name":"args", "parameters": {"content_type": "base64"}, "data":[],"datatype":"BYTES","shape":[1]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
def infer(resource: str):
with open('recording.webm', 'wb') as f:
f.write(recorder.audio.value)
!ffmpeg -i recording.webm -vn -ab 128k -ar 44100 file.mp3 -y -hide_banner -loglevel panic
with open("file.mp3", mode='rb') as file:
fileContent = file.read()
encoded = base64.b64encode(fileContent)
base64_message = encoded.decode('utf-8')
reqJson["inputs"][0]["data"] = [str(base64_message)]
headers = {"Content-Type": "application/json", "seldon-model": resource}
response_raw = requests.post(url, json=reqJson, headers=headers)
j = response_raw.json()
sentiment = j["outputs"][0]["data"][0]
text = j["outputs"][1]["data"][0]
reqId = response_raw.headers["x-request-id"]
print(reqId)
os.environ["REQUEST_ID"]=reqId
print(base64.b64decode(text))
print(base64.b64decode(sentiment))
Custom model requirements
If the assigned server cannot load the model due to insufficient resources, another similarly-capable server can be selected to load the model.
Explicit pinning
If the specified server lacks sufficient memory or resources, the model load fails without trying another server.
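A minimal sketch of explicit pinning, assuming the Model spec's server field is used to pin the model to a named Server (verify the field name against your Core 2 version; the server name here is illustrative):
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/mlserver/iris"
  requirements:
  - sklearn
  server: mlserver-gpu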
Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.
Ensure that you are performing these steps in the directory where you have downloaded the samples.
Get the IP address of the Seldon Core 2 instance running with Istio:
Output is similar to:
Make a gRPC inference call
Delete the model
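A minimal sketch of the "get the IP address" step above, assuming Istio's ingress gateway Service is named istio-ingressgateway and lives in the istio-system namespace:
ISTIO_INGRESS=$(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $ISTIO_INGRESS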
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::27 :iris_1::23]
seldon model infer iris --show-headers \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'> POST /v2/models/iris/infer HTTP/1.1
> Host: localhost:9000
> Content-Type:[application/json]
> Seldon-Model:[iris]
< X-Seldon-Route:[:iris_1:]
< Ce-Id:[463e96ad-645f-4442-8890-4c340b58820b]
< Traceparent:[00-fe9e87fcbe4be98ed82fb76166e15ceb-d35e7ac96bd8b718-01]
< X-Envoy-Upstream-Service-Time:[3]
< Ce-Specversion:[0.3]
< Date:[Thu, 29 Jun 2023 14:03:03 GMT]
< Ce-Source:[io.seldon.serving.deployment.mlserver]
< Content-Type:[application/json]
< Server:[envoy]
< X-Request-Id:[cieou5ofh5ss73fbjdu0]
< Ce-Endpoint:[iris_1]
< Ce-Modelid:[iris_1]
< Ce-Type:[io.seldon.serving.inference.response]
< Content-Length:[213]
< Ce-Inferenceservicename:[mlserver]
< Ce-Requestid:[463e96ad-645f-4442-8890-4c340b58820b]
{
"model_name": "iris_1",
"model_version": "1",
"id": "463e96ad-645f-4442-8890-4c340b58820b",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon model infer iris -s -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris --inference-mode grpc -s -i 50\
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'Success: map[:iris_1::50]
seldon experiment stop experiment-sample
seldon model unload iris
seldon model unload iris2
cat ./models/add10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
cat ./models/mul10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
seldon model load -f ./models/add10.yaml
seldon model load -f ./models/mul10.yaml{}
{}
seldon model status add10 -w ModelAvailable
seldon model status mul10 -w ModelAvailable{}
{}
cat ./pipelines/mul10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-mul10
spec:
steps:
- name: mul10
output:
steps:
- mul10
cat ./pipelines/add10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-add10
spec:
steps:
- name: add10
output:
steps:
- add10
seldon pipeline load -f ./pipelines/add10.yaml
seldon pipeline load -f ./pipelines/mul10.yamlseldon pipeline status pipeline-add10 -w PipelineReady
seldon pipeline status pipeline-mul10 -w PipelineReady{"pipelineName":"pipeline-add10", "versions":[{"pipeline":{"name":"pipeline-add10", "uid":"cieov47l80lc739juklg", "version":1, "steps":[{"name":"add10"}], "output":{"steps":["add10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:05:04.460868091Z", "modelsReady":true}}]}
{"pipelineName":"pipeline-mul10", "versions":[{"pipeline":{"name":"pipeline-mul10", "uid":"cieov47l80lc739jukm0", "version":1, "steps":[{"name":"mul10"}], "output":{"steps":["mul10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:05:04.631980330Z", "modelsReady":true}}]}
seldon pipeline infer pipeline-add10 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
seldon pipeline infer pipeline-mul10 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
}
]
}
cat ./experiments/addmul10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: addmul10
spec:
default: pipeline-add10
resourceType: pipeline
candidates:
- name: pipeline-add10
weight: 50
- name: pipeline-mul10
weight: 50
seldon experiment start -f ./experiments/addmul10.yaml
seldon experiment status addmul10 -w | jq -M .
{
"experimentName": "addmul10",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon pipeline infer pipeline-add10 -i 50 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'Success: map[:add10_1::28 :mul10_1::22 :pipeline-add10.pipeline::28 :pipeline-mul10.pipeline::22]
seldon pipeline infer pipeline-add10 --show-headers --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'> /inference.GRPCInferenceService/ModelInfer HTTP/2
> Host: localhost:9000
> seldon-model:[pipeline-add10.pipeline]
< x-envoy-expected-rq-timeout-ms:[60000]
< x-request-id:[cieov8ofh5ss739277i0]
< date:[Thu, 29 Jun 2023 14:05:23 GMT]
< server:[envoy]
< content-type:[application/grpc]
< x-envoy-upstream-service-time:[6]
< x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
< x-forwarded-proto:[http]
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
seldon pipeline infer pipeline-add10 -s --show-headers --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'> /inference.GRPCInferenceService/ModelInfer HTTP/2
> Host: localhost:9000
> x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
> seldon-model:[pipeline-add10.pipeline]
< content-type:[application/grpc]
< x-forwarded-proto:[http]
< x-envoy-expected-rq-timeout-ms:[60000]
< x-seldon-route:[:add10_1: :pipeline-add10.pipeline: :pipeline-add10.pipeline:]
< x-request-id:[cieov90fh5ss739277ig]
< x-envoy-upstream-service-time:[7]
< date:[Thu, 29 Jun 2023 14:05:24 GMT]
< server:[envoy]
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
seldon pipeline infer pipeline-add10 -s -i 50 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'Success: map[:add10_1::50 :pipeline-add10.pipeline::150]
cat ./models/add20.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add20
spec:
storageUri: "gs://seldon-models/triton/add20"
requirements:
- triton
- python
seldon model load -f ./models/add20.yaml
{}
seldon model status add20 -w ModelAvailable
{}
cat ./experiments/add1020.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: add1020
spec:
default: add10
candidates:
- name: add10
weight: 50
- name: add20
weight: 50
seldon experiment start -f ./experiments/add1020.yaml
seldon experiment status add1020 -w | jq -M .
{
"experimentName": "add1020",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer add10 -i 50 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'Success: map[:add10_1::22 :add20_1::28]
seldon pipeline infer pipeline-add10 -i 100 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'Success: map[:add10_1::24 :add20_1::32 :mul10_1::44 :pipeline-add10.pipeline::56 :pipeline-mul10.pipeline::44]
seldon pipeline infer pipeline-add10 --show-headers --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'> /inference.GRPCInferenceService/ModelInfer HTTP/2
> Host: localhost:9000
> seldon-model:[pipeline-add10.pipeline]
< x-request-id:[cieovf0fh5ss739279u0]
< x-envoy-upstream-service-time:[5]
< x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
< date:[Thu, 29 Jun 2023 14:05:48 GMT]
< server:[envoy]
< content-type:[application/grpc]
< x-forwarded-proto:[http]
< x-envoy-expected-rq-timeout-ms:[60000]
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
seldon pipeline infer pipeline-add10 -s --show-headers --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'> /inference.GRPCInferenceService/ModelInfer HTTP/2
> Host: localhost:9000
> x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
> seldon-model:[pipeline-add10.pipeline]
< x-forwarded-proto:[http]
< x-envoy-expected-rq-timeout-ms:[60000]
< x-request-id:[cieovf8fh5ss739279ug]
< x-envoy-upstream-service-time:[6]
< date:[Thu, 29 Jun 2023 14:05:49 GMT]
< server:[envoy]
< content-type:[application/grpc]
< x-seldon-route:[:add10_1: :pipeline-add10.pipeline: :add20_1: :pipeline-add10.pipeline:]
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[21, 22, 23, 24]}}]}
seldon experiment stop addmul10
seldon experiment stop add1020
seldon pipeline unload pipeline-add10
seldon pipeline unload pipeline-mul10
seldon model unload add10
seldon model unload add20
seldon model unload mul10
cat ./models/sklearn1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
cat ./models/sklearn2.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris2
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml{}
{}
seldon model status iris -w ModelAvailable
seldon model status iris2 -w ModelAvailable{}
{}
cat ./experiments/sklearn-mirror.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: sklearn-mirror
spec:
default: iris
candidates:
- name: iris
weight: 100
mirror:
name: iris2
percent: 100
seldon experiment start -f ./experiments/sklearn-mirror.yaml
seldon experiment status sklearn-mirror -w | jq -M .
{
"experimentName": "sklearn-mirror",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
curl -s 0.0.0:9006/metrics | grep seldon_model_infer_total | grep iris2_1seldon_model_infer_total{code="200",method_type="rest",model="iris",model_internal="iris2_1",server="mlserver",server_replica="0"} 50
seldon experiment stop sklearn-mirror
seldon model unload iris
seldon model unload iris2
cat ./models/add10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
cat ./models/mul10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
seldon model load -f ./models/add10.yaml
seldon model load -f ./models/mul10.yaml{}
{}
seldon model status add10 -w ModelAvailable
seldon model status mul10 -w ModelAvailable{}
{}
cat ./pipelines/mul10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-mul10
spec:
steps:
- name: mul10
output:
steps:
- mul10
cat ./pipelines/add10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-add10
spec:
steps:
- name: add10
output:
steps:
- add10
seldon pipeline load -f ./pipelines/add10.yaml
seldon pipeline load -f ./pipelines/mul10.yaml
seldon pipeline status pipeline-add10 -w PipelineReady
seldon pipeline status pipeline-mul10 -w PipelineReady
{"pipelineName":"pipeline-add10", "versions":[{"pipeline":{"name":"pipeline-add10", "uid":"ciep072i8ufs73flaipg", "version":1, "steps":[{"name":"add10"}], "output":{"steps":["add10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:07:24.903503109Z", "modelsReady":true}}]}
{"pipelineName":"pipeline-mul10", "versions":[{"pipeline":{"name":"pipeline-mul10", "uid":"ciep072i8ufs73flaiq0", "version":1, "steps":[{"name":"mul10"}], "output":{"steps":["mul10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:07:25.082642153Z", "modelsReady":true}}]}
seldon pipeline infer pipeline-add10 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
seldon pipeline infer pipeline-mul10 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
cat ./experiments/addmul10-mirror.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: addmul10-mirror
spec:
default: pipeline-add10
resourceType: pipeline
candidates:
- name: pipeline-add10
weight: 100
mirror:
name: pipeline-mul10
percent: 100
seldon experiment start -f ./experiments/addmul10-mirror.yaml
seldon experiment status addmul10-mirror -w | jq -M .
{
"experimentName": "addmul10-mirror",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon pipeline infer pipeline-add10 -i 1 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep mul10_1
seldon_model_infer_total{code="OK",method_type="grpc",model="mul10",model_internal="mul10_1",server="triton",server_replica="0"} 2
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep add10_1
seldon_model_infer_total{code="OK",method_type="grpc",model="add10",model_internal="add10_1",server="triton",server_replica="0"} 2
seldon pipeline infer pipeline-add10 -i 1 \
'{"model_name":"add10","inputs":[{"name":"INPUT","data":[1,2,3,4],"datatype":"FP32","shape":[4]}]}'{
"model_name": "",
"outputs": [
{
"data": [
11,
12,
13,
14
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep mul10_1
seldon_model_infer_total{code="OK",method_type="grpc",model="mul10",model_internal="mul10_1",server="triton",server_replica="0"} 3
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep add10_1
seldon_model_infer_total{code="OK",method_type="grpc",model="add10",model_internal="add10_1",server="triton",server_replica="0"} 3
seldon pipeline inspect pipeline-mul10
seldon.default.model.mul10.inputs ciep0bofh5ss73dpdiq0 {"inputs":[{"name":"INPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[1, 2, 3, 4]}}]}
seldon.default.model.mul10.outputs ciep0bofh5ss73dpdiq0 {"modelName":"mul10_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
seldon.default.pipeline.pipeline-mul10.inputs ciep0bofh5ss73dpdiq0 {"inputs":[{"name":"INPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[1, 2, 3, 4]}}]}
seldon.default.pipeline.pipeline-mul10.outputs ciep0bofh5ss73dpdiq0 {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
seldon experiment stop addmul10-mirror
seldon pipeline unload pipeline-add10
seldon pipeline unload pipeline-mul10
seldon model unload add10
seldon model unload mul10
cat ../../models/hf-whisper.yaml
echo "---"
cat ../../models/hf-sentiment.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: whisper
spec:
storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
requirements:
- huggingface
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment
spec:
storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
requirements:
- huggingface
kubectl apply -f ../../models/hf-whisper.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/hf-sentiment.yaml -n ${NAMESPACE}
model.mlops.seldon.io/whisper created
model.mlops.seldon.io/sentiment created
kubectl wait --for condition=ready --timeout=300s model whisper -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model sentiment -n ${NAMESPACE}
model.mlops.seldon.io/whisper condition met
model.mlops.seldon.io/sentiment condition met
seldon model load -f ../../models/hf-whisper.yaml
seldon model load -f ../../models/hf-sentiment.yaml
{}
{}
seldon model status whisper -w ModelAvailable | jq -M .
seldon model status sentiment -w ModelAvailable | jq -M .
{}
{}
cat ./sentiment-input-transform/model.py | pygmentize
# Copyright (c) 2024 Seldon Technologies Ltd.
# Use of this software is governed BY
# (1) the license included in the LICENSE file or
# (2) if the license included in the LICENSE file is the Business Source License 1.1,
# the Change License after the Change Date as each is defined in accordance with the LICENSE file.
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
from mlserver.codecs.string import StringRequestCodec
from mlserver.logging import logger
import json
class SentimentInputTransformRuntime(MLModel):
async def load(self) -> bool:
return self.ready
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
logger.info("payload (input-transform): %s",payload)
res_list = self.decode_request(payload, default_codec=StringRequestCodec)
logger.info("res list (input-transform): %s",res_list)
texts = []
for res in res_list:
logger.info("decoded data (input-transform): %s", res)
#text = json.loads(res)
text = res
texts.append(text["text"])
logger.info("transformed data (input-transform): %s", texts)
response = StringRequestCodec.encode_response(
model_name="sentiment",
payload=texts
)
logger.info("response (input-transform): %s", response)
return response
cat ./sentiment-output-transform/model.py | pygmentize
# Copyright (c) 2024 Seldon Technologies Ltd.
# Use of this software is governed BY
# (1) the license included in the LICENSE file or
# (2) if the license included in the LICENSE file is the Business Source License 1.1,
# the Change License after the Change Date as each is defined in accordance with the LICENSE file.
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
from mlserver.codecs import StringCodec, Base64Codec, NumpyRequestCodec
from mlserver.codecs.string import StringRequestCodec
from mlserver.codecs.numpy import NumpyRequestCodec
import base64
from mlserver.logging import logger
import numpy as np
import json
class SentimentOutputTransformRuntime(MLModel):
async def load(self) -> bool:
return self.ready
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
logger.info("payload (output-transform): %s",payload)
res_list = self.decode_request(payload, default_codec=StringRequestCodec)
logger.info("res list (output-transform): %s",res_list)
scores = []
for res in res_list:
logger.debug("decoded data (output transform): %s",res)
#sentiment = json.loads(res)
sentiment = res
if sentiment["label"] == "POSITIVE":
scores.append(1)
else:
scores.append(0)
response = NumpyRequestCodec.encode_response(
model_name="sentiments",
payload=np.array(scores)
)
logger.info("response (output-transform): %s", response)
return response
cat ../../models/hf-sentiment-input-transform.yaml
echo "---"
cat ../../models/hf-sentiment-output-transform.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-input-transform
spec:
storageUri: "gs://seldon-models/scv2/examples/huggingface/mlserver_1.3.5/sentiment-input-transform"
requirements:
- mlserver
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-output-transform
spec:
storageUri: "gs://seldon-models/scv2/examples/huggingface/mlserver_1.3.5/sentiment-output-transform"
requirements:
- mlserver
- python
kubectl apply -f ../../models/hf-sentiment-input-transform.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/hf-sentiment-output-transform.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-input-transform created
model.mlops.seldon.io/sentiment-output-transform created
kubectl wait --for condition=ready --timeout=300s model sentiment-input-transform -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model sentiment-output-transform -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-input-transform condition met
model.mlops.seldon.io/sentiment-output-transform condition met
seldon model load -f ../../models/hf-sentiment-input-transform.yaml
seldon model load -f ../../models/hf-sentiment-output-transform.yaml
{}
{}
seldon model status sentiment-input-transform -w ModelAvailable | jq -M .
seldon model status sentiment-output-transform -w ModelAvailable | jq -M .
{}
{}
cat ../../pipelines/sentiment-explain.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: sentiment-explain
spec:
steps:
- name: sentiment
tensorMap:
sentiment-explain.inputs.predict: array_inputs
- name: sentiment-output-transform
inputs:
- sentiment
output:
steps:
- sentiment-output-transform
kubectl apply -f ../../models/sentiment-explain.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explain created
kubectl wait --for condition=ready --timeout=300s model sentiment-explain -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explain condition met
seldon pipeline load -f ../../pipelines/sentiment-explain.yaml
seldon pipeline status sentiment-explain -w PipelineReady | jq -M .
{
"pipelineName": "sentiment-explain",
"versions": [
{
"pipeline": {
"name": "sentiment-explain",
"uid": "cihuo3svgtec73bj6ncg",
"version": 2,
"steps": [
{
"name": "sentiment",
"tensorMap": {
"sentiment-explain.inputs.predict": "array_inputs"
}
},
{
"name": "sentiment-output-transform",
"inputs": [
"sentiment.outputs"
]
}
],
"output": {
"steps": [
"sentiment-output-transform.outputs"
]
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 2,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-07-04T09:53:19.250753906Z",
"modelsReady": true
}
}
]
}
cat ../../models/hf-sentiment-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
explainer:
type: anchor_text
pipelineRef: sentiment-explain
kubectl apply -f .../../pipelines/hf-sentiment-explainer.yaml -n ${NAMESPACE}
model.mlops.seldon.io/hf-sentiment-explainer created
kubectl wait --for condition=ready --timeout=300s model hf-sentiment-explainer -n ${NAMESPACE}
model.mlops.seldon.io/hf-sentiment-explainer condition met
seldon model load -f ../../models/hf-sentiment-explainer.yaml
{}
seldon model status sentiment-explainer -w ModelAvailable | jq -M .
Error: Model wait status timeout
cat ../../pipelines/speech-to-sentiment.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: speech-to-sentiment
spec:
steps:
- name: whisper
- name: sentiment
inputs:
- whisper
tensorMap:
whisper.outputs.output: args
- name: sentiment-input-transform
inputs:
- whisper
- name: sentiment-explainer
inputs:
- sentiment-input-transform
output:
steps:
- sentiment
- whisper
kubectl apply -f .../../pipelines/speech-to-sentiment.yaml -n ${NAMESPACE}
model.mlops.seldon.io/speech-to-sentiment created
kubectl wait --for condition=ready --timeout=300s model speech-to-sentiment -n ${NAMESPACE}
model.mlops.seldon.io/speech-to-sentiment condition met
seldon pipeline load -f ../../pipelines/speech-to-sentiment.yaml
seldon pipeline status speech-to-sentiment -w PipelineReady | jq -M .
{
"pipelineName": "speech-to-sentiment",
"versions": [
{
"pipeline": {
"name": "speech-to-sentiment",
"uid": "cihuqb4vgtec73bj6nd0",
"version": 2,
"steps": [
{
"name": "sentiment",
"inputs": [
"whisper.outputs"
],
"tensorMap": {
"whisper.outputs.output": "args"
}
},
{
"name": "sentiment-explainer",
"inputs": [
"sentiment-input-transform.outputs"
]
},
{
"name": "sentiment-input-transform",
"inputs": [
"whisper.outputs"
]
},
{
"name": "whisper"
}
],
"output": {
"steps": [
"sentiment.outputs",
"whisper.outputs"
]
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 2,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-07-04T09:58:04.277171896Z",
"modelsReady": true
}
}
]
}
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder
AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …
infer("speech-to-sentiment.pipeline")
cihuqm8fh5ss73der5gg
b'{"text": " Cambridge is a great place."}'
b'{"label": "POSITIVE", "score": 0.9998548030853271}'
while True:
base64Res = !seldon pipeline inspect speech-to-sentiment.sentiment-explainer.outputs --format json \
--request-id ${REQUEST_ID}
j = json.loads(base64Res[0])
if j["topics"][0]["msgs"] is not None:
expBase64 = j["topics"][0]["msgs"][0]["value"]["outputs"][0]["contents"]["bytesContents"][0]
expRaw = base64.b64decode(expBase64)
exp = json.loads(expRaw)
print("")
print("Explanation anchors:",exp["data"]["anchor"])
break
else:
print(".",end='')
time.sleep(1)
......
Explanation anchors: ['great']
kubectl delete -f ../../pipelines/speech-to-sentiment.yaml -n ${NAMESPACE}
kubectl delete -f ../../pipelines/sentiment-explain.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/hf-whisper.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/hf-sentiment.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/hf-sentiment-input-transform.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/hf-sentiment-output-transform.yaml -n ${NAMESPACE}
seldon pipeline unload speech-to-sentiment
seldon pipeline unload sentiment-explain
seldon model unload whisper
seldon model unload sentiment
seldon model unload sentiment-explainer
seldon model unload sentiment-output-transform
seldon model unload sentiment-input-transform
apiVersion: v1
kind: Node
metadata:
name: example-node # Replace with the actual node name
labels:
pool: infer-srv # Custom label
nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb-SHARED # Sample label from GPU discovery
cloud.google.com/gke-accelerator: nvidia-a100-80gb # GKE without NVIDIA GPU operator
cloud.google.com/gke-accelerator-count: "2" # Accelerator count
spec:
taints:
- effect: NoSchedule
key: seldon-gpu-srv
value: "true"
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-llm-local-gpu # <server name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
replicas: 1
serverConfig: mlserver # <reference Serverconfig CR>
extraCapabilities:
- model-on-gpu # Custom capability for matching Model to this server
podSpec:
nodeSelector: # Schedule pods only on nodes with these labels
pool: infer-srv
cloud.google.com/gke-accelerator: nvidia-a100-80gb # Example requesting specific GPU on GKE
# cloud.google.com/gke-accelerator-count: 2 # Optional GPU count
tolerations: # Allow scheduling on nodes with the matching taint
- effect: NoSchedule
key: seldon-gpu-srv
operator: Equal
value: "true"
containers: # Override settings from Serverconfig if needed
- name: mlserver
resources:
requests:
nvidia.com/gpu: 1 # Request a GPU for the mlserver container
cpu: 40
memory: 360Gi
ephemeral-storage: 290Gi
limits:
nvidia.com/gpu: 2 # Limit to 2 GPUs
cpu: 40
memory: 360Gi
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-llm-local-gpu # <server name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
podSpec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "pool"
operator: In
values:
- infer-srv
- key: "cloud.google.com/gke-accelerator"
operator: In
values:
- nvidia-a100-80gb
tolerations: # Allow mlserver-llm-local-gpu pods to be scheduled on nodes with the matching taint
- effect: NoSchedule
key: seldon-gpu-srv
operator: Equal
value: "true"
containers: # If needed, override settings from ServerConfig for this specific Server
- name: mlserver
resources:
requests:
nvidia.com/gpu: 1 # Request a GPU for the mlserver container
cpu: 40
memory: 360Gi
ephemeral-storage: 290Gi
limits:
nvidia.com/gpu: 2 # Limit to 2 GPUs
cpu: 40
memory: 360Gi
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-llm-local-gpu # <server name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
podSpec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "cloud.google.com/gke-accelerator-count"
operator: Gt # (greater than)
values: ["1"]
- key: "gpu.gpu-vendor.example/installed-memory"
operator: Gt
values: ["75000"]
- key: "feature.node.kubernetes.io/pci-10.present" # NFD Feature label
operator: In
values: ["true"] # (optional) only schedule on nodes with PCI device 10
tolerations: # Allow mlserver-llm-local-gpu pods to be scheduled on nodes with the matching taint
- effect: NoSchedule
key: seldon-gpu-srv
operator: Equal
value: "true"
containers: # If needed, override settings from ServerConfig for this specific Server
- name: mlserver
env:
... # Add your environment variables here
image: ... # Specify your container image here
resources:
requests:
nvidia.com/gpu: 1 # Request a GPU for the mlserver container
cpu: 40
memory: 360Gi
ephemeral-storage: 290Gi
limits:
nvidia.com/gpu: 2 # Limit to 2 GPUs
cpu: 40
memory: 360Gi
... # Other configurations can go here
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
name: mlserver-llm # <ServerConfig name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
podSpec:
nodeSelector: # Schedule pods only on nodes with these labels
pool: infer-srv
cloud.google.com/gke-accelerator: nvidia-a100-80gb # Example requesting specific GPU on GKE
# cloud.google.com/gke-accelerator-count: 2 # Optional GPU count
tolerations: # Allow scheduling on nodes with the matching taint
- effect: NoSchedule
key: seldon-gpu-srv
operator: Equal
value: "true"
containers: # Define the container specifications
- name: mlserver
env: # Environment variables (fill in as needed)
...
image: ... # Specify the container image
resources:
requests:
nvidia.com/gpu: 1 # Request a GPU for the mlserver container
cpu: 40
memory: 360Gi
ephemeral-storage: 290Gi
limits:
nvidia.com/gpu: 2 # Limit to 2 GPUs
cpu: 40
memory: 360Gi
... # Additional container configurations
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama3 # <model name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
requirements:
- model-on-gpu # requirement matching a Server capability
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-llm-local-gpu # <server name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
serverConfig: mlserver # <reference ServerConfig CR>
extraCapabilities:
- model-on-gpu # custom capability that can be used for matching Model to this server
# Other fields would go here
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
name: mlserver-llm # <ServerConfig name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
podSpec:
containers:
- name: agent # note the setting is applied to the agent container
env:
- name: SELDON_SERVER_CAPABILITIES
value: mlserver,alibi-detect,...,xgboost,model-on-gpu # add capability to the list
image: ...
# Other configurations go here
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama3 # <model name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
server: mlserver-llm-local-gpu # <reference Server CR>
requirements:
- model-on-gpu # requirement matching a Server capability
curl -k http://<INGRESS_IP>:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: iris" \
-d '{
"inputs": [
{
"name": "predict",
"shape": [1, 4],
"datatype": "FP32",
"data": [[1, 2, 3, 4]]
}
]
}' | jq
seldon model infer iris --inference-host <INGRESS_IP>:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
for i in {1..10}; do
curl -s -k <INGRESS_IP>:80/v2/models/experiment-sample/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: experiment-sample.experiment" \
-d '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP32","data":[[1,2,3,4]]}]}' \
| jq -r .model_name
done | sort | uniq -c
4 iris2_1
6 iris_1
seldon model infer --inference-host <INGRESS_IP>:80 -i 10 iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
Success: map[:iris2_1::4 :iris_1::6]
curl -k <INGRESS_IP>:80/v2/models/tfsimples/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimples.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' |jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimples --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
curl -k <INGRESS_IP>:80/v2/models/join/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: join.pipeline" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' |jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer join --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"
cat ./models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
kubectl create -f ./models/sklearn-iris-gs.yaml -n seldon-mesh
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/iris condition met
kubectl get model iris -n seldon-mesh -o jsonpath='{.status}' | jq -M .
{
"conditions": [
{
"lastTransitionTime": "2023-06-30T10:01:52Z",
"message": "ModelAvailable",
"status": "True",
"type": "ModelReady"
},
{
"lastTransitionTime": "2023-06-30T10:01:52Z",
"status": "True",
"type": "Ready"
}
],
"replicas": 1
}
Make a REST inference call:
{
"model_name": "iris_1",
"model_version": "1",
"id": "7fd401e1-3dce-46f5-9668-902aea652b89",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon model infer iris --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "np"
}
},
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl get server mlserver -n seldon-mesh -o jsonpath='{.status}' | jq -M .
{
"conditions": [
{
"lastTransitionTime": "2023-06-30T09:59:12Z",
"status": "True",
"type": "Ready"
},
{
"lastTransitionTime": "2023-06-30T09:59:12Z",
"reason": "StatefulSet replicas matches desired replicas",
"status": "True",
"type": "StatefulSetReady"
}
],
"loadedModels": 1,
"replicas": 1,
"selector": "seldon-server-name=mlserver"
}
kubectl delete -f ./models/sklearn-iris-gs.yaml -n seldon-mesh
model.mlops.seldon.io "iris" deleted
cat ./models/sklearn1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
cat ./models/sklearn2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris2
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
kubectl create -f ./models/sklearn1.yaml -n seldon-mesh
kubectl create -f ./models/sklearn2.yaml -n seldon-mesh
model.mlops.seldon.io/iris created
model.mlops.seldon.io/iris2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/iris condition met
model.mlops.seldon.io/iris2 condition met
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
kubectl create -f ./experiments/ab-default-model.yaml -n seldon-mesh
experiment.mlops.seldon.io/experiment-sample created
kubectl wait --for condition=ready --timeout=300s experiment --all -n seldon-mesh
experiment.mlops.seldon.io/experiment-sample condition met
kubectl delete -f ./experiments/ab-default-model.yaml -n seldon-mesh
kubectl delete -f ./models/sklearn1.yaml -n seldon-mesh
kubectl delete -f ./models/sklearn2.yaml -n seldon-mesh
experiment.mlops.seldon.io "experiment-sample" deleted
model.mlops.seldon.io "iris" deleted
model.mlops.seldon.io "iris2" deleted
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimples.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimples.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples condition met
kubectl delete -f ./pipelines/tfsimples.yaml -n seldon-mesh
pipeline.mlops.seldon.io "tfsimples" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
cat ./models/tfsimple3.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple3
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple3.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
model.mlops.seldon.io/tfsimple3 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
model.mlops.seldon.io/tfsimple3 condition met
cat ./pipelines/tfsimples-join.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: join
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple3
kubectl create -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
pipeline.mlops.seldon.io/join created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/join condition met
kubectl delete -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
pipeline.mlops.seldon.io "join" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple3.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
model.mlops.seldon.io "tfsimple3" deleted
cat ./models/income.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
requirements:
- sklearn
kubectl create -f ./models/income.yaml -n seldon-mesh
model.mlops.seldon.io/income created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/income condition met
kubectl get model income -n seldon-mesh -o jsonpath='{.status}' | jq -M .
{
"availableReplicas": 1,
"conditions": [
{
"lastTransitionTime": "2025-10-30T08:37:24Z",
"message": "ModelAvailable",
"status": "True",
"type": "ModelReady"
},
{
"lastTransitionTime": "2025-10-30T08:37:24Z",
"status": "True",
"type": "Ready"
}
],
"modelgwReady": "ModelAvailable(1/1 ready ) ",
"replicas": 1,
"selector": "server=mlserver"
}
seldon model infer income --inference-host <INGRESS_IP>:80 \
'{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
{
"model_name": "income_1",
"model_version": "1",
"id": "cdf32df2-eb42-42d8-9f66-404bcab95540",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
0
]
}
]
}
cat ./models/income-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
explainer:
type: anchor_tabular
modelRef: income
kubectl create -f ./models/income-explainer.yaml -n seldon-mesh
model.mlops.seldon.io/income-explainer created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/income condition met
model.mlops.seldon.io/income-explainer condition met
kubectl get model income-explainer -n seldon-mesh -o jsonpath='{.status}' | jq -M .
{
"conditions": [
{
"lastTransitionTime": "2023-06-30T10:03:07Z",
"message": "ModelAvailable",
"status": "True",
"type": "ModelReady"
},
{
"lastTransitionTime": "2023-06-30T10:03:07Z",
"status": "True",
"type": "Ready"
}
],
"replicas": 1
}
seldon model infer income-explainer --inference-host <INGRESS_IP>:80 \
'{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
{
"model_name": "income-explainer_1",
"model_version": "1",
"id": "3028a904-9bb3-42d7-bdb7-6e6993323ed7",
"parameters": {},
"outputs": [
{
"name": "explanation",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"data": [
"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.1\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9705882352941176, \"coverage\": 0.0699, \"raw\": {\"feature\": [3, 5], \"mean\": [0.8094218415417559, 0.9705882352941176], \"precision\": [0.8094218415417559, 0.9705882352941176], \"coverage\": [0.3036, 0.0699], \"examples\": [{\"covered_true\": [[23, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [44, 4, 1, 1, 8, 0, 4, 1, 0, 0, 40, 9], [60, 2, 5, 1, 5, 1, 4, 0, 0, 0, 25, 9], [52, 4, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [66, 6, 1, 1, 8, 0, 4, 1, 0, 0, 8, 9], [52, 4, 1, 1, 8, 0, 4, 1, 0, 0, 40, 9], [27, 4, 1, 1, 1, 1, 4, 1, 0, 0, 35, 9], [48, 4, 1, 1, 6, 0, 4, 1, 0, 0, 45, 9], [45, 6, 1, 1, 5, 0, 4, 1, 0, 0, 40, 9], [40, 2, 1, 1, 5, 4, 4, 0, 0, 0, 45, 9]], \"covered_false\": [[42, 6, 5, 1, 6, 0, 4, 1, 99999, 0, 80, 9], [29, 4, 1, 1, 8, 1, 4, 1, 0, 0, 50, 9], [49, 4, 1, 1, 8, 0, 4, 1, 0, 0, 50, 9], [34, 4, 5, 1, 8, 0, 4, 1, 0, 0, 40, 9], [38, 2, 1, 1, 5, 5, 4, 0, 7688, 0, 40, 9], [45, 7, 5, 1, 5, 0, 4, 1, 0, 0, 45, 9], [43, 4, 2, 1, 5, 0, 4, 1, 99999, 0, 55, 9], [47, 4, 5, 1, 6, 1, 4, 1, 27828, 0, 60, 9], [42, 6, 1, 1, 2, 0, 4, 1, 15024, 0, 60, 9], [56, 4, 1, 1, 6, 0, 2, 1, 7688, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[23, 4, 1, 1, 4, 3, 4, 1, 0, 0, 40, 9], [50, 2, 5, 1, 8, 3, 2, 1, 0, 0, 45, 9], [24, 4, 1, 1, 7, 3, 4, 0, 0, 0, 40, 3], [62, 4, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [22, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [44, 4, 1, 1, 1, 3, 4, 0, 0, 0, 40, 9], [46, 4, 1, 1, 4, 3, 4, 1, 0, 0, 40, 9], [44, 4, 1, 1, 2, 3, 4, 1, 0, 0, 40, 9], [25, 4, 5, 1, 5, 3, 4, 1, 0, 0, 35, 9], [32, 2, 5, 1, 5, 3, 4, 1, 0, 0, 50, 9]], \"covered_false\": [[57, 5, 5, 1, 6, 3, 4, 1, 99999, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7688, 0, 60, 9], [43, 2, 5, 1, 4, 3, 2, 0, 8614, 0, 47, 9], [56, 5, 2, 1, 5, 3, 4, 1, 99999, 0, 70, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
]
}
]
}
kubectl delete -f ./models/income.yaml -n seldon-mesh
kubectl delete -f ./models/income-explainer.yaml -n seldon-mesh
model.mlops.seldon.io "income" deleted
model.mlops.seldon.io "income-explainer" deleted
pip install mlserver
import os
os.environ["NAMESPACE"] = "seldon-mesh"
MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP
'172.18.255.2'
cat models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
cat pipelines/iris.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: iris-pipeline
spec:
steps:
- name: iris
output:
steps:
- iris
cat models/tfsimple1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
cat pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl apply -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
kubectl apply -f pipelines/iris.yaml -n ${NAMESPACE}
kubectl apply -f models/tfsimple1.yaml -n ${NAMESPACE}
kubectl apply -f pipelines/tfsimple.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
pipeline.mlops.seldon.io/iris-pipeline created
model.mlops.seldon.io/tfsimple1 created
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s pipelines --all -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
model.mlops.seldon.io/tfsimple1 condition met
pipeline.mlops.seldon.io/iris-pipeline condition met
pipeline.mlops.seldon.io/tfsimple condition met
seldon model infer iris --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq -M .
{
"model_name": "iris_1",
"model_version": "1",
"id": "25e1c1b9-a20f-456d-bdff-c75d5ba83b1f",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon pipeline infer iris-pipeline --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2
],
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
}
}
]
}
seldon model infer tfsimple1 --inference-host ${MESH_IP}:80 \
'{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
]
}
seldon pipeline infer tfsimple --inference-host ${MESH_IP}:80 \
'{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat batch-inputs/iris-input.txt | head -n 1 | jq -M .
{
"inputs": [
{
"name": "predict",
"data": [
0.38606369295833043,
0.006894049558299753,
0.6104082981607108,
0.3958954239450676
],
"datatype": "FP64",
"shape": [
1,
4
]
}
]
}
%%bash
mlserver infer -u ${MESH_IP} -m iris -i batch-inputs/iris-input.txt -o /tmp/iris-output.txt --workers 5
2023-06-30 11:05:32,389 [mlserver] INFO - server url: 172.18.255.2
2023-06-30 11:05:32,389 [mlserver] INFO - model name: iris
2023-06-30 11:05:32,389 [mlserver] INFO - request headers: {}
2023-06-30 11:05:32,389 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-06-30 11:05:32,389 [mlserver] INFO - output file path: /tmp/iris-output.txt
2023-06-30 11:05:32,389 [mlserver] INFO - workers: 5
2023-06-30 11:05:32,389 [mlserver] INFO - retries: 3
2023-06-30 11:05:32,389 [mlserver] INFO - batch interval: 0.0
2023-06-30 11:05:32,389 [mlserver] INFO - batch jitter: 0.0
2023-06-30 11:05:32,389 [mlserver] INFO - connection timeout: 60
2023-06-30 11:05:32,389 [mlserver] INFO - micro-batch size: 1
2023-06-30 11:05:32,503 [mlserver] INFO - Finalizer: processed instances: 100
2023-06-30 11:05:32,503 [mlserver] INFO - Total processed instances: 100
2023-06-30 11:05:32,503 [mlserver] INFO - Time taken: 0.11 seconds
%%bash
mlserver infer -u ${MESH_IP} -m iris-pipeline.pipeline -i batch-inputs/iris-input.txt -o /tmp/iris-pipeline-output.txt --workers 5
2023-06-30 11:05:35,857 [mlserver] INFO - server url: 172.18.255.2
2023-06-30 11:05:35,858 [mlserver] INFO - model name: iris-pipeline.pipeline
2023-06-30 11:05:35,858 [mlserver] INFO - request headers: {}
2023-06-30 11:05:35,858 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-06-30 11:05:35,858 [mlserver] INFO - output file path: /tmp/iris-pipeline-output.txt
2023-06-30 11:05:35,858 [mlserver] INFO - workers: 5
2023-06-30 11:05:35,858 [mlserver] INFO - retries: 3
2023-06-30 11:05:35,858 [mlserver] INFO - batch interval: 0.0
2023-06-30 11:05:35,858 [mlserver] INFO - batch jitter: 0.0
2023-06-30 11:05:35,858 [mlserver] INFO - connection timeout: 60
2023-06-30 11:05:35,858 [mlserver] INFO - micro-batch size: 1
2023-06-30 11:05:36,145 [mlserver] INFO - Finalizer: processed instances: 100
2023-06-30 11:05:36,146 [mlserver] INFO - Total processed instances: 100
2023-06-30 11:05:36,146 [mlserver] INFO - Time taken: 0.29 seconds
cat /tmp/iris-output.txt | head -n 1 | jq -M .
{
"model_name": "iris_1",
"model_version": "1",
"id": "46bdfca2-8805-4a72-b1ce-95e4f38c1a19",
"parameters": {
"inference_id": "46bdfca2-8805-4a72-b1ce-95e4f38c1a19",
"batch_index": 0
},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
1
]
}
]
}
cat /tmp/iris-pipeline-output.txt | head -n 1 | jq .
{
"model_name": "",
"id": "37e8c013-b348-41e8-89b9-fea86a4f9632",
"parameters": {
"batch_index": 1
},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
1
]
}
]
}
cat batch-inputs/tfsimple-input.txt | head -n 1 | jq -M .
{
"inputs": [
{
"name": "INPUT0",
"data": [
75,
39,
9,
44,
32,
97,
99,
40,
13,
27,
25,
36,
18,
77,
62,
60
],
"datatype": "INT32",
"shape": [
1,
16
]
},
{
"name": "INPUT1",
"data": [
39,
7,
14,
58,
13,
88,
98,
66,
97,
57,
49,
3,
49,
63,
37,
12
],
"datatype": "INT32",
"shape": [
1,
16
]
}
]
}
%%bash
mlserver infer -u ${MESH_IP} -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-output.txt --workers 5 -b
2023-06-30 11:22:52,662 [mlserver] INFO - server url: 172.18.255.2
2023-06-30 11:22:52,662 [mlserver] INFO - model name: tfsimple1
2023-06-30 11:22:52,662 [mlserver] INFO - request headers: {}
2023-06-30 11:22:52,662 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-06-30 11:22:52,662 [mlserver] INFO - output file path: /tmp/tfsimple-output.txt
2023-06-30 11:22:52,662 [mlserver] INFO - workers: 5
2023-06-30 11:22:52,662 [mlserver] INFO - retries: 3
2023-06-30 11:22:52,662 [mlserver] INFO - batch interval: 0.0
2023-06-30 11:22:52,662 [mlserver] INFO - batch jitter: 0.0
2023-06-30 11:22:52,662 [mlserver] INFO - connection timeout: 60
2023-06-30 11:22:52,662 [mlserver] INFO - micro-batch size: 1
2023-06-30 11:22:52,755 [mlserver] INFO - Finalizer: processed instances: 100
2023-06-30 11:22:52,755 [mlserver] INFO - Total processed instances: 100
2023-06-30 11:22:52,756 [mlserver] INFO - Time taken: 0.09 seconds
%%bash
mlserver infer -u ${MESH_IP} -m tfsimple.pipeline -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-pipeline-output.txt --workers 5
2023-06-30 11:22:54,065 [mlserver] INFO - server url: 172.18.255.2
2023-06-30 11:22:54,065 [mlserver] INFO - model name: tfsimple.pipeline
2023-06-30 11:22:54,065 [mlserver] INFO - request headers: {}
2023-06-30 11:22:54,065 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-06-30 11:22:54,065 [mlserver] INFO - output file path: /tmp/tfsimple-pipeline-output.txt
2023-06-30 11:22:54,065 [mlserver] INFO - workers: 5
2023-06-30 11:22:54,065 [mlserver] INFO - retries: 3
2023-06-30 11:22:54,065 [mlserver] INFO - batch interval: 0.0
2023-06-30 11:22:54,065 [mlserver] INFO - batch jitter: 0.0
2023-06-30 11:22:54,065 [mlserver] INFO - connection timeout: 60
2023-06-30 11:22:54,065 [mlserver] INFO - micro-batch size: 1
2023-06-30 11:22:54,302 [mlserver] INFO - Finalizer: processed instances: 100
2023-06-30 11:22:54,302 [mlserver] INFO - Total processed instances: 100
2023-06-30 11:22:54,303 [mlserver] INFO - Time taken: 0.24 seconds
cat /tmp/tfsimple-output.txt | head -n 1 | jq -M .
{
"model_name": "tfsimple1_1",
"model_version": "1",
"id": "19952272-b023-4079-aa08-f1880ded05e5",
"parameters": {
"inference_id": "19952272-b023-4079-aa08-f1880ded05e5",
"batch_index": 1
},
"outputs": [
{
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
115,
69,
97,
112,
73,
106,
58,
182,
114,
66,
64,
110,
100,
24,
22,
77
]
},
{
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
-77,
33,
25,
-52,
-49,
-88,
-48,
0,
-50,
26,
-44,
46,
-2,
18,
-6,
-47
]
}
]
}
cat /tmp/tfsimple-pipeline-output.txt | head -n 1 | jq -M .
{
"model_name": "",
"id": "46b05aab-07d9-414d-be96-c03d1863552a",
"parameters": {
"batch_index": 3
},
"outputs": [
{
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32",
"data": [
140,
164,
85,
58,
152,
76,
70,
56,
100,
141,
98,
181,
115,
177,
106,
193
]
},
{
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32",
"data": [
-10,
0,
-11,
-38,
2,
-36,
-52,
-8,
-18,
57,
94,
-5,
-27,
17,
58,
-1
]
}
]
}
kubectl delete -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
kubectl delete -f pipelines/iris.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
pipeline.mlops.seldon.io "iris-pipeline" deleted
kubectl delete -f models/tfsimple1.yaml -n ${NAMESPACE}
kubectl delete -f pipelines/tfsimple.yaml -n ${NAMESPACE}
model.mlops.seldon.io "tfsimple1" deleted
pipeline.mlops.seldon.io "tfsimple" deleted
This page describes a predict/inference API independent of any specific ML/DL framework and model server. These APIs are able to support both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by being able to operate seamlessly on platforms that have standardized around this API. This protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and ONNX Runtime Server. It is sometimes referred to by its old name "V2 Inference Protocol".
For an inference server to be compliant with this protocol the server must implement all APIs described below, except where an optional feature is explicitly noted. A compliant inference server may choose to implement either or both of the HTTP/REST API and the GRPC API.
The protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.
The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.
All strings in all contexts are case-sensitive.
For Seldon, a server must recognize the following URLs (a sketch of the standard paths follows the list). The versions portion of the URL is shown as optional to allow implementations that don’t support versioning, or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies).
Health:
Server Metadata:
Model Metadata:
Inference:
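As a sketch, these map to the following standard paths, where the bracketed version segment is the optional part referred to above:

Health: GET v2/health/live, GET v2/health/ready, GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
Server Metadata: GET v2
Model Metadata: GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
Inference: POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer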
A health request is made with an HTTP GET to a health endpoint. The HTTP response status code indicates a boolean result for the health request. A 200 status code indicates true and a 4xx status code indicates false. The HTTP response body should be empty. There are three health APIs.
The “server live” API indicates if the inference server is able to receive and respond to metadata and inference requests. The “server live” API can be used directly to implement the Kubernetes livenessProbe.
The “server ready” health API indicates if all the models are ready for inferencing. The “server ready” health API can be used directly to implement the Kubernetes readinessProbe.
The “model ready” health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies.
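For example, assuming a server reachable at localhost:8080 and a model named iris (host, port, and model name are illustrative), the three health APIs can be probed with curl; a 200 status code indicates a healthy result:

curl -i localhost:8080/v2/health/live
curl -i localhost:8080/v2/health/ready
curl -i localhost:8080/v2/models/iris/ready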
The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object.
A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response, is returned in the HTTP body.
“name” : A descriptive name for the server.
"version" : The server version.
“extensions” : The extensions supported by the server. Currently no standard extensions are defined. Individual inference servers may define and document their own extensions.
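An illustrative $metadata_server_response is sketched below; the server name and version shown are placeholders:

{
  "name": "mlserver",
  "version": "1.3.5",
  "extensions": []
}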
A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object.
“error” : The descriptive message for the error.
The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.
A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response, is returned in the HTTP body for every successful model metadata request.
“name” : The name of the model.
"versions" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don’t support versions. Optional for models that don’t allow a version to be explicitly requested.
“platform” : The framework/backend for the model. See Platforms.
“inputs” : The inputs required by the model.
“outputs” : The outputs produced by the model.
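An illustrative $metadata_model_response for a model with one input and one output is sketched below; the model name, platform, and tensor metadata are placeholders:

{
  "name": "iris",
  "versions": ["1"],
  "platform": "sklearn",
  "inputs": [
    {
      "name": "predict",
      "datatype": "FP32",
      "shape": [-1, 4]
    }
  ],
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [-1, 1]
    }
  ]
}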
Each model input and output tensor’s metadata is described with a $metadata_tensor object.
“name” : The name of the tensor.
"datatype" : The data-type of the tensor elements as defined in Tensor Data Types.
"shape" : The shape of the tensor. Variable-size dimensions are specified as -1.
A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object.
“error” : The descriptive message for the error.
An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object. In the corresponding response the HTTP body contains the Inference Response JSON Object or the Inference Response JSON Error Object. See Inference Request Examples for some example HTTP/REST requests and responses.
The inference request object, identified as $inference_request, is
required in the HTTP body of the POST request. The model name and
(optionally) version must be available in the URL. If a version is not
provided the server may choose a version based on its own policies or
return an error.
id : An identifier for this request. Optional, but if specified
this identifier must be returned in the response.
parameters : An object containing zero or more parameters for this
inference request expressed as key/value pairs. See Parameters for more information.
inputs : The input tensors. Each input is described using the $request_input schema defined in Request Input.
outputs : The output tensors requested for this inference. Each
requested output is described using the $request_output schema
defined in Request Output. Optional; if not specified, all outputs produced by the model will be returned using default $request_output settings.
Request Input
The $request_input JSON describes an input to the model. If the
input is batched, the shape and data must represent the full shape and
contents of the entire batch.
"name": The name of the input tensor.
"shape": The shape of the input tensor. Each dimension must be an
integer representable as an unsigned 64-bit integer value.
"datatype": The data-type of the input tensor elements as defined
in Tensor Data Types.
"parameters": An object containing zero or more parameters for this
input expressed as key/value pairs. See Parameters for more information.
“data”: The contents of the tensor. See Tensor Data for more information.
Request Output
The $request_output JSON is used to request which output tensors
should be returned from the model.
"name": The name of the output tensor.
"parameters": An object containing zero or more parameters for this
output expressed as key/value pairs. See Parameters
for more information.
A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response, is returned in the HTTP body.
"model_name": The name of the model used for inference.
"model_version": The specific model version used for
inference. Inference servers that do not implement versioning should
not provide this field in the response.
"id": The "id" identifier given in the request, if any.
"parameters": An object containing zero or more parameters for this
response expressed as key/value pairs. See Parameters for more information.
"outputs": The output tensors. Each output is described using the $response_output schema defined in Response Output.
Response Output
The $response_output JSON describes an output from the model. If the
output is batched, the shape and data represents the full shape of the
entire batch.
"name": The name of the output tensor.
"shape": The shape of the output tensor. Each dimension must be an
integer representable as an unsigned 64-bit integer value.
"datatype": The data-type of the output tensor elements as defined
in Tensor Data Types.
"parameters": An object containing zero or more parameters for this
input expressed as key/value pairs. See Parameters for more information.
“data”: The contents of the tensor. See Tensor Data for more information.
A failed inference request must be indicated by an HTTP error status
(typically 400). The HTTP body must contain the $inference_error_response object.
“error”: The descriptive message for the error.
The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object.
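A request of that shape is sketched below; the model name, tensor names, data values, and Content-Length are illustrative, and the single requested output is named output0 to match the description:

POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8080
Content-Type: application/json
Content-Length: <length>

{
  "id": "42",
  "inputs": [
    {
      "name": "input0",
      "shape": [2, 2],
      "datatype": "UINT32",
      "data": [1, 2, 3, 4]
    },
    {
      "name": "input1",
      "shape": [3],
      "datatype": "BOOL",
      "data": [true, false, true]
    }
  ],
  "outputs": [
    {
      "name": "output0"
    }
  ]
}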
For the above request the inference server must return the “output0” output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the following response would be returned.
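A corresponding response is sketched below; the id echoes the request, and the data values and Content-Length are illustrative:

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <length>

{
  "id": "42",
  "outputs": [
    {
      "name": "output0",
      "shape": [3, 2],
      "datatype": "FP32",
      "data": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1]
    }
  ]
}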
The $parameters JSON describes zero or more "name"/"value" pairs,
where the "name" is the name of the parameter and the "value" is a $string, $number, or $boolean.
Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.
Tensor data must be presented in row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation.
Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 is problematic to communicate explicitly since there is not a standard fp16 representation across backends nor typically the programmatic support to create the fp16 representation for a JSON number.
For example, the 2-dimensional matrix:
Can be represented in its natural format as:
Or in a flattened one-dimensional representation:
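As a small non-normative sketch using numpy, both representations can be produced from the same row-major array:

```python
import numpy as np

matrix = np.array([[1, 2], [4, 5]])

# Natural multi-dimensional representation for the "data" field.
natural = matrix.tolist()                       # [[1, 2], [4, 5]]

# Flattened one-dimensional, row-major ("linear") representation.
flattened = matrix.flatten(order="C").tolist()  # [1, 2, 4, 5]
```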
The GRPC API closely follows the concepts defined in the HTTP/REST API. A compliant server must implement the health, metadata, and inference APIs described in this section.
All strings in all contexts are case-sensitive.
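As one illustrative option (an assumption, not something mandated by this protocol), a Python client such as the tritonclient package can call a gRPC endpoint that implements these APIs; the URL and model name below are placeholders.

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder address of a gRPC endpoint implementing the inference APIs.
client = grpcclient.InferenceServerClient(url="localhost:9000")

# Describe and populate one input tensor.
inp = grpcclient.InferInput("input0", [2, 2], "UINT32")
inp.set_data_from_numpy(np.array([[1, 2], [3, 4]], dtype=np.uint32))

# Run ModelInfer and read back an output tensor as a numpy array.
result = client.infer(model_name="mymodel", inputs=[inp])
print(result.as_numpy("output0"))
```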
The GRPC definition of the service is:
A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure.
The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. The request and response messages for ServerLive are:
The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are:
The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are:
The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are:
The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are:
The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are:
The Parameters message describes a "name"/"value" pair, where the
"name" is the name of the parameter and the "value" is a boolean,
integer, or string corresponding to the parameter.
Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.
In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements.
Using a "raw" representation of tensors withModelInferRequest::raw_input_contents andModelInferResponse::raw_output_contents will typically allow higher
performance due to the way protobuf allocation and reuse interacts
with GRPC. For example, see issue here.
An alternative to the "raw" representation is to use
InferTensorContents to represent the tensor data in a format that
matches the tensor's data type.
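For example (a non-normative sketch using numpy), the raw bytes for a tensor are simply its elements serialized in flattened, row-major order:

```python
import numpy as np

tensor = np.array([[1.0, 1.1], [2.0, 2.1]], dtype=np.float32)

# Row-major ("C" order) bytes suitable for raw_input_contents.
raw_fp32 = tensor.tobytes(order="C")

# FP16 has no repeated field in InferTensorContents, so it must be sent as raw bytes.
raw_fp16 = tensor.astype(np.float16).tobytes(order="C")
```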
A platform is a string indicating a DL/ML framework or
backend. Platform is returned as part of the response to a Model Metadata request but is information only. The
proposed inference APIs are generic relative to the DL/ML framework
used by a model and so a client does not need to know the platform of
a given model to use the API. Platform names use the format "<project>_<format>". The following platform names are allowed:
tensorrt_plan: A TensorRT model encoded as a serialized engine or “plan”.
tensorflow_graphdef: A TensorFlow model encoded as a GraphDef.
tensorflow_savedmodel: A TensorFlow model encoded as a SavedModel.
onnx_onnxv1: An ONNX model encoded for ONNX Runtime.
pytorch_torchscript: A PyTorch model encoded as TorchScript.
mxnet_mxnet: An MXNet model.
caffe2_netdef: A Caffe2 model encoded as a NetDef.
Tensor data types are shown in the following table along with the size of each type, in bytes.
| Data Type | Size (bytes) |
| --------- | ------------ |
| BOOL | 1 |
| UINT8 | 1 |
| UINT16 | 2 |
| UINT32 | 4 |
| UINT64 | 8 |
| INT8 | 1 |
| INT16 | 2 |
| INT32 | 4 |
| INT64 | 8 |
| FP16 | 2 |
| FP32 | 4 |
| FP64 | 8 |
| BYTES | Variable (max 2^32) |
This document is based on the original KServe protocol specification, created during the lifetime of the KFServing project in Kubeflow by its various contributors, including Seldon, NVIDIA, IBM, Bloomberg, and others.
The setup below also illustrates the use of Kafka-specific prefixes for topics and consumer group IDs, which provide isolation when the Kafka cluster is shared with other applications and you want to enforce constraints. This is not strictly needed in this example, as Kafka is installed here just for Seldon.
GET v2/health/live
GET v2/health/ready
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
GET v2
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
$metadata_server_response =
{
"name" : $string,
"version" : $string,
"extensions" : [ $string, ... ]
}
$metadata_server_error_response =
{
"error": $string
}
$metadata_model_response =
{
"name" : $string,
"versions" : [ $string, ... ] #optional,
"platform" : $string,
"inputs" : [ $metadata_tensor, ... ],
"outputs" : [ $metadata_tensor, ... ]
}
$metadata_tensor =
{
"name" : $string,
"datatype" : $string,
"shape" : [ $number, ... ]
}
$metadata_model_error_response =
{
"error": $string
}
$inference_request =
{
"id" : $string #optional,
"parameters" : $parameters #optional,
"inputs" : [ $request_input, ... ],
"outputs" : [ $request_output, ... ] #optional
}
$request_input =
{
"name" : $string,
"shape" : [ $number, ... ],
"datatype" : $string,
"parameters" : $parameters #optional,
"data" : $tensor_data
}
$request_output =
{
"name" : $string,
"parameters" : $parameters #optional,
}
$inference_response =
{
"model_name" : $string,
"model_version" : $string #optional,
"id" : $string,
"parameters" : $parameters #optional,
"outputs" : [ $response_output, ... ]
}
$response_output =
{
"name" : $string,
"shape" : [ $number, ... ],
"datatype" : $string,
"parameters" : $parameters #optional,
"data" : $tensor_data
}
$inference_error_response =
{
"error": <error message string>
}
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
"id" : "42",
"inputs" : [
{
"name" : "input0",
"shape" : [ 2, 2 ],
"datatype" : "UINT32",
"data" : [ 1, 2, 3, 4 ]
},
{
"name" : "input1",
"shape" : [ 3 ],
"datatype" : "BOOL",
"data" : [ true ]
}
],
"outputs" : [
{
"name" : "output0"
}
]
}
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <yy>
{
"id" : "42"
"outputs" : [
{
"name" : "output0",
"shape" : [ 3, 2 ],
"datatype" : "FP32",
"data" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ]
}
]
}
$parameters =
{
$parameter, ...
}
$parameter = $string : $string | $number | $boolean
[ 1 2
  4 5 ]
"data" : [ [ 1, 2 ], [ 4, 5 ] ]
"data" : [ 1, 2, 4, 5 ]
//
// Inference Server GRPC endpoints.
//
service GRPCInferenceService
{
// Check liveness of the inference server.
rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {}
// Check readiness of the inference server.
rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {}
// Check readiness of a model in the inference server.
rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {}
// Get server metadata.
rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {}
// Get model metadata.
rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {}
// Perform inference using a specific model.
rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {}
}
message ServerLiveRequest {}
message ServerLiveResponse
{
// True if the inference server is live, false if not live.
bool live = 1;
}
message ServerReadyRequest {}
message ServerReadyResponse
{
// True if the inference server is ready, false if not ready.
bool ready = 1;
}
message ModelReadyRequest
{
// The name of the model to check for readiness.
string name = 1;
// The version of the model to check for readiness. If not given the
// server will choose a version based on the model and internal policy.
string version = 2;
}
message ModelReadyResponse
{
// True if the model is ready, false if not ready.
bool ready = 1;
}
message ServerMetadataRequest {}
message ServerMetadataResponse
{
// The server name.
string name = 1;
// The server version.
string version = 2;
// The extensions supported by the server.
repeated string extensions = 3;
}
message ModelMetadataRequest
{
// The name of the model.
string name = 1;
// The version of the model to check for readiness. If not given the
// server will choose a version based on the model and internal policy.
string version = 2;
}
message ModelMetadataResponse
{
// Metadata for a tensor.
message TensorMetadata
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape. A variable-size dimension is represented
// by a -1 value.
repeated int64 shape = 3;
}
// The model name.
string name = 1;
// The versions of the model available on the server.
repeated string versions = 2;
// The model's platform. See Platforms.
string platform = 3;
// The model's inputs.
repeated TensorMetadata inputs = 4;
// The model's outputs.
repeated TensorMetadata outputs = 5;
}
message ModelInferRequest
{
// An input tensor for an inference request.
message InferInputTensor
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape.
repeated int64 shape = 3;
// Optional inference input tensor parameters.
map<string, InferParameter> parameters = 4;
// The tensor contents using a data-type format. This field must
// not be specified if "raw" tensor contents are being used for
// the inference request.
InferTensorContents contents = 5;
}
// An output tensor requested for an inference request.
message InferRequestedOutputTensor
{
// The tensor name.
string name = 1;
// Optional requested output tensor parameters.
map<string, InferParameter> parameters = 2;
}
// The name of the model to use for inferencing.
string model_name = 1;
// The version of the model to use for inference. If not given the
// server will choose a version based on the model and internal policy.
string model_version = 2;
// Optional identifier for the request. If specified will be
// returned in the response.
string id = 3;
// Optional inference parameters.
map<string, InferParameter> parameters = 4;
// The input tensors for the inference.
repeated InferInputTensor inputs = 5;
// The requested output tensors for the inference. Optional, if not
// specified all outputs produced by the model will be returned.
repeated InferRequestedOutputTensor outputs = 6;
// The data contained in an input tensor can be represented in "raw"
// bytes form or in the repeated type that matches the tensor's data
// type. To use the raw representation 'raw_input_contents' must be
// initialized with data for each tensor in the same order as
// 'inputs'. For each tensor, the size of this content must match
// what is expected by the tensor's shape and data type. The raw
// data must be the flattened, one-dimensional, row-major order of
// the tensor elements without any stride or padding between the
// elements. Note that the FP16 data type must be represented as raw
// content as there is no specific data type for a 16-bit float
// type.
//
// If this field is specified then InferInputTensor::contents must
// not be specified for any input tensor.
repeated bytes raw_input_contents = 7;
}
message ModelInferResponse
{
// An output tensor returned for an inference request.
message InferOutputTensor
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape.
repeated int64 shape = 3;
// Optional output tensor parameters.
map<string, InferParameter> parameters = 4;
// The tensor contents using a data-type format. This field must
// not be specified if "raw" tensor contents are being used for
// the inference response.
InferTensorContents contents = 5;
}
// The name of the model used for inference.
string model_name = 1;
// The version of the model used for inference.
string model_version = 2;
// The id of the inference request if one was specified.
string id = 3;
// Optional inference response parameters.
map<string, InferParameter> parameters = 4;
// The output tensors holding inference results.
repeated InferOutputTensor outputs = 5;
// The data contained in an output tensor can be represented in
// "raw" bytes form or in the repeated type that matches the
// tensor's data type. To use the raw representation 'raw_output_contents'
// must be initialized with data for each tensor in the same order as
// 'outputs'. For each tensor, the size of this content must match
// what is expected by the tensor's shape and data type. The raw
// data must be the flattened, one-dimensional, row-major order of
// the tensor elements without any stride or padding between the
// elements. Note that the FP16 data type must be represented as raw
// content as there is no specific data type for a 16-bit float
// type.
//
// If this field is specified then InferOutputTensor::contents must
// not be specified for any output tensor.
repeated bytes raw_output_contents = 6;
}
//
// An inference parameter value.
//
message InferParameter
{
// The parameter value can be a string, an int64, a boolean
// or a message specific to a predefined parameter.
oneof parameter_choice
{
// A boolean parameter value.
bool bool_param = 1;
// An int64 parameter value.
int64 int64_param = 2;
// A string parameter value.
string string_param = 3;
}
}
//
// The data contained in a tensor represented by the repeated type
// that matches the tensor's data type. Protobuf oneof is not used
// because oneofs cannot contain repeated fields.
//
message InferTensorContents
{
// Representation for BOOL data type. The size must match what is
// expected by the tensor's shape. The contents must be the flattened,
// one-dimensional, row-major order of the tensor elements.
repeated bool bool_contents = 1;
// Representation for INT8, INT16, and INT32 data types. The size
// must match what is expected by the tensor's shape. The contents
// must be the flattened, one-dimensional, row-major order of the
// tensor elements.
repeated int32 int_contents = 2;
// Representation for INT64 data types. The size must match what
// is expected by the tensor's shape. The contents must be the
// flattened, one-dimensional, row-major order of the tensor elements.
repeated int64 int64_contents = 3;
// Representation for UINT8, UINT16, and UINT32 data types. The size
// must match what is expected by the tensor's shape. The contents
// must be the flattened, one-dimensional, row-major order of the
// tensor elements.
repeated uint32 uint_contents = 4;
// Representation for UINT64 data types. The size must match what
// is expected by the tensor's shape. The contents must be the
// flattened, one-dimensional, row-major order of the tensor elements.
repeated uint64 uint64_contents = 5;
// Representation for FP32 data type. The size must match what is
// expected by the tensor's shape. The contents must be the flattened,
// one-dimensional, row-major order of the tensor elements.
repeated float fp32_contents = 6;
// Representation for FP64 data type. The size must match what is
// expected by the tensor's shape. The contents must be the flattened,
// one-dimensional, row-major order of the tensor elements.
repeated double fp64_contents = 7;
// Representation for BYTES data type. The size must match what is
// expected by the tensor's shape. The contents must be the flattened,
// one-dimensional, row-major order of the tensor elements.
repeated bytes bytes_contents = 8;
}
If you have installed Kafka via the ansible playbook setup-ecosystem, then you can use the following command to see the consumer group ids, which reflect the settings we created.
We can similarly show the topics that have been created.
curl -k http://${MESH_IP_NS1}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "predict",
"datatype": "FP32",
"shape": [1,4],
"data": [[1,2,3,4]]
}
]
}' | jq -M .
seldon model infer iris --inference-host ${MESH_IP_NS1}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
curl -k http://${MESH_IP_NS1}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"model_name": "iris",
"inputs": [
{
"name": "input",
"datatype": "FP32",
"shape": [1,4],
"data": [1,2,3,4]
}
]
}' | jq -M .
seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP_NS1}:80 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
curl -k http://${MESH_IP_NS2}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "predict",
"datatype": "FP32",
"shape": [1,4],
"data": [[1,2,3,4]]
}
]
}' | jq -M .
seldon model infer iris --inference-host ${MESH_IP_NS2}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
curl -k http://${MESH_IP_NS2}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"model_name": "iris",
"inputs": [
{
"name": "input",
"datatype": "FP32",
"shape": [1,4],
"data": [1,2,3,4]
}
]
}' | jq -M .
seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP_NS2}:80 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .curl -k http://${MESH_IP_NS1}:80/v2/models/tfsimples/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimples.pipeline" \
-H "Content-Type: application/json" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
seldon pipeline infer tfsimples --inference-mode grpc --inference-host ${MESH_IP_NS1}:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .curl -k http://${MESH_IP_NS2}:80/v2/models/tfsimples/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimples.pipeline" \
-H "Content-Type: application/json" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
seldon pipeline infer tfsimples --inference-mode grpc --inference-host ${MESH_IP_NS2}:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
helm upgrade --install seldon-core-v2-crds ../k8s/helm-charts/seldon-core-v2-crds -n seldon-mesh
Release "seldon-core-v2-crds" does not exist. Installing it now.
NAME: seldon-core-v2-crds
LAST DEPLOYED: Tue Aug 15 11:01:03 2023
NAMESPACE: seldon-mesh
STATUS: deployed
REVISION: 1
TEST SUITE: None
helm upgrade --install seldon-v2 ../k8s/helm-charts/seldon-core-v2-setup/ -n seldon-mesh \
--set controller.clusterwide=true \
--set kafka.topicPrefix=myorg \
--set kafka.consumerGroupIdPrefix=myorg
Release "seldon-v2" does not exist. Installing it now.
NAME: seldon-v2
LAST DEPLOYED: Tue Aug 15 11:01:07 2023
NAMESPACE: seldon-mesh
STATUS: deployed
REVISION: 1
TEST SUITE: None
kubectl create namespace ns1
kubectl create namespace ns2
namespace/ns1 created
namespace/ns2 created
helm install seldon-v2-runtime ../k8s/helm-charts/seldon-core-v2-runtime -n ns1 --wait
NAME: seldon-v2-runtime
LAST DEPLOYED: Tue Aug 15 11:01:11 2023
NAMESPACE: ns1
STATUS: deployed
REVISION: 1
TEST SUITE: None
helm install seldon-v2-servers ../k8s/helm-charts/seldon-core-v2-servers -n ns1 --wait
NAME: seldon-v2-servers
LAST DEPLOYED: Tue Aug 15 10:47:31 2023
NAMESPACE: ns1
STATUS: deployed
REVISION: 1
TEST SUITE: None
helm install seldon-v2-runtime ../k8s/helm-charts/seldon-core-v2-runtime -n ns2 --wait
NAME: seldon-v2-runtime
LAST DEPLOYED: Tue Aug 15 10:53:12 2023
NAMESPACE: ns2
STATUS: deployed
REVISION: 1
TEST SUITE: None
helm install seldon-v2-servers ../k8s/helm-charts/seldon-core-v2-servers -n ns2 --wait
NAME: seldon-v2-servers
LAST DEPLOYED: Tue Aug 15 10:53:28 2023
NAMESPACE: ns2
STATUS: deployed
REVISION: 1
TEST SUITE: None
kubectl wait --for condition=ready --timeout=300s server --all -n ns1
server.mlops.seldon.io/mlserver condition met
server.mlops.seldon.io/triton condition met
kubectl wait --for condition=ready --timeout=300s server --all -n ns2
server.mlops.seldon.io/mlserver condition met
server.mlops.seldon.io/triton condition met
MESH_IP=!kubectl get svc seldon-mesh -n ns1 -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP_NS1=MESH_IP[0]
import os
os.environ['MESH_IP_NS1'] = MESH_IP_NS1
MESH_IP_NS1
'172.18.255.2'
MESH_IP=!kubectl get svc seldon-mesh -n ns2 -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP_NS2=MESH_IP[0]
import os
os.environ['MESH_IP_NS2'] = MESH_IP_NS2
MESH_IP_NS2
'172.18.255.4'
cat ./models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
kubectl create -f ./models/sklearn-iris-gs.yaml -n ns1
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ns1
model.mlops.seldon.io/iris condition met
{
"model_name": "iris_1",
"model_version": "1",
"id": "3ca1757c-df02-4e57-87c1-38311bcc5943",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "np"
}
},
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl create -f ./models/sklearn-iris-gs.yaml -n ns2
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ns2
model.mlops.seldon.io/iris condition met
{
"model_name": "iris_1",
"model_version": "1",
"id": "f706a23e-775f-4765-bd18-2e98d83bf7d5",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "np"
}
},
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl delete -f ./models/sklearn-iris-gs.yaml -n ns1
kubectl delete -f ./models/sklearn-iris-gs.yaml -n ns2
model.mlops.seldon.io "iris" deleted
model.mlops.seldon.io "iris" deleted
kubectl create -f ./models/tfsimple1.yaml -n ns1
kubectl create -f ./models/tfsimple2.yaml -n ns1
kubectl create -f ./models/tfsimple1.yaml -n ns2
kubectl create -f ./models/tfsimple2.yaml -n ns2
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n ns1
kubectl wait --for condition=ready --timeout=300s model --all -n ns2
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
kubectl create -f ./pipelines/tfsimples.yaml -n ns1
kubectl create -f ./pipelines/tfsimples.yaml -n ns2
pipeline.mlops.seldon.io/tfsimples created
pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n ns1
kubectl wait --for condition=ready --timeout=300s pipeline --all -n ns2
pipeline.mlops.seldon.io/tfsimples condition met
pipeline.mlops.seldon.io/tfsimples condition met
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
kubectl exec seldon-kafka-0 -n seldon-mesh -- bin/kafka-consumer-groups.sh --list --bootstrap-server localhost:9092
myorg-ns2-seldon-pipelinegateway-dfd61b49-4bb9-4684-adce-0b7cc215d3af
myorg-ns2-seldon-modelgateway-17
myorg-ns1-seldon-pipelinegateway-d4fc83e6-29cb-442e-90cd-92a389961cfe
myorg-ns2-seldon-modelgateway-60
myorg-ns2-seldon-dataflow-73d465744b7b1b5be20e88d6245e50bd
myorg-ns1-seldon-modelgateway-60
myorg-ns1-seldon-modelgateway-17
myorg-ns1-seldon-dataflow-f563e04e093caa20c03e6eced084331b
kubectl exec seldon-kafka-0 -n seldon-mesh -- bin/kafka-topics.sh --bootstrap-server=localhost:9092 --list
__consumer_offsets
myorg.ns1.errors.errors
myorg.ns1.model.iris.inputs
myorg.ns1.model.iris.outputs
myorg.ns1.model.tfsimple1.inputs
myorg.ns1.model.tfsimple1.outputs
myorg.ns1.model.tfsimple2.inputs
myorg.ns1.model.tfsimple2.outputs
myorg.ns1.pipeline.tfsimples.inputs
myorg.ns1.pipeline.tfsimples.outputs
myorg.ns2.errors.errors
myorg.ns2.model.iris.inputs
myorg.ns2.model.iris.outputs
myorg.ns2.model.tfsimple1.inputs
myorg.ns2.model.tfsimple1.outputs
myorg.ns2.model.tfsimple2.inputs
myorg.ns2.model.tfsimple2.outputs
myorg.ns2.pipeline.tfsimples.inputs
myorg.ns2.pipeline.tfsimples.outputs
kubectl delete -f ./pipelines/tfsimples.yaml -n ns1
kubectl delete -f ./pipelines/tfsimples.yaml -n ns2
pipeline.mlops.seldon.io "tfsimples" deleted
pipeline.mlops.seldon.io "tfsimples" deleted
kubectl delete -f ./models/tfsimple1.yaml -n ns1
kubectl delete -f ./models/tfsimple2.yaml -n ns1
kubectl delete -f ./models/tfsimple1.yaml -n ns2
kubectl delete -f ./models/tfsimple2.yaml -n ns2
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
helm delete seldon-v2-servers -n ns1 --wait
helm delete seldon-v2-servers -n ns2 --wait
release "seldon-v2-servers" uninstalled
release "seldon-v2-servers" uninstalled
helm delete seldon-v2-runtime -n ns1 --wait
helm delete seldon-v2-runtime -n ns2 --wait
release "seldon-v2-runtime" uninstalled
release "seldon-v2-runtime" uninstalled
helm delete seldon-v2 -n seldon-mesh --wait
release "seldon-v2" uninstalled
helm delete seldon-core-v2-crds -n seldon-mesh
release "seldon-core-v2-crds" uninstalled
kubectl delete namespace ns1
kubectl delete namespace ns2
namespace "ns1" deleted
namespace "ns2" deleted


These examples illustrate a series of Pipelines showing different ways of combining flows of data and conditional logic. We assume you have Seldon Core 2 running locally.
Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.
Ensure that you are performing these steps in the directory where you have downloaded the .
Get the IP address of the Seldon Core 2 instance running with Istio:
gs://seldon-models/triton/simple is an example Triton TensorFlow model that takes two inputs, INPUT0
and INPUT1, and adds them to produce OUTPUT0; it also subtracts INPUT1 from INPUT0 to produce OUTPUT1.
See
for the original source code and license.
Other models can be found at https://github.com/SeldonIO/triton-python-examples
Chain the output of one model into the next. Also shows changing the tensor names via tensorMap to conform to the expected input tensor names of the second model.
This pipeline chains the output of tfsimple1 into tfsimple2. As these models have compatible shape and data type this can be done. However, the output tensor names from tfsimple1 need to be renamed to match the input tensor names for tfsimple2. You can do this with the tensorMap feature.
The output of the Pipeline is the output from tfsimple2.
You can use the Seldon CLI pipeline inspect feature to look at the data for all steps of the pipeline for the last data item passed through the pipeline (the default). This can be useful for debugging.
Next, get the output as JSON and use the jq tool to extract just one value.
Chain the output of one model into the next. Shows using the input and outputs and combining.
Join two flows of data from two models as input to a third model. This shows how individual flows of data can be combined.
In this pipeline, the input to tfsimple3 is a join of one output tensor from each of the two previous models, tfsimple1 and tfsimple2. You need to use the tensorMap feature to rename each output tensor to one of the expected input tensors of the tfsimple3 model.
The outputs are the sequence "2,4,6..." which conforms to the logic of this model (addition and subtraction) when fed the output of the first two models.
Shows conditional data flows - one of two models is run based on output tensors from first.
Here we assume the conditional model can output two tensors, OUTPUT0 and OUTPUT1, but only outputs OUTPUT0 if the CHOICE input tensor is set to 0; otherwise it outputs OUTPUT1. By this means only one of the two downstream models will receive data and run. The output step does an any join from both models, and whichever data appears first will be sent as the pipeline output. As only one of the two models add10 and mul10 runs in this case, we will receive its output. A sketch of what such a conditional model might look like is shown after the notes on the two cases below.
The mul10 model runs as the CHOICE tensor is set to 0.
The add10 model will run as the CHOICE tensor is not set to zero.
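The implementation of the conditional model is not shown on this page, but as a rough sketch (an assumption for illustration, not the model shipped in the example artifacts), a Triton Python-backend model with this behaviour could look like the following, emitting OUTPUT0 when CHOICE is 0 and OUTPUT1 otherwise:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            choice = pb_utils.get_input_tensor_by_name(request, "CHOICE").as_numpy()
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            input1 = pb_utils.get_input_tensor_by_name(request, "INPUT1").as_numpy()
            if choice[0] == 0:
                # Only OUTPUT0 is produced, so only the mul10 step receives data.
                outputs = [pb_utils.Tensor("OUTPUT0", input0)]
            else:
                # Only OUTPUT1 is produced, so only the add10 step receives data.
                outputs = [pb_utils.Tensor("OUTPUT1", input1)]
            responses.append(pb_utils.InferenceResponse(output_tensors=outputs))
        return responses
```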
Access to individual tensors in pipeline inputs
This pipeline shows how we can access pipeline inputs INPUT0 and INPUT1 from different steps.
Shows how joins can be used for triggers as well.
Here we require a tensor named ok1 or ok2 to exist on the pipeline inputs to run the mul10 model, but require the tensor ok3 to exist on the pipeline inputs to run the add10 model. The logic on mul10 is handled by a trigger join of any, meaning either of those inputs can exist to satisfy the trigger join.
Trigger the first join.
Now, you can trigger the second join.
ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimples.yaml
kubectl create -f ./pipelines/tfsimples.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimples/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimples.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' |jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimples --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
seldon pipeline inspect tfsimples
seldon.default.model.tfsimple1.inputs ciep298fh5ss73dpdir0 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.model.tfsimple1.outputs ciep298fh5ss73dpdir0 {"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.model.tfsimple2.inputs ciep298fh5ss73dpdir0 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.model.tfsimple2.outputs ciep298fh5ss73dpdir0 {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimples.inputs ciep298fh5ss73dpdir0 {"modelName":"tfsimples", "inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.pipeline.tfsimples.outputs ciep298fh5ss73dpdir0 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon pipeline inspect tfsimples --format json | jq -M .topics[0].msgs[0].value
{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
]
}
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
]
}
}
]
}
kubectl delete -f ./pipelines/tfsimples.yaml -n seldon-mesh
pipeline.mlops.seldon.io "tfsimples" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimples-input.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples-input
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1.inputs.INPUT0
- tfsimple1.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimples-input.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples-input created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples-input condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimples-input/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimples-input.pipeline" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimples-input \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
kubectl delete -f ./pipelines/tfsimples-input.yaml -n seldon-mesh
pipeline.mlops.seldon.io "tfsimples-input" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
cat ./models/tfsimple1.yaml
echo "---"
cat ./models/tfsimple2.yaml
echo "---"
cat ./models/tfsimple3.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple3
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple3.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
model.mlops.seldon.io/tfsimple3 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
model.mlops.seldon.io/tfsimple3 condition met
cat ./pipelines/tfsimples-join.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: join
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple3
kubectl create -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
pipeline.mlops.seldon.io/join created
curl -k http://<INGRESS_IP>:80/v2/models/join/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: join.pipeline" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer join --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
kubectl delete -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
pipeline.mlops.seldon.io "join" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple3.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
model.mlops.seldon.io "tfsimple3" deleted
cat ./models/conditional.yaml
echo "---"
cat ./models/add10.yaml
echo "---"
cat ./models/mul10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: conditional
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/conditional"
requirements:
- triton
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
kubectl create -f ./models/conditional.yaml -n seldon-mesh
kubectl create -f ./models/add10.yaml -n seldon-mesh
kubectl create -f ./models/mul10.yaml -n seldon-mesh
model.mlops.seldon.io/conditional created
model.mlops.seldon.io/add10 created
model.mlops.seldon.io/mul10 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/conditional condition met
model.mlops.seldon.io/add10 condition met
model.mlops.seldon.io/mul10 condition met
cat ./pipelines/conditional.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-conditional
spec:
steps:
- name: conditional
- name: mul10
inputs:
- conditional.outputs.OUTPUT0
tensorMap:
conditional.outputs.OUTPUT0: INPUT
- name: add10
inputs:
- conditional.outputs.OUTPUT1
tensorMap:
conditional.outputs.OUTPUT1: INPUT
output:
steps:
- mul10
- add10
stepsJoin: any
kubectl create -f ./pipelines/conditional.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-conditional created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-conditional condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple-conditional/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple-conditional.pipeline" \
-d '{
"model_name": "conditional",
"inputs": [
{
"name": "CHOICE",
"datatype": "INT32",
"shape": [1],
"data": [0]
},
{
"name": "INPUT0",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
},
{
"name": "INPUT1",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
10,
20,
30,
40
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer tfsimple-conditional --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"conditional","inputs":[{"name":"CHOICE","contents":{"int_contents":[0]},"datatype":"INT32","shape":[1]},{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
}
]
}
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple-conditional/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple-conditional.pipeline" \
-d '{
"model_name": "conditional",
"inputs": [
{
"name": "CHOICE",
"datatype": "INT32",
"shape": [1],
"data": [1]
},
{
"name": "INPUT0",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
},
{
"name": "INPUT1",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
11,
12,
13,
14
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer tfsimple-conditional --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"conditional","inputs":[{"name":"CHOICE","contents":{"int_contents":[1]},"datatype":"INT32","shape":[1]},{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
kubectl delete -f ./pipelines/conditional.yaml -n seldon-mesh
kubectl delete -f ./models/conditional.yaml -n seldon-mesh
kubectl delete -f ./models/add10.yaml -n seldon-mesh
kubectl delete -f ./models/mul10.yaml -n seldon-mesh
cat ./models/mul10.yaml
echo "---"
cat ./models/add10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
kubectl create -f ./models/mul10.yaml -n seldon-mesh
kubectl create -f ./models/add10.yaml -n seldon-mesh
model.mlops.seldon.io/mul10 created
model.mlops.seldon.io/add10 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/mul10 condition met
model.mlops.seldon.io/add10 condition met
cat ./pipelines/pipeline-inputs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-inputs
spec:
steps:
- name: mul10
inputs:
- pipeline-inputs.inputs.INPUT0
tensorMap:
pipeline-inputs.inputs.INPUT0: INPUT
- name: add10
inputs:
- pipeline-inputs.inputs.INPUT1
tensorMap:
pipeline-inputs.inputs.INPUT1: INPUT
output:
steps:
- mul10
- add10
kubectl create -f ./pipelines/pipeline-inputs.yaml -n seldon-mesh
pipeline.mlops.seldon.io/pipeline-inputs created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/pipeline-inputs condition met
curl -k http://<INGRESS_IP>:80/v2/models/pipeline-inputs/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: pipeline-inputs.pipeline" \
-d '{
"model_name": "pipeline",
"inputs": [
{
"name": "INPUT0",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
},
{
"name": "INPUT1",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
10,
20,
30,
40
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
},
{
"data": [
11,
12,
13,
14
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer pipeline-inputs --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"pipeline","inputs":[{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
},
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
kubectl delete -f ./pipelines/pipeline-inputs.yaml -n seldon-mesh
kubectl delete -f ./models/mul10.yaml -n seldon-mesh
kubectl delete -f ./models/add10.yaml -n seldon-mesh
cat ./models/mul10.yaml
cat ./models/add10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
kubectl create -f ./models/mul10.yaml -n seldon-mesh
kubectl create -f ./models/add10.yaml -n seldon-mesh
model.mlops.seldon.io/mul10 created
model.mlops.seldon.io/add10 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/mul10 condition met
model.mlops.seldon.io/add10 condition met
cat ./pipelines/trigger-joins.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: trigger-joins
spec:
steps:
- name: mul10
inputs:
- trigger-joins.inputs.INPUT
triggers:
- trigger-joins.inputs.ok1
- trigger-joins.inputs.ok2
triggersJoinType: any
- name: add10
inputs:
- trigger-joins.inputs.INPUT
triggers:
- trigger-joins.inputs.ok3
output:
steps:
- mul10
- add10
stepsJoin: any
kubectl create -f ./pipelines/trigger-joins.yaml -n seldon-mesh
pipeline.mlops.seldon.io/trigger-joins created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/trigger-joins condition met
curl -k http://<INGRESS_IP>:80/v2/models/trigger-joins/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: trigger-joins.pipeline" \
-d '{
"model_name": "pipeline",
"inputs": [
{
"name": "ok1",
"datatype": "FP32",
"shape": [1],
"data": [1]
},
{
"name": "INPUT",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
10,
20,
30,
40
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer trigger-joins --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"pipeline","inputs":[{"name":"ok1","contents":{"fp32_contents":[1]},"datatype":"FP32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
}
]
}
curl -k http://<INGRESS_IP>:80/v2/models/trigger-joins/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: trigger-joins.pipeline" \
-d '{
"model_name": "pipeline",
"inputs": [
{
"name": "ok3",
"datatype": "FP32",
"shape": [1],
"data": [1]
},
{
"name": "INPUT",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
11,
12,
13,
14
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer trigger-joins --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"pipeline","inputs":[{"name":"ok3","contents":{"fp32_contents":[1]},"datatype":"FP32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
kubectl delete -f ./pipelines/trigger-joins.yaml -n seldon-mesh
pipeline.mlops.seldon.io "trigger-joins" deleted
kubectl delete -f ./models/mul10.yaml -n seldon-mesh
kubectl delete -f ./models/add10.yaml -n seldon-mesh
model.mlops.seldon.io "mul10" deleted
model.mlops.seldon.io "add10" deleted
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
This example illustrates a series of Pipelines that are joined together.
Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.
Ensure that you are performing these steps in the directory where you have downloaded the .
Get the IP address of the Seldon Core 2 instance running with Istio:
gs://seldon-models/triton/simple is an example Triton TensorFlow model that takes two
inputs, INPUT0 and INPUT1, and adds them to produce OUTPUT0; it also subtracts INPUT1
from INPUT0 to produce OUTPUT1. See
for the original source code and license.
Other models can be found at https://github.com/SeldonIO/triton-python-examples
ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
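Since the simple Triton model adds and subtracts its two inputs, the pipeline response can be checked programmatically. The following is a minimal sketch, assuming the tfsimple pipeline above is deployed, the Python requests package is installed, and INGRESS_IP is a placeholder for the ingress address used in the curl commands.
import requests

INGRESS_IP = "<INGRESS_IP>"  # placeholder, as in the curl commands above

inputs = list(range(1, 17))
payload = {
    "inputs": [
        {"name": "INPUT0", "datatype": "INT32", "shape": [1, 16], "data": inputs},
        {"name": "INPUT1", "datatype": "INT32", "shape": [1, 16], "data": inputs},
    ]
}
resp = requests.post(
    f"http://{INGRESS_IP}:80/v2/models/tfsimple/infer",
    json=payload,
    headers={
        "Host": "seldon-mesh.inference.seldon",
        "Seldon-Model": "tfsimple.pipeline",
    },
)
resp.raise_for_status()
outputs = {o["name"]: o["data"] for o in resp.json()["outputs"]}

# OUTPUT0 is the element-wise sum and OUTPUT1 the element-wise difference,
# so with identical inputs we expect 2x and all zeros, as in the output above.
assert outputs["OUTPUT0"] == [2 * v for v in inputs]
assert outputs["OUTPUT1"] == [0] * 16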
cat ./pipelines/tfsimple-extended.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimple.pipeline" \
-H "x-request-id: test-id" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --header x-request-id=test-id --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple
seldon.default.model.tfsimple1.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.model.tfsimple1.outputs test-id {"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.pipeline.tfsimple.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon pipeline inspect tfsimple-extended
seldon.default.model.tfsimple2.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.model.tfsimple2.outputs test-id {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
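The records shown by seldon pipeline inspect are read from the Kafka topics that Core 2 streams pipeline data through, so the same data can be watched with an ordinary Kafka consumer. Below is a minimal sketch, assuming the confluent-kafka Python client is installed and that the bootstrap address (a placeholder) points at the cluster's Kafka brokers; the topic name is taken from the inspect output above, and the record values are the raw inference payloads (which may be binary rather than JSON).
from confluent_kafka import Consumer

# Placeholder bootstrap address; point this at your cluster's Kafka brokers.
consumer = Consumer({
    "bootstrap.servers": "<KAFKA_BOOTSTRAP>:9092",
    "group.id": "pipeline-tap",
    "auto.offset.reset": "earliest",
})
# Topic name as shown in the inspect output above.
consumer.subscribe(["seldon.default.pipeline.tfsimple-extended.outputs"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("consumer error:", msg.error())
            continue
        # Key carries the request id; value is the raw payload for this pipeline run.
        print(msg.key(), msg.value()[:120])
finally:
    consumer.close()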
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat ./pipelines/tfsimple-extended.yaml
echo "---"
cat ./pipelines/tfsimple-extended2.yaml
echo "---"
cat ./pipelines/tfsimple-combined.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended2
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-combined
spec:
input:
externalInputs:
- tfsimple-extended.outputs.OUTPUT0
- tfsimple-extended2.outputs.OUTPUT1
tensorMap:
tfsimple-extended.outputs.OUTPUT0: INPUT0
tfsimple-extended2.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-combined.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended created
pipeline.mlops.seldon.io/tfsimple-extended2 created
pipeline.mlops.seldon.io/tfsimple-combined created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended condition met
pipeline.mlops.seldon.io/tfsimple-extended2 condition met
pipeline.mlops.seldon.io/tfsimple-combined condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimple.pipeline" \
-H "x-request-id: test-id2" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --header x-request-id=test-id2 --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple
seldon.default.model.tfsimple1.inputs test-id2 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.model.tfsimple1.outputs test-id2 {"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple.inputs test-id2 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.pipeline.tfsimple.outputs test-id2 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon pipeline inspect tfsimple-extended --offset 2 --verbose
seldon.default.model.tfsimple2.inputs test-id2 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]} x-request-id=[test-id2] x-forwarded-proto=[http] x-seldon-route=[:tfsimple1_1:] x-envoy-upstream-service-time=[1] pipeline=[tfsimple-extended] traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-e468d06afdab8f52-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id] x-forwarded-proto=[http] x-envoy-upstream-service-time=[5] x-seldon-route=[:tfsimple1_1:]
seldon.default.model.tfsimple2.outputs test-id2 {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]} x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id2] x-forwarded-proto=[http] x-seldon-route=[:tfsimple1_1: :tfsimple2_1:] x-envoy-upstream-service-time=[1] pipeline=[tfsimple-extended] traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-73bd1ee54a94d8fb-01]
seldon.default.pipeline.tfsimple-extended.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]} pipeline=[tfsimple-extended] traceparent=[00-3a6047efa647efc2b3fc5266ae023d23-fee12926788ce3b6-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id] x-forwarded-proto=[http] x-envoy-upstream-service-time=[5] x-seldon-route=[:tfsimple1_1:]
seldon.default.pipeline.tfsimple-extended.inputs test-id2 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]} x-forwarded-proto=[http] x-seldon-route=[:tfsimple1_1:] x-envoy-upstream-service-time=[1] pipeline=[tfsimple-extended] traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-4df8459a992e0278-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id2]
seldon.default.pipeline.tfsimple-extended.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]} pipeline=[tfsimple-extended] traceparent=[00-3a6047efa647efc2b3fc5266ae023d23-b2f899a739c5cafd-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id] x-forwarded-proto=[http] x-envoy-upstream-service-time=[5] x-seldon-route=[:tfsimple1_1: :tfsimple2_1:]
seldon.default.pipeline.tfsimple-extended.outputs test-id2 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]} x-envoy-upstream-service-time=[1] pipeline=[tfsimple-extended] traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-dfa399143feec23d-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id2] x-forwarded-proto=[http] x-seldon-route=[:tfsimple1_1: :tfsimple2_1:]
seldon pipeline inspect tfsimple-extended2 --offset 2
seldon.default.pipeline.tfsimple-extended2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended2.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon pipeline inspect tfsimple-combined
seldon.default.model.tfsimple2.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs test-id {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple-combined.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.pipeline.tfsimple-combined.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
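The numbers in the tfsimple-combined output follow directly from composing the simple model through the three pipelines: tfsimple produces the sum (2x) and the difference (0), each extended pipeline feeds those into tfsimple2, and tfsimple-combined joins OUTPUT0 from tfsimple-extended with OUTPUT1 from tfsimple-extended2. The sketch below models only that arithmetic in plain Python (it is not Core 2 code) to show why the final tensors are 4, 8, ..., 64 and all zeros.
def simple(input0, input1):
    # Add/subtract behaviour of the gs://seldon-models/triton/simple model.
    output0 = [a + b for a, b in zip(input0, input1)]
    output1 = [a - b for a, b in zip(input0, input1)]
    return output0, output1

x = list(range(1, 17))

# Pipeline tfsimple: a single tfsimple1 step.
t_out0, t_out1 = simple(x, x)                        # 2x, 0

# Pipelines tfsimple-extended / tfsimple-extended2: tfsimple2 on tfsimple's outputs.
ext_out0, ext_out1 = simple(t_out0, t_out1)          # 2x, 2x
ext2_out0, ext2_out1 = simple(t_out0, t_out1)        # 2x, 2x

# Pipeline tfsimple-combined: OUTPUT0 from extended joined with OUTPUT1 from extended2.
comb_out0, comb_out1 = simple(ext_out0, ext2_out1)   # 4x, 0

assert comb_out0 == [4 * v for v in x]   # 4, 8, ..., 64 as in the inspect output above
assert comb_out1 == [0] * 16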
kubectl delete -f ./pipelines/tfsimple-combined.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat ./pipelines/tfsimple-extended.yaml
echo "---"
cat ./pipelines/tfsimple-extended2.yaml
echo "---"
cat ./pipelines/tfsimple-combined-trigger.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended2
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-combined-trigger
spec:
input:
externalInputs:
- tfsimple-extended.outputs
externalTriggers:
- tfsimple-extended2.outputs
tensorMap:
tfsimple-extended.outputs.OUTPUT0: INPUT0
tfsimple-extended.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-combined-trigger.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended created
pipeline.mlops.seldon.io/tfsimple-extended2 created
pipeline.mlops.seldon.io/tfsimple-combined-trigger created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended condition met
pipeline.mlops.seldon.io/tfsimple-extended2 condition met
pipeline.mlops.seldon.io/tfsimple-combined-trigger condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimple.pipeline" \
-H "x-request-id: test-id3" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --header x-request-id=test-id3 --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple
seldon.default.model.tfsimple1.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.model.tfsimple1.outputs test-id3 {"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.pipeline.tfsimple.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon pipeline inspect tfsimple-extended --offset 2
seldon.default.model.tfsimple2.outputs test-id3 {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon pipeline inspect tfsimple-extended2 --offset 2
seldon.default.model.tfsimple2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended2.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon pipeline inspect tfsimple-combined-trigger
seldon.default.model.tfsimple2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs test-id3 {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple-combined-trigger.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.pipeline.tfsimple-combined-trigger.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
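tfsimple-combined-trigger produces the same tensors as tfsimple-combined because tfsimple-extended2 appears only under externalTriggers: its outputs gate when the pipeline runs but contribute no data, while both inputs to tfsimple2 come from tfsimple-extended.outputs. The toy sketch below illustrates that gating behaviour as implied by the spec above; it is an illustration only, not Core 2's implementation.
def simple(input0, input1):
    # Add/subtract behaviour of the simple Triton model.
    return ([a + b for a, b in zip(input0, input1)],
            [a - b for a, b in zip(input0, input1)])

def combined_trigger(extended_outputs, extended2_outputs):
    # Toy model of an externalTriggers join: extended2_outputs only decides
    # whether the step runs; the data fed to tfsimple2 comes entirely from
    # tfsimple-extended's outputs (OUTPUT0 -> INPUT0, OUTPUT1 -> INPUT1).
    if extended2_outputs is None:   # trigger has not fired: nothing to do
        return None
    out0, out1 = extended_outputs
    return simple(out0, out1)

x = list(range(1, 17))
two_x = [2 * v for v in x]
result = combined_trigger((two_x, two_x), (two_x, two_x))
assert result == ([4 * v for v in x], [0] * 16)   # matches the inspect output above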
kubectl delete -f ./pipelines/tfsimple-combined-trigger.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat ./pipelines/tfsimple-extended-step.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended-step
spec:
input:
externalInputs:
- tfsimple.step.tfsimple1.outputs
tensorMap:
tfsimple.step.tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple.step.tfsimple1.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
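Here the externalInputs entry references the outputs of an individual step (tfsimple.step.tfsimple1.outputs) rather than the tfsimple pipeline's own outputs. In this example the two are identical, because tfsimple's output is simply its tfsimple1 step, but the step form lets a downstream pipeline tap any intermediate step of another pipeline. The sketch below is illustrative only: a small helper that splits such dotted references into their parts, following the naming convention used in the specs above.
def parse_tensor_ref(ref):
    # Split a pipeline tensor reference such as those used in tensorMap entries.
    # Handles 'pipeline.outputs[.TENSOR]' and 'pipeline.step.STEP.outputs[.TENSOR]'.
    parts = ref.split(".")
    if "step" in parts:
        pipeline, _, step, kind, *tensor = parts
    else:
        pipeline, kind, *tensor = parts
        step = None
    return {"pipeline": pipeline, "step": step, "kind": kind,
            "tensor": tensor[0] if tensor else None}

assert parse_tensor_ref("tfsimple.step.tfsimple1.outputs.OUTPUT0") == {
    "pipeline": "tfsimple", "step": "tfsimple1", "kind": "outputs", "tensor": "OUTPUT0"}
assert parse_tensor_ref("tfsimple.outputs") == {
    "pipeline": "tfsimple", "step": None, "kind": "outputs", "tensor": None}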
kubectl create -f ./pipelines/tfsimple-extended-step.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended-step created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended-step condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimple.pipeline" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple --verbose
seldon.default.model.tfsimple1.inputs cg5g6ogfh5ss73a44vvg {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]} pipeline=[tfsimple] traceparent=[00-2c66ff815d920ad238365be52a4467f5-90824e4cb70c3242-01] x-forwarded-proto=[http] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[cg5g6ogfh5ss73a44vvg]
seldon.default.model.tfsimple1.outputs cg5g6ogfh5ss73a44vvg {"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]} x-request-id=[cg5g6ogfh5ss73a44vvg] pipeline=[tfsimple] x-envoy-upstream-service-time=[8] x-seldon-route=[:tfsimple1_1:] traceparent=[00-2c66ff815d920ad238365be52a4467f5-ca023a540fa463b3-01] x-forwarded-proto=[http] x-envoy-expected-rq-timeout-ms=[60000]
seldon.default.pipeline.tfsimple.inputs cg5g6ogfh5ss73a44vvg {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]} pipeline=[tfsimple] x-request-id=[cg5g6ogfh5ss73a44vvg] traceparent=[00-2c66ff815d920ad238365be52a4467f5-843d6ce39292396d-01] x-forwarded-proto=[http] x-envoy-expected-rq-timeout-ms=[60000]
seldon.default.pipeline.tfsimple.outputs cg5g6ogfh5ss73a44vvg {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]} x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[cg5g6ogfh5ss73a44vvg] x-envoy-upstream-service-time=[8] x-seldon-route=[:tfsimple1_1:] pipeline=[tfsimple] traceparent=[00-2c66ff815d920ad238365be52a4467f5-ee7527353e9fe5a2-01] x-forwarded-proto=[http]
seldon pipeline inspect tfsimple-extended-step
seldon.default.model.tfsimple2.inputs cg5g6ogfh5ss73a44vvg {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.model.tfsimple2.outputs cg5g6ogfh5ss73a44vvg {"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
seldon.default.pipeline.tfsimple-extended-step.inputs cg5g6ogfh5ss73a44vvg {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended-step.outputs cg5g6ogfh5ss73a44vvg {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
kubectl delete -f ./pipelines/tfsimple-extended-step.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat ./pipelines/tfsimple-extended.yaml
echo "---"
cat ./pipelines/tfsimple-extended2.yaml
echo "---"
cat ./pipelines/tfsimple-combined-step.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended2
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-combined-step
spec:
input:
externalInputs:
- tfsimple-extended.step.tfsimple2.outputs.OUTPUT0
- tfsimple-extended2.step.tfsimple2.outputs.OUTPUT0
tensorMap:
tfsimple-extended.step.tfsimple2.outputs.OUTPUT0: INPUT0
tfsimple-extended2.step.tfsimple2.outputs.OUTPUT0: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-combined-step.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended created
pipeline.mlops.seldon.io/tfsimple-extended2 created
pipeline.mlops.seldon.io/tfsimple-combined-step created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended condition met
pipeline.mlops.seldon.io/tfsimple-extended2 condition met
pipeline.mlops.seldon.io/tfsimple-combined-step condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple
seldon.default.model.tfsimple1.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]}
seldon.default.model.tfsimple1.outputs cg5g710fh5ss73a4500g {"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon.default.pipeline.tfsimple.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]}
seldon.default.pipeline.tfsimple.outputs cg5g710fh5ss73a4500g {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon pipeline inspect tfsimple-extended
seldon.default.model.tfsimple2.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs cg5g710fh5ss73a4500g {"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon.default.pipeline.tfsimple-extended.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended.outputs cg5g710fh5ss73a4500g {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
seldon pipeline inspect tfsimple-extended2
seldon.default.model.tfsimple2.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs cg5g710fh5ss73a4500g {"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon.default.pipeline.tfsimple-extended2.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.outputs cg5g710fh5ss73a4500g {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
seldon pipeline inspect tfsimple-combined-step
seldon.default.model.tfsimple2.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs cg5g710fh5ss73a4500g {"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon.default.pipeline.tfsimple-combined-step.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.pipeline.tfsimple-combined-step.outputs cg5g710fh5ss73a4500g {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-combined-step.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh



