We use Rclone to copy model artifacts from a storage location to the model servers. This allows users to take advantage of Rclone's support for over 40 cloud storage backends, including Amazon S3, Google Cloud Storage and many others.
For local storage while developing, see here.
For the authorization needed for cloud storage when running on Kubernetes, see here.
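For illustration, here is a minimal sketch of a Model spec using an Rclone-style `storageUri` (the bucket names and paths are hypothetical placeholders):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: example-model                  # hypothetical name
spec:
  # Any Rclone-supported URI can be used; these locations are placeholders.
  storageUri: "gs://my-bucket/models/example"
  # storageUri: "s3://my-bucket/models/example"
  # storageUri: "/mnt/models/example"  # local (mounted) folder while developing
  requirements:
  - sklearn
```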
The Model specification allows parameters to be passed to the loaded model for customization. For example:
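A hedged sketch of such a spec, assuming `spec.parameters` takes a list of name/value pairs as described below (the model name, artifact location, and parameter names are placeholders):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: custom-model                   # hypothetical name
spec:
  storageUri: "gs://my-bucket/models/custom"   # placeholder artifact location
  requirements:
  - mlserver
  - python
  parameters:                          # named keys/values passed to the loaded model
  - name: foo
    value: bar
  - name: threshold
    value: "0.5"
```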
This capability is only available for MLServer custom model runtimes. The named keys and values will be added to the `model-settings.json` file for the provided model in the `parameters.extra` dict. MLServer models are able to read these values in their `load` method.
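For illustration, a minimal sketch of how an MLServer custom runtime could read these values in its `load` method (the class and parameter names are hypothetical):

```python
from mlserver import MLModel


class CustomModel(MLModel):
    async def load(self) -> bool:
        # The Model spec's parameters are surfaced in model-settings.json
        # under parameters.extra.
        extra = self._settings.parameters.extra or {}
        self.foo = extra.get("foo")                     # hypothetical key
        self.threshold = float(extra.get("threshold", 0.5))
        return True
```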
This model allows a Pandas query to be run on the input to select rows. An example is shown below:
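A hedged sketch of such an invocation (the model name and artifact location are placeholders; the `query` parameter holds the Pandas query to run):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: filter                         # hypothetical name
spec:
  storageUri: "gs://my-bucket/models/pandasquery"   # placeholder artifact location
  requirements:
  - mlserver
  - python
  parameters:
  - name: query
    value: "A == 1"                    # select rows where tensor A has value 1
```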
This invocation filters for tensor `A` having the value 1. The model also returns a tensor called `status`, which indicates the operation run and whether it was a success. If no rows satisfy the query then only a `status` tensor output will be returned.
Further details on Pandas queries can be found here.
This model can be useful for conditional Pipelines. For example, you could have two invocations of this model, `choice-is-one` and `choice-is-two`, each filtering on a different value. By including these in a Pipeline as sketched below, we can define conditional routes:
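A hedged sketch of what this could look like, assuming `mul10` and `add10` Models are already defined (artifact locations, tensor names and the exact step wiring are placeholders to be checked against the referenced notebook):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: choice-is-one
spec:
  storageUri: "gs://my-bucket/models/pandasquery"   # placeholder artifact location
  requirements: [mlserver, python]
  parameters:
  - name: query
    value: "choice == 1"
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: choice-is-two
spec:
  storageUri: "gs://my-bucket/models/pandasquery"   # placeholder artifact location
  requirements: [mlserver, python]
  parameters:
  - name: query
    value: "choice == 2"
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: choice-routing                 # hypothetical pipeline name
spec:
  steps:
  - name: choice-is-one
  - name: mul10
    inputs:
    - choice-is-one.outputs.choice     # placeholder tensor; step runs only if rows were selected
  - name: choice-is-two
  - name: add10
    inputs:
    - choice-is-two.outputs.choice     # placeholder tensor; step runs only if rows were selected
  output:
    steps:
    - mul10
    - add10
    stepsJoin: any                     # return whichever branch produced an output
```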
Here the `mul10` model will be called if the `choice-is-one` model succeeds, and the `add10` model will be called if the `choice-is-two` model succeeds.
The full notebook can be found here.
To run your model inside Seldon you must supply an inference artifact that can be downloaded and run on one of the MLServer or Triton inference servers. The supported artifacts are listed below in alphabetical order.
Each artifact type, the server that runs it, the requirement tag to use, and an example are listed in the table below.
For many machine learning artifacts you can simply save them to a folder and load them into Seldon Core 2. Details are given below, along with links for creating a custom model settings file where needed.
For MLServer-targeted models you can create a `model-settings.json` file to help MLServer load your model, and place this alongside your artifact. See the MLServer project for details.
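A minimal sketch of such a file (the model name and implementation class here are illustrative; see the MLServer docs for the full schema):

```json
{
  "name": "iris",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./model.joblib",
    "version": "v0.1.0"
  }
}
```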
For Triton inference server models you can create a `config.pbtxt` configuration file alongside your artifact.
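For example, a minimal sketch for an ONNX model (tensor names, types and shapes are placeholders; see the Triton docs for the full schema):

```protobuf
name: "mymodel"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 3 ]
  }
]
```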
The `tag` field represents the tag you need to add to the `requirements` part of the Model spec for your artifact to be loaded on a compatible server, e.g. for an SKLearn model:
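A hedged sketch (the model name and artifact location are placeholders):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-sklearn-model               # hypothetical name
spec:
  storageUri: "gs://my-bucket/models/my-sklearn-model"   # placeholder artifact location
  requirements:
  - sklearn                            # tag matched against a compatible Server
```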
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. This means that, within a single instance of the server, you can serve multiple models under different paths. This is a feature provided out of the box by Nvidia Triton and Seldon MLServer, currently the two inference servers that are integrated in Seldon Core 2.
This deployment pattern allows the system to handle a large number of deployed models by letting them share the hardware resources allocated to inference servers (e.g. GPUs). For example, if a single model inference server is deployed on a node with one GPU, the models loaded on this inference server instance can effectively share that GPU. This is in contrast to a single-model-per-server deployment pattern, where only one model can use the allocated GPU.
Multi-model serving is enabled by design in Seldon Core 2. Based on the requirements specified by the user on a given `Model`, the Scheduler will find an appropriate model inference server instance to load the model onto.
In the example below, given that the model is a `tensorflow` model, the system will deploy it onto a `triton` server instance (matching the `Server` labels). Additionally, as the model `memory` requirement is `100Ki`, the system will pick a server instance that has enough (memory) capacity to host this model alongside any other models already loaded.
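A hedged sketch of such a Model (the artifact location is a placeholder; the `requirements` and `memory` fields drive the scheduling decision described above):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tfsimple                       # hypothetical name
spec:
  storageUri: "gs://my-bucket/models/tfsimple"   # placeholder artifact location
  requirements:
  - tensorflow                         # matched against the Server labels/capabilities
  memory: 100Ki                        # the scheduler picks a server with enough free capacity
```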
All models are loaded and active on this model server. Inference requests for these models are served concurrently and the hardware resources are shared to fulfil these inference requests.
Overcommit allows shared servers to handle more models than can fit in memory. This is done by keeping highly utilized models in memory and evicting other ones to disk using a least-recently-used (LRU) cache mechanism. From a user perspective these models are all registered and "ready" to serve inference requests. If an inference request comes for a model that is unloaded/evicted to disk, the system will reload the model first before forwarding the request to the inference server.
Overcommit is enabled by setting `SELDON_OVERCOMMIT_PERCENTAGE` on shared servers; it is set by default to 10%. In other words, a given model inference server instance can register models with a total memory requirement of up to `MEMORY_REQUEST * (1 + SELDON_OVERCOMMIT_PERCENTAGE / 100)`.
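For example, a server replica with a memory request of 2Gi and the default 10% overcommit can register models whose memory requirements total up to 2Gi * 1.1 = 2.2Gi, even though only a subset of those models will be resident in memory at any one time.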
The Seldon Agent (a sidecar next to each model inference server deployment) keeps track of inference request times for the different models. The models are kept sorted by the time of their last request, and this data structure is used to evict the least recently used model in order to make room for another incoming model. This happens in two scenarios:
- A new model load request arrives that goes beyond the active memory capacity of the inference server.
- An inference request arrives for a registered model that is not loaded in memory (i.e. previously evicted).
This is seamless to users. In particular, when a model is reloaded onto the inference server to respond to an inference request, the model artifact is cached on disk, which allows a faster reload (no remote artifact fetch). We therefore expect the extra latency of reloading a model during an inference request to be acceptable in many cases (with a lower bound of ~100ms).
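As a purely conceptual sketch (this is not the Seldon Agent's actual implementation), the eviction policy can be pictured as follows:

```python
import time


class LRUModelCache:
    """Toy illustration of least-recently-used eviction keyed on last request time."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.loaded = {}  # model name -> (memory_bytes, last_request_time)

    def _evict_until_fits(self, needed_bytes: int) -> None:
        # Evict the models with the oldest last-request time until the new model fits.
        while self.loaded and (
            sum(mem for mem, _ in self.loaded.values()) + needed_bytes > self.capacity_bytes
        ):
            lru_name = min(self.loaded, key=lambda name: self.loaded[name][1])
            self.loaded.pop(lru_name)  # in Seldon the artifact stays cached on disk

    def load(self, name: str, memory_bytes: int) -> None:
        # Scenario 1: a new model load request beyond the active memory capacity.
        self._evict_until_fits(memory_bytes)
        self.loaded[name] = (memory_bytes, time.monotonic())

    def infer(self, name: str, memory_bytes: int) -> None:
        # Scenario 2: an inference request for a registered model that was evicted.
        if name not in self.loaded:
            self.load(name, memory_bytes)  # reload before forwarding the request
        mem, _ = self.loaded[name]
        self.loaded[name] = (mem, time.monotonic())
```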
Overcommit can be disabled by setting `SELDON_OVERCOMMIT_PERCENTAGE` to 0 for a given shared server.
Note: currently the system uses the memory requirement values specified by the user on the Server and Model side. In the future we are looking at how to make the system handle memory management automatically.
Check this notebook for a local example.
Models provide the atomic building blocks of Seldon. They represent machine learning models, drift detectors, outlier detectors, explainers, feature transformations, and more complex routing models such as multi-armed bandits.
Seldon can handle a wide range of inference artifacts. Artifacts can be stored on any of the 40 or more cloud storage technologies supported by Rclone, as well as in a local (mounted) folder, as discussed above.
A Kubernetes YAML example is shown below for an SKLearn iris classification model:
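A hedged sketch of such a resource (the artifact location is a placeholder for a trained iris classifier saved with joblib):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://my-bucket/models/iris-sklearn"   # placeholder; any rclone URI works
  requirements:
  - sklearn
```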
Its Kubernetes `spec` has two core requirements:

- A `storageUri` specifying the location of the artifact. This can be any rclone URI specification.
- A `requirements` list which provides tags that need to be matched by the Server that can run this artifact type. By default, when you install Seldon we provide a set of Servers that cover a range of artifact types.
You can also load models directly over the scheduler gRPC service. An example using the grpcurl tool is shown below:
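A hedged sketch of what this can look like; the proto path, port, and message fields below are assumptions to be checked against the scheduler proto definitions referenced later on this page:

```bash
grpcurl \
  -d '{
        "model": {
          "meta": {"name": "iris"},
          "modelSpec": {
            "uri": "gs://my-bucket/models/iris-sklearn",
            "requirements": ["sklearn"],
            "memoryBytes": 500000
          },
          "deploymentSpec": {"replicas": 1}
        }
      }' \
  -plaintext \
  -proto apis/mlops/scheduler/scheduler.proto \
  0.0.0.0:9004 seldon.mlops.scheduler.Scheduler/LoadModel
```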
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. It is a feature provided out of the box by Nvidia Triton and Seldon MLServer. Multi-model serving reduces infrastructure hardware requirements (e.g. expensive GPUs), which enables the deployment of a large number of models while making it efficient to operate the system at scale.
Seldon Core 2 leverages multi-model serving by design and it is the default option for deploying models. The system will find an appropriate server to load the model onto based on the requirements that the user defines in the `Model` deployment definition.
Moreover, in many cases demand patterns allow for further overcommit of resources. Seldon Core 2 is able to register more models than can be served by the provisioned (memory) infrastructure and will dynamically swap models in and out according to least recent use, without adding significant latency overhead to inference workloads.
The proto buffer definitions for the scheduler are outlined here.
See here for more information.
See here for a discussion of autoscaling of models.
Type | Server | Tag | Example |
---|---|---|---|
Alibi-Detect | MLServer | `alibi-detect` | |
Alibi-Explain | MLServer | `alibi-explain` | |
DALI | Triton | `dali` | TBC |
Huggingface | MLServer | `huggingface` | |
LightGBM | MLServer | `lightgbm` | |
MLFlow | MLServer | `mlflow` | |
ONNX | Triton | `onnx` | |
OpenVino | Triton | `openvino` | TBC |
Custom Python | MLServer | `python`, `mlserver` | |
Custom Python | Triton | `python`, `triton` | |
PyTorch | Triton | `pytorch` | |
SKLearn | MLServer | `sklearn` | |
Spark Mlib | MLServer | `spark-mlib` | TBC |
Tensorflow | Triton | `tensorflow` | |
TensorRT | Triton | `tensorrt` | TBC |
Triton FIL | Triton | `fil` | TBC |
XGBoost | MLServer | `xgboost` | |
Type | Notes | Custom Model Settings |
---|---|---|
Alibi-Detect | | |
Alibi-Explain | | |
DALI | Follow the Triton docs to create a `config.pbtxt` and a model folder with the artifact. | |
Huggingface | Create an MLServer `model-settings.json` with the required Huggingface model. | |
LightGBM | Save the model to a file with extension `.bst`. | |
MLFlow | Use the `artifacts/model` folder created by your training run. | |
ONNX | Save your model with the name `model.onnx`. | |
OpenVino | Follow the Triton docs to create your model artifacts. | |
Custom MLServer Python | Create a Python file with a class that extends `MLModel`. | |
Custom Triton Python | Follow the Triton docs to create your `config.pbtxt` and associated Python files. | |
PyTorch | Create a Triton `config.pbtxt` describing inputs and outputs, and place the traced TorchScript model in the folder as `model.pt`. | |
SKLearn | Save the model via joblib to a file with extension `.joblib`, or with pickle to a file with extension `.pkl` or `.pickle`. | |
Spark Mlib | Follow the MLServer docs. | |
Tensorflow | Save the model in "Saved Model" format as `model.savedmodel`. If using the graphdef format you will need to create a Triton `config.pbtxt` and place your model in a numbered sub folder. HDF5 is not supported. | |
TensorRT | Follow the Triton docs to create your model artifacts. | |
Triton FIL | Follow the Triton docs to create your model artifacts. | |
XGBoost | Save the model to a file with extension `.bst` or `.json`. | |