There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.
To learn more about how you can write custom runtimes with MLServer, check out the Custom Runtimes user guide. Alternatively, you can also see this end-to-end example which walks through the process of writing a custom runtime.
Logs a new set of metric values. Each kwarg of this method will be treated as a separate metric / value pair. If any of the metrics does not exist, a new one will be created with a default description.
Registers a new metric with its description. If the metric already exists, it will just return the existing one.
This package provides an MLServer runtime compatible with MLflow models.
You can install the runtime, alongside mlserver, as:
pip install mlserver mlserver-mlflow

The MLflow inference runtime introduces a new dict content type, which decodes an incoming V2 request as a Python dictionary. This is useful for certain MLflow-serialised models, which will expect that the model inputs are serialised in this format.
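As a quick illustration, a V2 request decoded through the dict content type could look like the hedged sketch below. The input names and values are placeholders assumed for illustration, standing in for the named columns expected by the MLflow model's signature:

```json
{
  "parameters": {"content_type": "dict"},
  "inputs": [
    {"name": "age", "datatype": "INT64", "shape": [1], "data": [39]},
    {"name": "workclass", "datatype": "BYTES", "shape": [1], "data": ["State-gov"]}
  ]
}
```

Under this content type, the runtime would roughly decode the payload into a Python dictionary keyed by input name (e.g. {"age": ..., "workclass": ...}) before handing it over to the MLflow model.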
MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:
MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe. This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.
Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.
Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.
The streaming endpoints are available for both the infer and generate methods through the following endpoints:
MLServer exposes a Python framework to build custom inference runtimes, define request/response types, plug codecs for payload conversion, and emit metrics. This page provides a high-level overview and links to the API docs.
Base class to implement custom inference runtimes.
Core lifecycle: load(), predict(), unload()
To see MLServer in action you can check out the examples below. These are end-to-end notebooks, showing how to serve models with MLServer.
If you are interested in how MLServer interacts with particular model frameworks, you can check the following examples. These focus on showcasing the different inference runtimes that ship with MLServer out of the box. Note that, for advanced use cases, you can also write your own custom inference runtime (see the Custom Runtimes user guide).
/v2/models/{model_name}/versions/{model_version}/infer_stream
/v2/models/{model_name}/infer_stream
/v2/models/{model_name}/versions/{model_version}/generate_stream
/v2/models/{model_name}/generate_stream
Note that for REST, the generate and generate_stream endpoints are aliases for the infer and infer_stream endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).
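As a rough sketch of how a client could consume one of these REST streaming endpoints, the snippet below posts a single request and iterates over the streamed chunks. The model name, the input payload and the exact framing of each streamed chunk are assumptions made purely for illustration:

```python
import requests

# Hypothetical generative model served by MLServer
endpoint = "http://localhost:8080/v2/models/my-llm/generate_stream"
inference_request = {
    "inputs": [
        {"name": "prompt", "datatype": "BYTES", "shape": [1], "data": ["Hello"]}
    ]
}

# stream=True keeps the connection open so chunks can be read as they arrive
with requests.post(endpoint, json=inference_request, stream=True) as response:
    for chunk in response.iter_lines():
        if chunk:
            # Each chunk carries a (partial) inference response; the exact
            # wire framing depends on the server configuration.
            print(chunk.decode("utf-8"))
```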
Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.
The two streams operate independently, so the client and the server can read and write data however they want (e.g., the server could either wait to receive all the client messages before sending a response or it can send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case can be covered by bi-directional streaming as well, but mlserver already has the predict method dedicated to this use case. The logic for how the requests are received and processed, and how the responses are sent back, should be built into the runtime logic.
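To make this concrete, a custom runtime could implement this logic by overriding predict_stream(), which receives an async iterator of requests and yields responses. The sketch below simply emits one response per incoming request, echoing the first input back; the pairing of requests and responses, and the echo behaviour itself, are illustrative assumptions rather than anything prescribed by MLServer:

```python
from typing import AsyncIterator

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class EchoStreamRuntime(MLModel):
    async def load(self) -> bool:
        return True

    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        # The runtime decides how incoming and outgoing messages are paired;
        # here we emit one response per received request, echoing back the
        # first input of each request.
        async for request in payloads:
            echoed = request.inputs[0]
            yield InferenceResponse(
                model_name=self.name,
                outputs=[
                    ResponseOutput(
                        name=echoed.name,
                        datatype=echoed.datatype,
                        shape=echoed.shape,
                        data=echoed.data,
                    )
                ],
            )
```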
The stub method for streaming to be used by the client is ModelStreamInfer.
The streaming support in MLServer currently comes with a few limitations:
the parallel_workers setting should be set to 0 to disable distributed workers (to be addressed in future releases)
for REST, the gzip_enabled setting should be set to false to disable GZIP compression, as streaming is not compatible with GZIP compression (see issue here)
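Putting these two constraints together, a settings.json tuned for streaming could look like the minimal sketch below (only the fields relevant to streaming are shown):

```json
{
  "parallel_workers": 0,
  "gzip_enabled": false
}
```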
On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.
The autogenerated Swagger UI can be accessed under the /v2/docs endpoint.
Alongside the general API documentation, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.
The model-specific autogenerated Swagger UI can be accessed under the following endpoints:
/v2/models/{model_name}/docs
/v2/models/{model_name}/versions/{model_version}/docs
Open Inference Protocol
Main dataplane for inference, health and metadata
Model Repository Extension
Extension to the protocol to provide a control plane which lets you load / unload models dynamically
Helpers for encoding/decoding requests and responses.
Access to model metadata and settings.
Extend this class to implement your own model logic.
Data structures and enums for the V2 inference protocol.
Includes Pydantic models like InferenceRequest, InferenceResponse, RequestInput, ResponseOutput.
See model fields (type and default) and JSON Schemas in the docs.
Encode/decode payloads between Open Inference Protocol types and Python types.
Base classes: InputCodec (inputs/outputs) and RequestCodec (requests/responses).
Built-ins include codecs such as NumpyCodec, Base64Codec, StringCodec, etc.
Emit and configure metrics within MLServer.
Use log() to record custom metrics; see server lifecycle hooks and utilities.
The `dict` content type can be _stacked_ with other content types, like
[`np`](../../docs/user-guide/content-type).
This allows the user to use a different set of content types to decode each of
the dict entries.

If no content type is present on the request or metadata, the CatBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
pip install mlserver mlserver-catboost

pip install mlserver mlserver-alibi-explain

This package provides an MLServer runtime compatible with Spark MLlib.
You can install the runtime, alongside mlserver, as:
pip install mlserver mlserver-mllib

For further information on how to use MLServer with Spark MLlib, you can check out the MLServer repository.
The HuggingFace runtime will always decode the input request using its own built-in codec. Therefore, content type annotations at the request level will be ignored. Note that this doesn't include input-level content type annotations, which will be respected as usual.
The HuggingFace runtime exposes a couple of extra parameters which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json file, e.g.
It is possible to load a local model into a HuggingFace pipeline by specifying the model artefact folder path in parameters.uri in model-settings.json.
Models in the HuggingFace hub can be loaded by specifying their name in parameters.extra.pretrained_model in model-settings.json.
You can find the full reference of the accepted extra settings for the HuggingFace runtime below:
on_worker_stop(worker: Worker) -> None
start()
stop(sig: Optional[int] = None)
configure_metrics(settings: Settings)
log(metrics)
register(name: str, description: str) -> Histogram

pip install mlserver mlserver-huggingface

```{code-block} json
---
emphasize-lines: 5-8
---
{
"name": "qa",
"implementation": "mlserver_huggingface.HuggingFaceRuntime",
"parameters": {
"extra": {
"task": "question-answering",
"optimum_model": true
}
}
}
```

These settings can also be injected through environment variables prefixed with `MLSERVER_MODEL_HUGGINGFACE_`, e.g.
```bash
MLSERVER_MODEL_HUGGINGFACE_TASK="question-answering"
MLSERVER_MODEL_HUGGINGFACE_OPTIMUM_MODEL=true
```

If `parameters.extra.pretrained_model` is specified, it takes precedence over `parameters.uri`.
.. autopydantic_settings:: mlserver_huggingface.settings.HuggingFaceSettings

To see some of the advanced features included in MLServer (e.g. multi-model serving), check out the examples below.
Tutorials are designed to be beginner-friendly and walk through accomplishing a series of tasks using MLServer (and other tools).
This package provides an MLServer runtime compatible with LightGBM.
You can install the runtime, alongside mlserver, as:
pip install mlserver mlserver-lightgbm

For further information on how to use MLServer with LightGBM, you can check out this worked-out example.
If no content type is present on the request or metadata, the LightGBM runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
Abstract inference runtime which exposes the main interface to interact with ML models.
Helper to decode a request input into its corresponding high-level Python object. This method will find the most appropriate :doc:`input codec </user-guide/content-type>` based on the model's metadata and the input's content type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.
Helper to decode an inference request into its corresponding high-level Python object. This method will find the most appropriate :doc:`request codec </user-guide/content-type>` based on the model's metadata and the request's content type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.
Helper to encode a high-level Python object into its corresponding response output. This method will find the most appropriate :doc:`input codec </user-guide/content-type>` based on the model's metadata, the request output's content type or the payload's type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.
Helper to encode a high-level Python object into its corresponding inference response. This method will find the most appropriate :doc:`request codec </user-guide/content-type>` based on the payload's type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.
Method responsible for loading the model from a model artefact. This method will be called on each of the parallel workers (when :doc:`parallel inference </user-guide/parallel-inference>` is enabled). Its return value will represent the model's readiness status. A return value of True will mean the model is ready.
This method can be overridden to implement your custom load logic.
Method responsible for running inference on the model.
This method can be overridden to implement your custom inference logic.
Method responsible for running generation on the model, streaming a set of responses back to the client.
This method can be overridden to implement your custom inference logic.
Method responsible for unloading the model, freeing any resources (e.g. CPU memory, GPU memory, etc.). This method will be called on each of the parallel workers (when :doc:`parallel inference </user-guide/parallel-inference>` is enabled). A return value of True will mean the model is now unloaded.
This method can be overridden to implement your custom unload logic.
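To make the lifecycle above more concrete, here is a minimal sketch of a custom runtime overriding load(), predict() and unload(). The joblib-based model loading and the use of NumpyRequestCodec are illustrative assumptions, not requirements of the interface:

```python
import joblib

from mlserver import MLModel
from mlserver.codecs import NumpyRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MyRuntimeSketch(MLModel):
    async def load(self) -> bool:
        # Load the model artefact pointed to by the model's settings
        self._model = joblib.load(self.settings.parameters.uri)
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the incoming V2 request into a NumPy array...
        model_input = self.decode_request(payload, default_codec=NumpyRequestCodec)
        # ...run inference...
        model_output = self._model.predict(model_input)
        # ...and encode the result back into a V2 response
        return self.encode_response(model_output, default_codec=NumpyRequestCodec)

    async def unload(self) -> bool:
        # Free any resources held by the model
        self._model = None
        return True
```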
Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the section below.
By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the settings described below.
The Global Interpreter Lock (GIL) is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python's execution so that it only runs on a single processor at the same time. This simplifies certain things for the interpreter. However, it also adds the limitation that a single Python process can only ever make use of one core at a time, no matter how many threads it spawns.
This page links to the key reference docs for configuring and using MLServer.
Server-wide configuration (e.g., HTTP/GRPC ports) loaded from a settings.json in the working directory. Settings can also be provided via environment variables prefixed with MLSERVER_ (e.g., MLSERVER_GRPC_PORT).
MLServer supports loading and unloading models dynamically from a models repository. This allows you to enable and disable the models accessible by MLServer on demand. This extension builds on top of the support for Multi-Model Serving, letting you change at runtime which models MLServer is currently serving.
The API to manage the model repository is modelled after the V2 Dataplane and is thus fully compatible with it.
This notebook will walk you through an example using the Model Repository API.
First of all, we will need to train some models. For that, we will re-use the models we trained previously in the Multi-Model Serving example. You can check the details on how they are trained following that notebook.
decode(request_input: RequestInput, default_codec: Union[type[ForwardRef('InputCodec')], ForwardRef('InputCodec'), None] = None) -> Any

When we think about MLServer's support for Multi-Model Serving (MMS), this could lead to scenarios where a heavily-used model starves the other models running within the same MLServer instance. Similarly, even if we don’t take MMS into account, the GIL also makes it harder to scale inference for a single model.
To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.
Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the multiprocessing library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing these types of distributed strategies {cite}`zhiFiberPlatformEfficient2020`.
The extra overhead introduced by other libraries is usually brought in as a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.
Despite the above, even though this overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can take a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option to disable the parallel inference feature.
For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.
By default, MLServer will always create an inference pool with one single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level parallel_workers setting.
The parallel_workers field of the settings.json file (or alternatively, the MLSERVER_PARALLEL_WORKERS global environment variable) controls the size of MLServer's inference pool. The expected values are:
N, where N > 0, will create a pool of N workers.
0, will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
The inference_pool_gid field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_INFERENCE_POOL_GID global environment variable) allows you to load models on a dedicated inference pool based on a group ID (GID), to prevent starvation behaviour.
Complementing the inference_pool_gid, if the autogenerate_inference_pool_gid field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_AUTOGENERATE_INFERENCE_POOL_GID global environment variable) is set to True, a UUID is automatically generated, and a dedicated inference pool will load the given model. This option is useful if the user wants to load a single model on a dedicated inference pool without having to manage the GID themselves.
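As a configuration sketch, these settings could be combined as follows; the worker count, model name and runtime below are placeholder values. A server-wide settings.json creating a pool of 4 workers:

```json
{
  "parallel_workers": 4
}
```

And a model-settings.json that loads one model on its own auto-generated inference pool:

```json
{
  "name": "my-model",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "autogenerate_inference_pool_gid": true
  }
}
```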
Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020. arXiv:2003.11164.
Scope: server-wide (independent from model-specific settings)
Sources: settings.json or env vars MLSERVER_*
Each model has its own configuration (metadata, parallelism, etc.). Typically provided via a model-settings.json next to the model artifacts. Alternatively, use env vars prefixed with MLSERVER_MODEL_ (e.g., MLSERVER_MODEL_IMPLEMENTATION). If no model-settings.json is found, MLServer will try to load a default model from these env vars. Note: these env vars are shared across models unless overridden by model-settings.json.
Scope: per-model
Sources: model-settings.json or env vars MLSERVER_MODEL_*
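For instance, a hedged sketch of serving a default model configured purely through environment variables (the model name and runtime below are placeholders) could look like:

```bash
export MLSERVER_MODEL_NAME="my-model"
export MLSERVER_MODEL_IMPLEMENTATION="mlserver_sklearn.SKLearnModel"
mlserver start .
```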
The mlserver CLI helps with common model lifecycle tasks (build images, init projects, start serving, etc.). For a quick overview:
Commands include: build, dockerfile, infer (deprecated), init, start
Each command lists its options, arguments, and examples
Build custom runtimes and integrate with MLServer using Python:
MLModel: base class for custom inference runtimes
Types: request/response schemas and enums (Pydantic)
Codecs: payload conversions between protocol types and Python types
Metrics: emit and configure metrics
If no content type is present on the request or metadata, the Alibi-Detect runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The Alibi Detect runtime exposes a couple of settings which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json file, e.g.
You can find the full reference of the accepted extra settings for the Alibi Detect runtime below:
Next up, we will start our mlserver inference server. Note that, by default, this will load all our models.
Now that we've got our inference server up and running, and serving 2 different models, we can start using the Model Repository API. To get us started, we will first list all available models in the repository.
As we can see, the repository lists 2 models (i.e. mushroom-xgboost and mnist-svm). Note that the state for both is set to READY. This means that both models are loaded, and thus ready for inference.
We will now try to unload one of the 2 models, mushroom-xgboost. This will unload the model from the inference server but will keep it available on our model repository.
If we now try to list the models available in our repository, we will see that the mushroom-xgboost model is flagged as UNAVAILABLE. This means that it's present in the repository but it's not loaded for inference.
We will now load our model back into our inference server.
If we now try to list the models again, we will see that our mushroom-xgboost is back again, ready for inference.
decode_request(inference_request: InferenceRequest, default_codec: Union[type[ForwardRef('RequestCodec')], ForwardRef('RequestCodec'), None] = None) -> Any
encode(payload: Any, request_output: RequestOutput, default_codec: Union[type[ForwardRef('InputCodec')], ForwardRef('InputCodec'), None] = None) -> ResponseOutput
encode_response(payload: Any, default_codec: Union[type[ForwardRef('RequestCodec')], ForwardRef('RequestCodec'), None] = None) -> InferenceResponse
load() -> bool
metadata() -> MetadataModelResponse
predict(payload: InferenceRequest) -> InferenceResponse
predict_stream(payloads: AsyncIterator[InferenceRequest]) -> AsyncIterator[InferenceResponse]
unload() -> bool

mlserver --help

pip install mlserver mlserver-alibi-detect

```{code-block} json
---
emphasize-lines: 6-8
---
{
"name": "drift-detector",
"implementation": "mlserver_alibi_detect.AlibiDetectRuntime",
"parameters": {
"uri": "./alibi-detect-artifact/",
"extra": {
"batch_size": 5
}
}
}
```

.. autopydantic_settings:: mlserver_alibi_detect.runtime.AlibiDetectSettings

!cp -r ../mms/models/* ./models

mlserver start .

import requests
response = requests.post("http://localhost:8080/v2/repository/index", json={})
response.json()

requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/unload")

response = requests.post("http://localhost:8080/v2/repository/index", json={})
response.json()

requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/load")

response = requests.post("http://localhost:8080/v2/repository/index", json={})
response.json()

If no content type is present on the request or metadata, the Scikit-Learn runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The Scikit-Learn inference runtime exposes a number of outputs depending on the model type. These outputs match the predict, predict_proba and transform methods of the Scikit-Learn model.

| Output | Returned by default | Availability |
| --- | --- | --- |
| predict | ✅ | Available on most models, but not on transformers. |
| predict_proba | ❌ | Only available on non-regressor models. |
| transform | ❌ | Only available on transformers. |
By default, the runtime will only return the output of predict. However, you are able to control which outputs you want back through the outputs field of your {class}InferenceRequest <mlserver.types.InferenceRequest> payload.
For example, to only return the model's predict_proba output, you could define a payload such as:
Out of the box, mlserver supports the deployment and serving of lightgbm models. By default, it will assume that these models have been serialised using the bst.save_model() method.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
To test the LightGBM Server, first we need to generate a simple LightGBM model using Python.
Our model will be persisted as a file named iris-lightgbm.bst.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or pointing to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of box, or we can build our request manually.
As we can see above, the model predicted the probability for each class, and the probability of class 1 is the highest, close to 0.99, which matches what's on the test set.
Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice.
Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common ML frameworks. This allows you to start serving models saved in these frameworks straight away. To avoid bringing in dependencies for frameworks that you don't need to use, these runtimes are implemented as independent (and optional) Python packages. This mechanism also allows you to roll out your own custom runtimes very easily.
To pick which runtime you want to use for your model, you just need to make sure that the right package is installed, and then point to the correct runtime class in your model-settings.json file.
MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".
There are usually two main reasons to adopt adaptive batching:
Maximise resource usage. Usually, inference operations are “vectorised” (i.e. are designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it’s used at maximum capacity, we need to run inference across batches.
pip install mlserver mlserver-sklearn

```{code-block} json
---
emphasize-lines: 10-12
---
{
"inputs": [
{
"name": "my-input",
"datatype": "INT32",
"shape": [2, 2],
"data": [1, 2, 3, 4]
}
],
"outputs": [
{ "name": "predict_proba" }
]
}
```

import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import os
model_dir = "."
BST_FILE = "iris-lightgbm.bst"
iris = load_iris()
y = iris['target']
X = iris['data']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
dtrain = lgb.Dataset(X_train, label=y_train)
params = {
'objective':'multiclass',
'metric':'softmax',
'num_class': 3
}
lgb_model = lgb.train(params=params, train_set=dtrain)
model_file = os.path.join(model_dir, BST_FILE)
lgb_model.save_model(model_file)

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| environment_tarball | Optional[str] | None | Path to the environment tarball which should be used to load this model. |
| extra | Optional[dict] | `<factory>` | Arbitrary settings, dependent on the inference runtime implementation. |
| format | Optional[str] | None | Format of the model (only available on certain runtimes). |
| inference_pool_gid | Optional[str] | None | Inference pool group id to be used to serve this model. |
| uri | Optional[str] | None | URI where the model artifacts can be found. This path must be either absolute or relative to where MLServer is running. |
| version | Optional[str] | None | Version of the model. |
| env_prefix | str | "MLSERVER_MODEL_" | |
| env_file | str | ".env" | |
| protected_namespaces | tuple | ('model_', 'settings_') | |
| autogenerate_inference_pool_gid | bool | False | Flag to autogenerate the inference pool group id for this model. |
| content_type | Optional[str] | None | Default content type to use for requests and responses. |
| environment_path | Optional[str] | None | Path to a directory that contains the python environment to be used to load this model. |
--version (Default: False) Show the version and exit.
Build a Docker image for a custom MLServer runtime.
-t, --tag <text>
--no-cache (Default: False)
FOLDER Required argument
Generate a Dockerfile
-i, --include-dockerignore (Default: False)
FOLDER Required argument
Deprecated: This experimental feature will be removed in future work. Execute batch inference requests against V2 inference server.
Deprecated: This experimental feature will be removed in future work.
--url, -u <text> (Default: localhost:8080; Env: MLSERVER_INFER_URL) URL of the MLServer to send inference requests to. Should not contain http or https.
--model-name, -m <text> (Required; Env: MLSERVER_INFER_MODEL_NAME) Name of the model to send inference requests to.
--input-data-path, -i <path> (Required; Env: MLSERVER_INFER_INPUT_DATA_PATH) Local path to the input file containing inference requests to be processed.
--output-data-path, -o <path> (Required; Env: MLSERVER_INFER_OUTPUT_DATA_PATH) Local path to the output file for the inference responses to be written to.
--workers, -w <integer> (Default: 10; Env: MLSERVER_INFER_WORKERS)
--retries, -r <integer> (Default: 3; Env: MLSERVER_INFER_RETRIES)
--batch-size, -s <integer> (Default: 1; Env: MLSERVER_INFER_BATCH_SIZE) Send inference requests grouped together as micro-batches.
--binary-data, -b (Default: False; Env: MLSERVER_INFER_BINARY_DATA) Send inference requests as binary data (not fully supported).
--verbose, -v (Default: False; Env: MLSERVER_INFER_VERBOSE) Verbose mode.
--extra-verbose, -vv (Default: False; Env: MLSERVER_INFER_EXTRA_VERBOSE) Extra verbose mode (shows detailed requests and responses).
--transport, -t <choice> (Options: rest | grpc; Default: rest; Env: MLSERVER_INFER_TRANSPORT) Transport type to use to send inference requests. Can be 'rest' or 'grpc' (not yet supported).
--request-headers, -H <text> (Env: MLSERVER_INFER_REQUEST_HEADERS) Headers to be set on each inference request sent to the server. Multiple options are allowed as: -H 'Header1: Val1' -H 'Header2: Val2'. When set through the environment variable, provide them as 'Header1:Val1 Header2:Val2'.
--timeout <integer> (Default: 60; Env: MLSERVER_INFER_CONNECTION_TIMEOUT) Connection timeout to be passed to tritonclient.
--batch-interval <float> (Default: 0; Env: MLSERVER_INFER_BATCH_INTERVAL) Minimum time interval (in seconds) between requests made by each worker.
--batch-jitter <float> (Default: 0; Env: MLSERVER_INFER_BATCH_JITTER) Maximum random jitter (in seconds) added to batch interval between requests.
--use-ssl (Default: False; Env: MLSERVER_INFER_USE_SSL) Use SSL in communications with inference server.
--insecure (Default: False; Env: MLSERVER_INFER_INSECURE) Disable SSL verification in communications. Use with caution.
Generate a base project template
-t, --template <text> (Default: https://github.com/EthicalML/sml-security/)
Start serving a machine learning model with MLServer.
FOLDER Required argument
%%writefile settings.json
{
"debug": "true"
}

%%writefile model-settings.json
{
"name": "iris-lgb",
"implementation": "mlserver_lightgbm.LightGBMModel",
"parameters": {
"uri": "./iris-lightgbm.bst",
"version": "v0.1.0"
}
}

mlserver start .

import requests
x_0 = X_test[0:1]
inference_request = {
"inputs": [
{
"name": "predict-prob",
"shape": x_0.shape,
"datatype": "FP32",
"data": x_0.tolist()
}
]
}
endpoint = "http://localhost:8080/v2/models/iris-lgb/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)
response.json()

y_test[0]

mlserver --help
mlserver [OPTIONS] COMMAND [ARGS]...
mlserver build [OPTIONS] FOLDER
mlserver dockerfile [OPTIONS] FOLDER
mlserver infer [OPTIONS]
mlserver init [OPTIONS]
mlserver start [OPTIONS] FOLDER

| Framework | Package Name | Implementation Class |
| --- | --- | --- |
| Scikit-Learn | mlserver-sklearn | mlserver_sklearn.SKLearnModel |
| XGBoost | mlserver-xgboost | mlserver_xgboost.XGBoostModel |
Minimise any inference overhead. Usually, all models will have to “pay” a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing in the incoming data. Up to a certain size, this overhead tends to not scale linearly with the number of data points. Therefore, it’s in our interest to send as large batches as we can without deteriorating performance.
However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it will be important to configure it with the appropriate values for your model. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.
MLServer lets you configure adaptive batching independently for each model through two main parameters:
Maximum batch size, that is how many requests you want to group together.
Maximum batch time, that is how much time we should wait for new requests until we reach our maximum batch size.
The max_batch_size field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_MAX_BATCH_SIZE global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:
N, where N > 1, will create batches of up to N elements.
0 or 1, will disable adaptive batching.
The max_batch_time field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_MAX_BATCH_TIME global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.
The expected format is in seconds, but it will take fractional values. That is, 500ms could be expressed as 0.5.
The expected values are:
T, where T > 0, will wait T seconds at most.
0, will disable adaptive batching.
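For example, a model-settings.json that groups up to 8 requests per batch, waiting at most 100ms (i.e. 0.1 seconds) for a batch to fill up, could look like the sketch below; the model name and runtime are placeholders:

```json
{
  "name": "my-model",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "max_batch_size": 8,
  "max_batch_time": 0.1
}
```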
MLServer allows adding custom parameters to the parameters field of the requests. These parameters are received as a merged list of parameters inside the server, e.g.
is received as follows in the batched request in the server:
The same way if the request is sent back from the server as a batched request
it will be returned unbatched from the server as follows:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| env_prefix | str | "MLSERVER_MODEL_" | |
| env_file | str | ".env" | |
| protected_namespaces | tuple | ('model_', 'settings_') | |
| cache_enabled | bool | False | Enable caching for a specific model. This parameter can be used to disable cache for a specific model, if the server-level caching is enabled. If the server-level caching is disabled, this parameter value will have no effect. |
| implementation_ | str | - | Python path to the inference runtime to use to serve this model (e.g. mlserver_sklearn.SKLearnModel). |
| inputs | List[MetadataTensor] | `<factory>` | Metadata about the inputs accepted by the model. |
| extra | str | "ignore" | |
KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.
Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.
To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your InferenceService manifest.
For example, to serve a Scikit-Learn model, you could use a manifest like the one below:
As you can see highlighted above, the InferenceService manifest will only need to specify the following points:
The model artifact is a Scikit-Learn model. Therefore, we will use the sklearn serving runtime to deploy it.
The model will be served using the V2 inference protocol, which can be enabled by setting the protocolVersion field to v2.
Once you have your InferenceService manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:
As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.
| Framework | KServe Serving Runtime |
| --- | --- |
| Scikit-Learn | sklearn |
| XGBoost | xgboost |
Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.
Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe support for custom runtimes.
The InferenceService manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write an InferenceService manifest like the one below:
As we can see highlighted above, the main points that we'll need to take into account are:
Pointing to our custom MLServer image in the custom container section of our InferenceService.
Explicitly choosing the V2 inference protocol to serve our model.
Let KServe know what port will be exposed by our custom container to send inference requests.
Once you have your InferenceService manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:
Out of the box, mlserver supports the deployment and serving of alibi_detect models. Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. In this example, we will cover how we can create a detector configuration to then serve it using mlserver.
The first step will be to fetch reference data and other relevant metadata for an alibi-detect model.
For that, we will use the alibi library to get the adult dataset with alibi.datasets.fetch_adult().
This example is based on the Categorical and mixed type data drift detection on income prediction tabular data example from the Alibi Detect documentation.
Now that we have the reference data and other configuration parameters, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Now that we have our config in-place, we can start the server by running the mlserver start command. This needs to either be run from the same directory where our config files are or pointing to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our alibi-detect model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of box, or we can build our request manually.
This package provides an MLServer runtime compatible with XGBoost.
You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-xgboost

For further information on how to use MLServer with XGBoost, you can check out this worked-out example.
The XGBoost inference runtime will expect that your model is serialised via one of the following methods:
If no content type is present on the request or metadata, the XGBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The XGBoost inference runtime exposes a number of outputs depending on the model type. These outputs match the predict and predict_proba methods of the XGBoost model.
By default, the runtime will only return the output of predict. However, you are able to control which outputs you want back through the outputs field of your {class}InferenceRequest <mlserver.types.InferenceRequest> payload.
For example, to only return the model's predict_proba output, you could define a payload such as:
Out of the box, mlserver supports the deployment and serving of xgboost models. By default, it will assume that these models have been serialised using the bst.save_model() method.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
The first step will be to train a simple xgboost model. For that, we will use the mushrooms (agaricus) dataset from the XGBoost Getting Started guide.
To save our trained model, we will serialise it using bst.save_model() and the JSON format. This is the approach recommended by the XGBoost project.
Our model will be persisted as a file named mushroom-xgboost.json.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or pointing to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of box, or we can build our request manually.
As we can see above, the model predicted the input as close to 0, which matches what's on the test set.
Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.
On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.
By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal requests queues. These internal queues are used for adaptive batching and communication with the inference workers.
On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.
On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.
MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the mlserver.register() and mlserver.log() methods.
mlserver.register: Register a new metric.
mlserver.log: Log a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.
Custom metrics will generally be registered in the load() method and then used in the predict() method of your custom runtime.
For metrics specific to a model (e.g. request counts, etc.), MLServer will always label these with the model name and model version. Downstream, this will allow you to aggregate and query metrics per model.
Below, you can find the list of standardised labels that you will be able to find on model-specific metrics:
MLServer will expose metric values through a metrics endpoint exposed on its own metrics server. This endpoint can be polled by Prometheus or other Prometheus-compatible backends.
Below you can find the available settings to control the behaviour of the metrics server:
Out of the box, mlserver supports the deployment and serving of scikit-learn models. By default, it will assume that these models have been serialised using joblib.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
The first step will be to train a simple scikit-learn model. For that, we will use the digits classification example from the scikit-learn documentation, which trains an SVM model.
To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.
Our model will be persisted as a file named mnist-svm.joblib.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or pointing to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of box, or we can build our request manually.
As we can see above, the model predicted the input as the number 8, which matches what's on the test set.
It's not unusual that model runtimes require extra dependencies that are not direct dependencies of MLServer. This is the case when we want to use custom runtimes, but also when our model artifacts are the output of older versions of a toolkit (e.g. models trained with an older version of SKLearn).
In these cases, since these dependencies (or dependency versions) are not known in advance by MLServer, they won't be included in the default seldonio/mlserver Docker image. To cover these cases, the seldonio/mlserver Docker image allows you to load custom environments before starting the server itself.
This example will walk you through how to create and save a custom environment, so that it can be loaded in MLServer without any extra change to the seldonio/mlserver Docker image.
# request 1
types.RequestInput(
name="parameters-np",
shape=[1],
datatype="BYTES",
data=[],
parameters=types.Parameters(
custom-param='value-1',
)
)
# request 2
types.RequestInput(
name="parameters-np",
shape=[1],
datatype="BYTES",
data=[],
parameters=types.Parameters(
custom-param='value-2',
)
)

types.RequestInput(
name="parameters-np",
shape=[2],
datatype="BYTES",
data=[],
parameters=types.Parameters(
custom-param=['value-1', 'value-2'],
)
)

types.ResponseOutput(
name="foo",
datatype="INT32",
shape=[3, 3],
data=[1, 2, 3, 4, 5, 6, 7, 8, 9],
parameters=types.Parameters(
content_type="np",
foo=["foo_1", "foo_2"],
bar=["bar_1", "bar_2", "bar_3"],
),
)

# Request 1
types.ResponseOutput(
name="foo",
datatype="INT32",
shape=[1, 3],
data=[1, 2, 3],
parameters=types.Parameters(
content_type="np", foo="foo_1", bar="'bar_1"
),
)
# Request 2
types.ResponseOutput(
name="foo",
datatype="INT32",
shape=[1, 3],
data=[4, 5, 6],
parameters=types.Parameters(
content_type="np", foo="foo_2", bar="bar_2"
),
)
# Request 3
types.ResponseOutput(
name="foo",
datatype="INT32",
shape=[1, 3],
data=[7, 8, 9],
parameters=types.Parameters(content_type="np", bar="bar_3"),
)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: my-model
spec:
predictor:
sklearn:
protocolVersion: v2
storageUri: gs://seldon-models/sklearn/iris

kubectl apply -f my-inferenceservice-manifest.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: my-model
spec:
predictor:
containers:
- name: classifier
image: my-custom-server:0.1.0
env:
- name: PROTOCOL
value: v2
ports:
- containerPort: 8080
protocol: TCP

kubectl apply -f my-inferenceservice-manifest.yaml
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| max_batch_size | int | 0 | When adaptive batching is enabled, maximum number of requests to group together in a single batch. |
| max_batch_time | float | 0.0 | When adaptive batching is enabled, maximum amount of time (in seconds) to wait for enough requests to build a full batch. |
| name | str | '' | Name of the model. |
| outputs | List[MetadataTensor] | `<factory>` | Metadata about the outputs returned by the model. |
| parallel_workers | Optional[int] | None | Use the parallel_workers field in the server-wide settings instead. |
| parameters | Optional[ModelParameters] | None | Extra parameters for each instance of this model. |
| platform | str | '' | Framework used to train and serialise the model (e.g. sklearn). |
| versions | List[str] | `<factory>` | Versions of dependencies used to train the model (e.g. sklearn/0.20.1). |
| warm_workers | bool | False | Inference workers will now always be warmed up at start time. |
| Framework | Package Name | Implementation Class |
| --- | --- | --- |
| Spark MLlib | mlserver-mllib | mlserver_mllib.MLlibModel |
| LightGBM | mlserver-lightgbm | mlserver_lightgbm.LightGBMModel |
| CatBoost | mlserver-catboost | mlserver_catboost.CatboostModel |
| MLflow | mlserver-mlflow | mlserver_mlflow.MLflowRuntime |
| Alibi-Detect | mlserver-alibi-detect | mlserver_alibi_detect.AlibiDetectRuntime |
| Extension | Serialisation Method |
| --- | --- |
| *.json | booster.save_model("model.json") |
| *.ubj | booster.save_model("model.ubj") |
| *.bst | booster.save_model("model.bst") |
| Output | Returned by default | Availability |
| --- | --- | --- |
| predict | ✅ | Available on all XGBoost models. |
| predict_proba | ❌ | Only available on non-regressor models (i.e. XGBClassifier models). |
| Metric | Description |
| --- | --- |
| model_infer_request_success | Number of successful inference requests. |
| model_infer_request_failure | Number of failed inference requests. |
| batch_request_queue | Queue size for the adaptive batching queue. |
| parallel_request_queue | Queue size for the inference workers queue. |
| [rest_server]_requests | Number of REST requests, labelled by endpoint and status code. |
| [rest_server]_requests_duration_seconds | Latency of REST requests. |
| [rest_server]_requests_in_progress | Number of in-flight REST requests. |
| grpc_server_handled | Number of gRPC requests, labelled by gRPC code and method. |
| grpc_server_started | Number of in-flight gRPC requests. |
| Label | Description |
| --- | --- |
| model_name | Model Name (e.g. my-custom-model) |
| model_version | Model Version (e.g. v1.2.3) |
| Setting | Description | Default |
| --- | --- | --- |
| metrics_endpoint | Path under which the metrics endpoint will be exposed. | /metrics |
| metrics_port | Port used to serve the metrics server. | 8082 |
| metrics_rest_server_prefix | Prefix used for metric names specific to MLServer's REST inference interface. | rest_server |
| metrics_dir | Directory used to store internal metric files (used to support metrics sharing across inference workers). This is equivalent to Prometheus' $PROMETHEUS_MULTIPROC_DIR env var. | MLServer's current working directory (i.e. $PWD) |
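As an illustrative sketch, overriding some of these defaults through settings.json could look like the following (the values themselves are arbitrary examples):

```json
{
  "metrics_endpoint": "/metrics",
  "metrics_port": 9090,
  "metrics_dir": "/tmp/mlserver-metrics"
}
```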
For this example, we will create a custom environment to serve a model trained with an older version of Scikit-Learn. The first step will be to define this environment, using an environment.yml.
Note that these environments can also be created on the fly as we go, and then serialised later.
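A minimal sketch of what such an environment.yml could look like is shown below; the environment name and the pinned versions are assumptions chosen purely for illustration:

```yaml
name: old-sklearn
channels:
  - conda-forge
dependencies:
  - python=3.8
  # An intentionally older Scikit-Learn release (illustrative version pin)
  - scikit-learn=0.24.2
  - joblib
  - requests
```

Depending on your setup, the environment may also need mlserver itself (e.g. installed through pip) so that the inference workers can run inside it.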
To illustrate the point, we will train a Scikit-Learn model using our older environment.
The first step will be to create and activate an environment which reflects what's outlined in our environment.yml file.
NOTE: If you are running this from a Jupyter Notebook, you will need to restart your Jupyter instance so that it runs from this environment.
We can now train and save a Scikit-Learn model using the older version of our environment. This model will be serialised as model.joblib.
You can find more details of this process in the Scikit-Learn example.
Lastly, we will need to serialise our environment in the format expected by MLServer. To do that, we will use a tool called conda-pack.
This tool will save a portable version of our environment as a .tar.gz file, also known as a tarball.
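Assuming the environment created above is called old-sklearn, the packing step could be sketched as:

```bash
# Produce a portable tarball of the conda environment
conda pack -n old-sklearn -o old-sklearn.tar.gz
```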
Now that we have defined our environment (and we've got a sample artifact trained in that environment), we can move to serving our model.
To do that, we will first need to select the right runtime through a model-settings.json config file.
We can then spin up our model, using our custom environment, leveraging MLServer's Docker image. Keep in mind that you will need Docker installed in your machine to run this example.
Our Docker command will need to take into account the following points:
Mount the example's folder as a volume so that it can be accessed from within the container.
Let MLServer know that our custom environment's tarball can be found as old-sklearn.tar.gz.
Expose port 8080 so that we can send requests from the outside.
From the command line, this can be done using Docker's CLI as:
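A hedged sketch of such a command is shown below. The image tag, the mount point and, in particular, the environment variable used to point MLServer at the tarball are assumptions for illustration; check the MLServer documentation for the exact values supported by your version:

```bash
# Mount the example folder, point MLServer at the environment tarball
# (variable name assumed) and expose port 8080
docker run -it --rm \
  -v "$PWD":/mnt/models \
  -e "MLSERVER_ENV_TARBALL=/mnt/models/old-sklearn.tar.gz" \
  -p 8080:8080 \
  seldonio/mlserver:latest-slim
```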
Note that we need to keep the server running in the background while we send requests. Therefore, it's best to run this command on a separate terminal session.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of box, or we can build our request manually.
Install the `alibi` library for dataset dependencies and the `alibi_detect` library for detector configuration from PyPI:
```python
!pip install alibi alibi_detect
```

import alibi
import matplotlib.pyplot as plt
import numpy as np

adult = alibi.datasets.fetch_adult()
X, y = adult.data, adult.target
feature_names = adult.feature_names
category_map = adult.category_map

n_ref = 10000
n_test = 10000
X_ref, X_t0, X_t1 = X[:n_ref], X[n_ref:n_ref + n_test], X[n_ref + n_test:n_ref + 2 * n_test]
categories_per_feature = {f: None for f in list(category_map.keys())}

from alibi_detect.cd import TabularDrift
cd_tabular = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)

from alibi_detect.utils.saving import save_detector
filepath = "alibi-detector-artifacts"
save_detector(cd_tabular, filepath)

preds = cd_tabular.predict(X_t0, drift_type="feature")
labels = ['No!', 'Yes!']
print(f"Threshold {preds['data']['threshold']}")
for f in range(cd_tabular.n_features):
fname = feature_names[f]
is_drift = (preds['data']['p_val'][f] < preds['data']['threshold']).astype(int)
stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- Chi2 {stat_val:.3f} -- p-value {p_val:.3f}')

Threshold 0.05
Age -- Drift? No! -- Chi2 0.012 -- p-value 0.508
Workclass -- Drift? No! -- Chi2 8.487 -- p-value 0.387
Education -- Drift? No! -- Chi2 4.753 -- p-value 0.576
Marital Status -- Drift? No! -- Chi2 3.160 -- p-value 0.368
Occupation -- Drift? No! -- Chi2 8.194 -- p-value 0.415
Relationship -- Drift? No! -- Chi2 0.485 -- p-value 0.993
Race -- Drift? No! -- Chi2 0.587 -- p-value 0.965
Sex -- Drift? No! -- Chi2 0.217 -- p-value 0.641
Capital Gain -- Drift? No! -- Chi2 0.002 -- p-value 1.000
Capital Loss -- Drift? No! -- Chi2 0.002 -- p-value 1.000
Hours per week -- Drift? No! -- Chi2 0.012 -- p-value 0.508
Country -- Drift? No! -- Chi2 9.991 -- p-value 0.441

%%writefile settings.json
{
"debug": "true"
}

Overwriting settings.json

%%writefile model-settings.json
{
"name": "income-tabular-drift",
"implementation": "mlserver_alibi_detect.AlibiDetectRuntime",
"parameters": {
"uri": "./alibi-detector-artifacts",
"version": "v0.1.0",
"extra": {
"predict_parameters":{
"drift_type": "feature"
}
}
}
}

Overwriting model-settings.json

mlserver start .

import requests
inference_request = {
"inputs": [
{
"name": "predict",
"shape": X_t0.shape,
"datatype": "FP32",
"data": X_t0.tolist(),
}
]
}
endpoint = "http://localhost:8080/v2/models/income-tabular-drift/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)

import json
response_dict = json.loads(response.text)
labels = ['No!', 'Yes!']
for f in range(cd_tabular.n_features):
stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
fname = feature_names[f]
is_drift = response_dict['outputs'][0]['data'][f]
stat_val, p_val = response_dict['outputs'][1]['data'][f], response_dict['outputs'][2]['data'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- Chi2 {stat_val:.3f} -- p-value {p_val:.3f}')

Age -- Drift? No! -- Chi2 0.012 -- p-value 0.508
Workclass -- Drift? No! -- Chi2 8.487 -- p-value 0.387
Education -- Drift? No! -- Chi2 4.753 -- p-value 0.576
Marital Status -- Drift? No! -- Chi2 3.160 -- p-value 0.368
Occupation -- Drift? No! -- Chi2 8.194 -- p-value 0.415
Relationship -- Drift? No! -- Chi2 0.485 -- p-value 0.993
Race -- Drift? No! -- Chi2 0.587 -- p-value 0.965
Sex -- Drift? No! -- Chi2 0.217 -- p-value 0.641
Capital Gain -- Drift? No! -- Chi2 0.002 -- p-value 1.000
Capital Loss -- Drift? No! -- Chi2 0.002 -- p-value 1.000
Hours per week -- Drift? No! -- Chi2 0.012 -- p-value 0.508
Country -- Drift? No! -- Chi2 9.991 -- p-value 0.441

By default, the runtime will look for a file called `model.[json | ubj | bst]`.
However, this can be modified through the `parameters.uri` field of your
{class}`ModelSettings <mlserver.settings.ModelSettings>` config (see the
section on [Model Settings](../../docs/reference/model-settings.md) for more
details).
```{code-block} json
---
emphasize-lines: 3-5
---
{
"name": "foo",
"parameters": {
"uri": "./my-own-model-filename.json"
}
}
```

```{code-block} json
---
emphasize-lines: 10-12
---
{
"inputs": [
{
"name": "my-input",
"datatype": "INT32",
"shape": [2, 2],
"data": [1, 2, 3, 4]
}
],
"outputs": [
{ "name": "predict_proba" }
]
}
```

# Original code and extra details can be found in:
# https://xgboost.readthedocs.io/en/latest/get_started.html#python
import os
import xgboost as xgb
import requests
from urllib.parse import urlparse
from sklearn.datasets import load_svmlight_file
TRAIN_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.train'
TEST_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.test'
def _download_file(url: str) -> str:
parsed = urlparse(url)
file_name = os.path.basename(parsed.path)
file_path = os.path.join(os.getcwd(), file_name)
res = requests.get(url)
with open(file_path, 'wb') as file:
file.write(res.content)
return file_path
train_dataset_path = _download_file(TRAIN_DATASET_URL)
test_dataset_path = _download_file(TEST_DATASET_URL)
# NOTE: Workaround to load SVMLight files from the XGBoost example
X_train, y_train = load_svmlight_file(train_dataset_path)
X_test, y_test = load_svmlight_file(test_dataset_path)
X_train = X_train.toarray()
X_test = X_test.toarray()
# read in data
dtrain = xgb.DMatrix(data=X_train, label=y_train)
# specify parameters via map
param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)
bst
model_file_name = 'mushroom-xgboost.json'
bst.save_model(model_file_name)%%writefile settings.json
{
"debug": "true"
}%%writefile model-settings.json
{
"name": "mushroom-xgboost",
"implementation": "mlserver_xgboost.XGBoostModel",
"parameters": {
"uri": "./mushroom-xgboost.json",
"version": "v0.1.0"
}
}
mlserver start .
import requests
x_0 = X_test[0:1]
inference_request = {
"inputs": [
{
"name": "predict",
"shape": x_0.shape,
"datatype": "FP32",
"data": x_0.tolist()
}
]
}
endpoint = "http://localhost:8080/v2/models/mushroom-xgboost/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)
response.json()
y_test[0]
import mlserver
from mlserver.types import InferenceRequest, InferenceResponse
class MyCustomRuntime(mlserver.MLModel):
async def load(self) -> bool:
self._model = load_my_custom_model()
mlserver.register("my_custom_metric", "This is a custom metric example")
return True
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
mlserver.log(my_custom_metric=34)
# TODO: Replace with custom logic to run inference
return self._model.predict(payload)
# Original source code and more details can be found in:
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
# The digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# Split data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
data, digits.target, test_size=0.5, shuffle=False)
# We learn the digits on the first half of the digits
classifier.fit(X_train, y_train)
import joblib
model_file_name = "mnist-svm.joblib"
joblib.dump(classifier, model_file_name)
%%writefile settings.json
{
"debug": "true"
}
%%writefile model-settings.json
{
"name": "mnist-svm",
"implementation": "mlserver_sklearn.SKLearnModel",
"parameters": {
"uri": "./mnist-svm.joblib",
"version": "v0.1.0"
}
}
mlserver start .
import requests
x_0 = X_test[0:1]
inference_request = {
"inputs": [
{
"name": "predict",
"shape": x_0.shape,
"datatype": "FP32",
"data": x_0.tolist()
}
]
}
endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)
response.json()
y_test[0]
%%writefile environment.yml
name: old-sklearn
channels:
- conda-forge
dependencies:
- python == 3.8
- scikit-learn == 0.24.2
- joblib == 0.17.0
- requests
- pip
- pip:
- mlserver == 1.1.0
    - mlserver-sklearn == 1.1.0
!conda env create --force -f environment.yml
!conda activate old-sklearn
# Original source code and more details can be found in:
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
# The digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# Split data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
data, digits.target, test_size=0.5, shuffle=False)
# We learn the digits on the first half of the digits
classifier.fit(X_train, y_train)
import joblib
model_file_name = "model.joblib"
joblib.dump(classifier, model_file_name)
!conda pack --force -n old-sklearn -o old-sklearn.tar.gz
%%writefile model-settings.json
{
"name": "mnist-svm",
"implementation": "mlserver_sklearn.SKLearnModel"
}
docker run -it --rm \
-v "$PWD":/mnt/models \
-e "MLSERVER_ENV_TARBALL=/mnt/models/old-sklearn.tar.gz" \
-p 8080:8080 \
    seldonio/mlserver:1.1.0-slim
import requests
x_0 = X_test[0:1]
inference_request = {
"inputs": [
{
"name": "predict",
"shape": x_0.shape,
"datatype": "FP32",
"data": x_0.tolist()
}
]
}
endpoint = "http://localhost:8080/v2/models/mnist-svm/infer"
response = requests.post(endpoint, json=inference_request)
response.json()
Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance just by pointing to where your model artifact is stored and specifying which ML framework was used to train it.
To let Seldon Core know what framework was used to train your model, you can use the implementation field of your SeldonDeployment manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:
As you can see highlighted above, all that we need to specify is that:
Our inference deployment should use the V2 inference protocol, which is done by setting the protocol field to v2.
Our model artifact is a serialised Scikit-Learn model, therefore it should be served using the MLServer SKLearn runtime, which is done by setting the implementation field to SKLEARN_SERVER.
Note that, while the protocol should always be set to v2 (i.e. so that models are served using the V2 inference protocol), the value of the implementation field will be dependent on your ML framework. The valid values of the implementation field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom MLServer runtime).
Once you have your SeldonDeployment manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through kubectl, by running:
To consult the supported values of the implementation field where MLServer is used, you can check the support table below.
As mentioned above, pre-packaged servers come built into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.
The table below shows a list of the currently supported values of the implementation field. Each row also shows which ML framework it corresponds to and which MLServer runtime will be enabled internally on your model deployment when it's used.
| Framework | implementation value |
| --- | --- |
| Scikit-Learn | SKLEARN_SERVER |
| XGBoost | XGBOOST_SERVER |
| MLflow | MLFLOW_SERVER |
Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.
There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then be packaged up as images. These images then become self-contained model servers with your custom runtime, and Seldon Core makes it just as easy to deploy them into your serving infrastructure.
The componentSpecs field of the SeldonDeployment manifest will allow us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write our SeldonDeployment manifest as follows:
As we can see highlighted in the snippet above, all that's needed to deploy a custom MLServer image is:
Letting Seldon Core know that the model deployment will be served through the V2 inference protocol, by setting the protocol field to v2.
Pointing our model container to our custom MLServer image, by specifying it in the image field of the componentSpecs section of the manifest.
Once you have your SeldonDeployment manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through kubectl, by running:
The mlserver package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs.
In this example, we create a simple Identity Text Model which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs) where streaming is an essential feature to hide the latency of the model.
The next step will be to serve our model using mlserver. For that, we will first implement an extension that serves as the runtime to perform inference using our custom TextModel.
This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following:
split the text into words using the white space as the delimiter.
wait 0.5 seconds between each word to simulate a slow model.
return each word one by one.
As can be seen, the predict_stream method receives an AsyncIterator of InferenceRequest as input and returns an AsyncIterator of InferenceResponse. This definition covers all possible input-output combinations for streaming: unary-stream, stream-unary and stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses, which should be known a priori.
Note that although unary-unary could also be covered by the predict_stream method, mlserver already covers that case through the predict method.
One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. gRPC API, on the other hand, supports all types of streaming listed above.
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that there are currently a few main limitations of the streaming support in MLServer:
distributed workers are not supported (i.e., the parallel_workers setting should be set to 0)
gzip middleware is not supported for REST (i.e., gzip_enabled setting should be set to false)
Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
To test our model, we will use the following inference request:
To send a REST streaming request to the server, we will use the following Python code:
To send a gRPC streaming request to the server, we will use the following Python code:
Note that for gRPC, the request is transformed into an async generator which is then passed to the ModelStreamInfer method. The response is also an async generator which can be iterated over to get the response.
The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we create a simple Hello World JSON model that parses and modifies a JSON data chunk. This is often useful as a means to quickly bootstrap existing models that utilize JSON based model inputs.
The next step will be to serve our model using mlserver. For that, we will first implement an extension which serves as the runtime to perform inference using our custom Hello World JSON model.
This is a trivial model to demonstrate how to conceptually work with JSON inputs / outputs. In this example:
Parse the JSON input from the client
Create a JSON response echoing back the client request as well as a server generated message
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of box, or we can build our request manually.
Utilizing string data with the gRPC interface can be a bit tricky, so we need to ensure that inputs and outputs are encoded and decoded correctly.
For simplicity in this case, we leverage the Python types that mlserver provides out of the box. Alternatively, the gRPC stubs can be regenerated from the V2 specification directly, for use by non-Python as well as Python clients.
An open source inference server for your machine learning models.
MLServer aims to provide an easy way to start serving your machine learning models through a REST and gRPC interface, fully compliant with the V2 Inference Protocol spec. Watch a quick video introducing the project.
Multi-model serving, letting users run multiple models within the same process.
MLServer has been built with multi-model serving (MMS) in mind. This means that, within a single instance of MLServer, you can serve multiple models under different paths. This also includes multiple versions of the same model.
This notebook shows an example of how you can leverage MMS with MLServer.
We will first start by training 2 different models:
Out of the box, MLServer provides support to receive inference requests from Kafka. The Kafka server can run side-by-side with the REST and gRPC ones, and adds a new interface to interact with your model. The inference responses coming back from your model will also get written back to their own output topic.
In this example, we will showcase the integration with Kafka by serving a model through Kafka.
We are going to start by running a simple local Kafka deployment that we can test against. This will be a minimal setup consisting of a single broker running in KRaft mode (so no separate ZooKeeper node is required).
You need to have Java installed in order for it to work correctly.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
name: my-model
spec:
protocol: v2
predictors:
- name: default
graph:
name: classifier
implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
kubectl apply -f my-seldondeployment-manifest.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
name: my-model
spec:
protocol: v2
predictors:
- name: default
graph:
name: classifier
componentSpecs:
- spec:
containers:
- name: classifier
          image: my-custom-server:0.1.0
kubectl apply -f my-seldondeployment-manifest.yaml
Evaluate whether the codec can encode (decode) the payload.
Decode a request input into a high-level Python type.
Decode a response output into a high-level Python type.
Encode the given payload into a RequestInput.
Encode the given payload into a response output.
Exception.add_note(note) -- add a note to the exception
Exception.with_traceback(tb) -- set self.__traceback__ to tb and return self.
Codec that converts to / from a datetime input.
Evaluate whether the codec can encode (decode) the payload.
Decode a request input into a high-level Python type.
Decode a response output into a high-level Python type.
Encode the given payload into a RequestInput.
Encode the given payload into a response output.
The InputCodec interface lets you define type conversions of your raw input data to / from the Open Inference Protocol. Note that this codec applies at the individual input (output) level.
For request-wide transformations (e.g. dataframes), use the RequestCodec interface instead.
Evaluate whether the codec can encode (decode) the payload.
Decode a request input into a high-level Python type.
Decode a response output into a high-level Python type.
Encode the given payload into a RequestInput.
Encode the given payload into a response output.
Decodes a request input (response output) as a NumPy array.
Evaluate whether the codec can encode (decode) the payload.
Decode a request input into a high-level Python type.
Decode a response output into a high-level Python type.
Encode the given payload into a RequestInput.
Encode the given payload into a response output.
Decodes the first input (output) of request (response) as a NumPy array. This codec can be useful for cases where the whole payload is a single NumPy tensor.
Evaluate whether the codec can encode (decode) the payload.
Decode an inference request into a high-level Python object.
Decode an inference response into a high-level Python object.
Encode the given payload into an inference request.
Encode the given payload into an inference response.
Decodes a request (response) into a Pandas DataFrame, assuming each input (output) head corresponds to a column of the DataFrame.
Evaluate whether the codec can encode (decode) the payload.
Decode an inference request into a high-level Python object.
Decode an inference response into a high-level Python object.
Encode the given payload into an inference request.
Encode the given payload into an inference response.
The RequestCodec interface lets you define request-level conversions between high-level Python types and the Open Inference Protocol. This can be useful where the encoding of your payload encompasses multiple input heads (e.g. dataframes, where each column can be thought of as a separate input head).
For individual input-level encoding / decoding, use the InputCodec interface instead.
Evaluate whether the codec can encode (decode) the payload.
Decode an inference request into a high-level Python object.
Decode an inference response into a high-level Python object.
Encode the given payload into an inference request.
Encode the given payload into an inference response.
Encodes a list of Python strings as a BYTES input (output).
Evaluate whether the codec can encode (decode) the payload.
Decode a request input into a high-level Python type.
Decode a response output into a high-level Python type.
Encode the given payload into a RequestInput.
Encode the given payload into a response output.
Decodes the first input (output) of request (response) as a list of strings. This codec can be useful for cases where the whole payload is a single list of strings.
Evaluate whether the codec can encode (decode) the payload.
Decode an inference request into a high-level Python object.
Decode an inference response into a high-level Python object.
Encode the given payload into an inference request.
Encode the given payload into an inference response.
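As a rough, self-contained sketch of how these codecs are typically used (the example data is ours, not taken from the reference above), the snippet below encodes a NumPy array with the input-level NumpyCodec and a Pandas DataFrame with the request-level PandasCodec, then decodes them back:

```python
import numpy as np
import pandas as pd

from mlserver.codecs import NumpyCodec, PandasCodec

# Input-level codec: encode a single tensor into a RequestInput
arr = np.array([[1, 2], [3, 4]], dtype=np.int32)
request_input = NumpyCodec.encode_input(name="my-input", payload=arr)
decoded_arr = NumpyCodec.decode_input(request_input)

# Request-level codec: encode a whole DataFrame into an InferenceRequest,
# where each column becomes a separate input head
df = pd.DataFrame({"a": [1, 2], "b": ["hello", "world"]})
inference_request = PandasCodec.encode_request(df)
decoded_df = PandasCodec.decode_request(inference_request)

print(decoded_arr)
print(decoded_df)
```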
Now you can just run it with the following command in a separate terminal:
Now we can create the required input and output topics:
The first step will be to train a simple scikit-learn model. For that, we will use the MNIST example from the scikit-learn documentation which trains an SVM model.
To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.
Our model will be persisted as a file named mnist-svm.joblib
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the settings.json file will contain our Kafka configuration, such as enabling the Kafka integration and, optionally, the address of the Kafka broker and the input / output topics that will be used for inference.
Now that we have our config in place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of box, or we can build our request manually.
Now that we have verified that our server is accepting REST requests, we will try to send a new inference request through Kafka. For this, we just need to send a request to the mlserver-input topic (which is the default input topic):
Once the message has gone into the queue, the Kafka server running within MLServer should receive this message and run inference. The prediction output should then get posted into an output queue, which will be named mlserver-output by default.
As we should now be able to see above, the results of our inference request should now be visible in the output Kafka queue.
%%writefile text_model.py
import asyncio
from typing import AsyncIterator
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.codecs import StringCodec
class TextModel(MLModel):
async def predict_stream(
self, payloads: AsyncIterator[InferenceRequest]
) -> AsyncIterator[InferenceResponse]:
payload = [_ async for _ in payloads][0]
text = StringCodec.decode_input(payload.inputs[0])[0]
words = text.split(" ")
split_text = []
for i, word in enumerate(words):
split_text.append(word if i == 0 else " " + word)
for word in split_text:
await asyncio.sleep(0.5)
yield InferenceResponse(
model_name=self._settings.name,
outputs=[
StringCodec.encode_output(
name="output",
payload=[word],
use_bytes=True,
),
],
)
%%writefile settings.json
{
"debug": false,
"parallel_workers": 0,
"gzip_enabled": false
}
%%writefile model-settings.json
{
"name": "text-model",
"implementation": "text_model.TextModel",
"versions": ["text-model/v1.2.3"],
"platform": "mlserver",
"inputs": [
{
"datatype": "BYTES",
"name": "prompt",
"shape": [1]
}
],
"outputs": [
{
"datatype": "BYTES",
"name": "output",
"shape": [1]
}
]
}
mlserver start .
%%writefile generate-request.json
{
"inputs": [
{
"name": "prompt",
"shape": [1],
"datatype": "BYTES",
"data": ["What is the capital of France?"],
"parameters": {
"content_type": "str"
}
}
],
"outputs": [
{
"name": "output"
}
]
}
import httpx
from httpx_sse import connect_sse
from mlserver import types
from mlserver.codecs import StringCodec
inference_request = types.InferenceRequest.parse_file("./generate-request.json")
with httpx.Client() as client:
with connect_sse(client, "POST", "http://localhost:8080/v2/models/text-model/generate_stream", json=inference_request.dict()) as event_source:
for sse in event_source.iter_sse():
response = types.InferenceResponse.parse_raw(sse.data)
print(StringCodec.decode_output(response.outputs[0]))
import grpc
import mlserver.types as types
from mlserver.codecs import StringCodec
from mlserver.grpc.converters import ModelInferResponseConverter
import mlserver.grpc.converters as converters
import mlserver.grpc.dataplane_pb2_grpc as dataplane
inference_request = types.InferenceRequest.parse_file("./generate-request.json")
# need to convert from string to bytes for grpc
inference_request.inputs[0] = StringCodec.encode_input("prompt", inference_request.inputs[0].data.root)
inference_request_g = converters.ModelInferRequestConverter.from_types(
inference_request, model_name="text-model", model_version=None
)
async def get_inference_request_stream(inference_request):
yield inference_request
async with grpc.aio.insecure_channel("localhost:8081") as grpc_channel:
grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
inference_request_stream = get_inference_request_stream(inference_request_g)
async for response in grpc_stub.ModelStreamInfer(inference_request_stream):
response = ModelInferResponseConverter.to_types(response)
        print(StringCodec.decode_output(response.outputs[0]))
%%writefile jsonmodels.py
import json
from typing import Dict, Any
from mlserver import MLModel, types
from mlserver.codecs import StringCodec
class JsonHelloWorldModel(MLModel):
async def load(self) -> bool:
# Perform additional custom initialization here.
print("Initialize model")
# Set readiness flag for model
return await super().load()
async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
request = self._extract_json(payload)
response = {
"request": request,
"server_response": "Got your request. Hello from the server.",
}
response_bytes = json.dumps(response).encode("UTF-8")
return types.InferenceResponse(
id=payload.id,
model_name=self.name,
model_version=self.version,
outputs=[
types.ResponseOutput(
name="echo_response",
shape=[len(response_bytes)],
datatype="BYTES",
data=[response_bytes],
parameters=types.Parameters(content_type="str"),
)
],
)
def _extract_json(self, payload: types.InferenceRequest) -> Dict[str, Any]:
inputs = {}
for inp in payload.inputs:
inputs[inp.name] = json.loads(
"".join(self.decode(inp, default_codec=StringCodec))
)
return inputs
%%writefile settings.json
{
"debug": "true"
}
%%writefile model-settings.json
{
"name": "json-hello-world",
"implementation": "jsonmodels.JsonHelloWorldModel"
}
mlserver start .
import requests
import json
from mlserver.types import InferenceResponse
from mlserver.codecs.string import StringRequestCodec
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=1)
inputs = {"name": "Foo Bar", "message": "Hello from Client (REST)!"}
# NOTE: this uses characters rather than encoded bytes. It is recommended that you use the `mlserver` types to assist in the correct encoding.
inputs_string = json.dumps(inputs)
inference_request = {
"inputs": [
{
"name": "echo_request",
"shape": [len(inputs_string)],
"datatype": "BYTES",
"data": [inputs_string],
}
]
}
endpoint = "http://localhost:8080/v2/models/json-hello-world/infer"
response = requests.post(endpoint, json=inference_request)
print(f"full response:\n")
print(response)
# retrieve text output as dictionary
inference_response = InferenceResponse.parse_raw(response.text)
raw_json = StringRequestCodec.decode_response(inference_response)
output = json.loads(raw_json[0])
print(f"\ndata part:\n")
pp.pprint(output)
import requests
import json
import grpc
from mlserver.codecs.string import StringRequestCodec
import mlserver.grpc.converters as converters
import mlserver.grpc.dataplane_pb2_grpc as dataplane
import mlserver.types as types
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=1)
model_name = "json-hello-world"
inputs = {"name": "Foo Bar", "message": "Hello from Client (gRPC)!"}
inputs_bytes = json.dumps(inputs).encode("UTF-8")
inference_request = types.InferenceRequest(
inputs=[
types.RequestInput(
name="echo_request",
shape=[len(inputs_bytes)],
datatype="BYTES",
data=[inputs_bytes],
parameters=types.Parameters(content_type="str"),
)
]
)
inference_request_g = converters.ModelInferRequestConverter.from_types(
inference_request, model_name=model_name, model_version=None
)
grpc_channel = grpc.insecure_channel("localhost:8081")
grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
response = grpc_stub.ModelInfer(inference_request_g)
print(f"full response:\n")
print(response)
# retrieve text output as dictionary
inference_response = converters.ModelInferResponseConverter.to_types(response)
raw_json = StringRequestCodec.decode_response(inference_response)
output = json.loads(raw_json[0])
print(f"\ndata part:\n")
pp.pprint(output)can_encode(payload: Any) -> booldecode_input(request_input: RequestInput) -> List[bytes]decode_output(response_output: ResponseOutput) -> List[bytes]encode_input(name: str, payload: List[bytes], use_bytes: bool = True, kwargs) -> RequestInputencode_output(name: str, payload: List[bytes], use_bytes: bool = True, kwargs) -> ResponseOutputadd_note(...)with_traceback(...)can_encode(payload: Any) -> booldecode_input(request_input: RequestInput) -> List[datetime]decode_output(response_output: ResponseOutput) -> List[datetime]encode_input(name: str, payload: List[Union[str, datetime]], use_bytes: bool = True, kwargs) -> RequestInputencode_output(name: str, payload: List[Union[str, datetime]], use_bytes: bool = True, kwargs) -> ResponseOutputcan_encode(payload: Any) -> booldecode_input(request_input: RequestInput) -> Anydecode_output(response_output: ResponseOutput) -> Anyencode_input(name: str, payload: Any, kwargs) -> RequestInputencode_output(name: str, payload: Any, kwargs) -> ResponseOutputcan_encode(payload: Any) -> booldecode_input(request_input: RequestInput) -> ndarraydecode_output(response_output: ResponseOutput) -> ndarrayencode_input(name: str, payload: ndarray, kwargs) -> RequestInputencode_output(name: str, payload: ndarray, kwargs) -> ResponseOutputcan_encode(payload: Any) -> booldecode_request(request: InferenceRequest) -> Anydecode_response(response: InferenceResponse) -> Anyencode_request(payload: Any, kwargs) -> InferenceRequestencode_response(model_name: str, payload: Any, model_version: Optional[str] = None, kwargs) -> InferenceResponsecan_encode(payload: Any) -> booldecode_request(request: InferenceRequest) -> DataFramedecode_response(response: InferenceResponse) -> DataFrameencode_outputs(payload: DataFrame, use_bytes: bool = True) -> List[ResponseOutput]encode_request(payload: DataFrame, use_bytes: bool = True, kwargs) -> InferenceRequestencode_response(model_name: str, payload: DataFrame, model_version: Optional[str] = None, use_bytes: bool = True, kwargs) -> InferenceResponsecan_encode(payload: Any) -> booldecode_request(request: InferenceRequest) -> Anydecode_response(response: InferenceResponse) -> Anyencode_request(payload: Any, kwargs) -> InferenceRequestencode_response(model_name: str, payload: Any, model_version: Optional[str] = None, kwargs) -> InferenceResponsecan_encode(payload: Any) -> booldecode_input(request_input: RequestInput) -> List[str]decode_output(response_output: ResponseOutput) -> List[str]encode_input(name: str, payload: List[str], use_bytes: bool = True, kwargs) -> RequestInputencode_output(name: str, payload: List[str], use_bytes: bool = True, kwargs) -> ResponseOutputcan_encode(payload: Any) -> booldecode_request(request: InferenceRequest) -> Anydecode_response(response: InferenceResponse) -> Anyencode_request(payload: Any, kwargs) -> InferenceRequestencode_response(model_name: str, payload: Any, model_version: Optional[str] = None, kwargs) -> InferenceResponsedecode_args(predict: Callable) -> Callable[[ForwardRef('MLModel'), <class 'mlserver.types.dataplane.InferenceRequest'>], Coroutine[Any, Any, InferenceResponse]]decode_inference_request(inference_request: InferenceRequest, model_settings: Optional[ModelSettings] = None, metadata_inputs: Dict[str, MetadataTensor] = {}) -> Optional[Any]decode_request_input(request_input: RequestInput, metadata_inputs: Dict[str, MetadataTensor] = {}) -> Optional[Any]encode_inference_response(payload: Any, model_settings: ModelSettings) -> 
Optional[InferenceResponse]encode_response_output(payload: Any, request_output: RequestOutput, metadata_outputs: Dict[str, MetadataTensor] = {}) -> Optional[ResponseOutput]get_decoded(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> Anyget_decoded_or_raw(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> Anyhas_decoded(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> boolregister_input_codec(CodecKlass: Union[type[InputCodec], InputCodec])register_request_codec(CodecKlass: Union[type[RequestCodec], RequestCodec])!wget https://apache.mirrors.nublue.co.uk/kafka/2.8.0/kafka_2.12-2.8.0.tgz
!tar -zxvf kafka_2.12-2.8.0.tgz
!./kafka_2.12-2.8.0/bin/kafka-storage.sh format -t OXn8RTSlQdmxwjhKnSB_6A -c ./kafka_2.12-2.8.0/config/kraft/server.properties
!./kafka_2.12-2.8.0/bin/kafka-server-start.sh ./kafka_2.12-2.8.0/config/kraft/server.properties
!./kafka_2.12-2.8.0/bin/kafka-topics.sh --create --topic mlserver-input --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
!./kafka_2.12-2.8.0/bin/kafka-topics.sh --create --topic mlserver-output --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
# Original source code and more details can be found in:
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
# The digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# Split data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
data, digits.target, test_size=0.5, shuffle=False)
# We learn the digits on the first half of the digits
classifier.fit(X_train, y_train)
import joblib
model_file_name = "mnist-svm.joblib"
joblib.dump(classifier, model_file_name)
%%writefile settings.json
{
"debug": "true",
"kafka_enabled": "true"
}
%%writefile model-settings.json
{
"name": "mnist-svm",
"implementation": "mlserver_sklearn.SKLearnModel",
"parameters": {
"uri": "./mnist-svm.joblib",
"version": "v0.1.0"
}
}
mlserver start .
import requests
x_0 = X_test[0:1]
inference_request = {
"inputs": [
{
"name": "predict",
"shape": x_0.shape,
"datatype": "FP32",
"data": x_0.tolist()
}
]
}
endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)
response.json()
import json
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers="localhost:9092")
headers = {
"mlserver-model": b"mnist-svm",
"mlserver-version": b"v0.1.0",
}
producer.send(
"mlserver-input",
json.dumps(inference_request).encode("utf-8"),
    headers=list(headers.items()))
from kafka import KafkaConsumer
consumer = KafkaConsumer(
"mlserver-output",
bootstrap_servers="localhost:9092",
auto_offset_reset="earliest")
for msg in consumer:
print(f"key: {msg.key}")
print(f"value: {msg.value}\n")
    break
Ability to run inference in parallel for vertical scaling across multiple models through a pool of inference workers.
Support for adaptive batching, to group inference requests together on the fly.
Scalability with deployment in Kubernetes native frameworks, including Seldon Core and KServe (formerly known as KFServing), where MLServer is the core Python inference server used to serve machine learning models.
Support for the standard V2 Inference Protocol on both the gRPC and REST flavours, which has been standardised and adopted by various model serving frameworks.
You can read more about the goals of this project on the initial design document.
You can install the mlserver package running:
Note that to use any of the optional inference runtimes, you'll need to install the relevant package. For example, to serve a scikit-learn model, you would need to install the mlserver-sklearn package:
For further information on how to use MLServer, you can check any of the available examples.
Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. You can read more about inference runtimes in their documentation page.
Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common frameworks. This allows you to start serving models saved in these frameworks straight away. However, it's also possible to write custom runtimes.
Out of the box, MLServer provides support for:
| Framework | Supported |
| --- | --- |
| Scikit-Learn | ✅ |
| XGBoost | ✅ |
| Spark MLlib | ✅ |
| LightGBM | ✅ |

🔴 Unsupported · 🟠 Deprecated: To be removed in a future version · 🟢 Supported · 🔵 Untested

| Python version | Status |
| --- | --- |
| 3.7 | 🔴 |
| 3.8 | 🔴 |
| 3.9 | 🟢 |
| 3.10 | 🟢 |
| 3.11 | 🟢 |
| 3.12 | 🟢 |
To see MLServer in action, check out our full list of examples. You can find below a few selected examples showcasing how you can leverage MLServer to start serving your machine learning models.
Both the main mlserver package and the inference runtimes packages try to follow the same versioning schema. To bump the version across all of them, you can use the ./hack/update-version.sh script.
We generally keep the version as a placeholder for an upcoming version.
For example:
To run all of the tests for MLServer and the runtimes, use:
To run tests for a single file, use something like:
| Name | Framework | Model artifact |
| --- | --- | --- |
| mnist-svm | scikit-learn | ./models/mnist-svm/model.joblib |
| mushroom-xgboost | xgboost | ./models/mushroom-xgboost/model.json |
The next step will be serving both our models within the same MLServer instance. For that, we will just need to create a model-settings.json file local to each of our models and a server-wide settings.json. That is,
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
models/mnist-svm/model-settings.json: holds the configuration specific to our mnist-svm model (e.g. input type, runtime to use, etc.).
models/mushroom-xgboost/model-settings.json: holds the configuration specific to our mushroom-xgboost model (e.g. input type, runtime to use, etc.).
Now that we have our config in place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
By this point, we should have both our models getting served by MLServer. To make sure that everything is working as expected, let's send a request from each test set.
For that, we can use the Python types that the mlserver package provides out of box, or we can build our request manually.
The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we will train a numpyro model. The numpyro library streamlines the implementation of probabilistic models, abstracting away advanced inference and training algorithms.
Out of the box, mlserver doesn't provide an inference runtime for numpyro. However, through this example we will see how easy it is to develop our own.
The first step will be to train our model. This will be a very simple Bayesian regression model, based on an example provided in the numpyro documentation.
Since this is a probabilistic model, during training we will compute an approximation to the posterior distribution of our model using MCMC.
Now that we have trained our model, the next step will be to save it so that it can be loaded afterwards at serving-time. Note that, since this is a probabilistic model, we will only need to save the traces that approximate the posterior distribution over latent parameters.
This will get saved in a numpyro-divorce.json file.
The next step will be to serve our model using mlserver. For that, we will first implement an extension which serves as the runtime to perform inference using our custom numpyro model.
Our custom inference wrapper should be responsible for:
Loading the model from the set samples we saved previously.
Running inference using our model structure, and the posterior approximated from the samples.
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of box, or we can build our request manually.
Now that we have written and tested our custom model, the next step is to deploy it. With that goal in mind, the rough outline of steps will be to first build a custom image containing our code, and then deploy it.
MLServer will automatically find your requirements.txt file and install the necessary Python packages.
MLServer offers helpers to build a custom Docker image containing your code. In this example, we will use the mlserver build subcommand to create an image, which we'll be able to deploy later.
Note that this section expects that Docker is available and running in the background, as well as a functional cluster with Seldon Core installed and some familiarity with kubectl.
To ensure that the image is fully functional, we can spin up a container and then send a test request. To start the container, you can run something along the following lines in a separate terminal:
As we should be able to see, the server running within our Docker image responds as expected.
Now that we've built a custom image and verified that it works as expected, we can move to the next step and deploy it. There is a large number of tools out there to deploy images. However, for our example, we will focus on deploying it to a cluster running Seldon Core.
For that, we will need to create a SeldonDeployment resource which instructs Seldon Core to deploy a model embedded within our custom image and compliant with the V2 inference protocol. This can be achieved by applying (i.e. kubectl apply) a SeldonDeployment manifest to the cluster, similar to the one below:
There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.
This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see the end-to-end example which walks through the process of writing a custom runtime.
MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes easily. The starting point for this is the MLModel <mlserver.MLModel> abstract class, whose main methods are:
pip install mlserver
pip install mlserver-sklearn
./hack/update-version.sh 0.2.0.dev1
make test
tox -e py3 -- tests/batch_processing/test_rest.py
# Original source code and more details can be found in:
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
# The digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# Split data into train and test subsets
X_train, X_test_digits, y_train, y_test_digits = train_test_split(
data, digits.target, test_size=0.5, shuffle=False)
# We learn the digits on the first half of the digits
classifier.fit(X_train, y_train)
import joblib
import os
mnist_svm_path = os.path.join("models", "mnist-svm")
os.makedirs(mnist_svm_path, exist_ok=True)
mnist_svm_model_path = os.path.join(mnist_svm_path, "model.joblib")
joblib.dump(classifier, mnist_svm_model_path)
# Original code and extra details can be found in:
# https://xgboost.readthedocs.io/en/latest/get_started.html#python
import os
import xgboost as xgb
import requests
from urllib.parse import urlparse
from sklearn.datasets import load_svmlight_file
TRAIN_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.train'
TEST_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.test'
def _download_file(url: str) -> str:
parsed = urlparse(url)
file_name = os.path.basename(parsed.path)
file_path = os.path.join(os.getcwd(), file_name)
res = requests.get(url)
with open(file_path, 'wb') as file:
file.write(res.content)
return file_path
train_dataset_path = _download_file(TRAIN_DATASET_URL)
test_dataset_path = _download_file(TEST_DATASET_URL)
# NOTE: Workaround to load SVMLight files from the XGBoost example
X_train, y_train = load_svmlight_file(train_dataset_path)
X_test_agar, y_test_agar = load_svmlight_file(test_dataset_path)
X_train = X_train.toarray()
X_test_agar = X_test_agar.toarray()
# read in data
dtrain = xgb.DMatrix(data=X_train, label=y_train)
# specify parameters via map
param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)
bst
import os
mushroom_xgboost_path = os.path.join("models", "mushroom-xgboost")
os.makedirs(mushroom_xgboost_path, exist_ok=True)
mushroom_xgboost_model_path = os.path.join(mushroom_xgboost_path, "model.json")
bst.save_model(mushroom_xgboost_model_path)
%%writefile settings.json
{
"debug": "true"
}
%%writefile models/mnist-svm/model-settings.json
{
"name": "mnist-svm",
"implementation": "mlserver_sklearn.SKLearnModel",
"parameters": {
"version": "v0.1.0"
}
}
%%writefile models/mushroom-xgboost/model-settings.json
{
"name": "mushroom-xgboost",
"implementation": "mlserver_xgboost.XGBoostModel",
"parameters": {
"version": "v0.1.0"
}
}
mlserver start .
import requests
x_0 = X_test_digits[0:1]
inference_request = {
"inputs": [
{
"name": "predict",
"shape": x_0.shape,
"datatype": "FP32",
"data": x_0.tolist()
}
]
}
endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)
response.json()
import requests
x_0 = X_test_agar[0:1]
inference_request = {
"inputs": [
{
"name": "predict",
"shape": x_0.shape,
"datatype": "FP32",
"data": x_0.tolist()
}
]
}
endpoint = "http://localhost:8080/v2/models/mushroom-xgboost/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)
response.json()

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| debug | bool | True | - |
| environments_dir | str | '-' | - |
| extensions | List[str] | [] | - |
| grpc_max_message_length | Optional[int] | None | - |
| grpc_port | int | 8081 | - |
| gzip_enabled | bool | True | Enable GZipMiddleware. |
| host | str | '0.0.0.0' | - |
| http_port | int | 8080 | - |
| kafka_enabled | bool | False | Enable Kafka integration for the server. |
| kafka_servers | str | 'localhost:9092' | Comma-separated list of Kafka servers. |
| kafka_topic_input | str | 'mlserver-input' | Kafka topic for input messages. |
| kafka_topic_output | str | 'mlserver-output' | Kafka topic for output messages. |
| load_models_at_startup | bool | True | - |
| logging_settings | Union[str, Dict[Any, Any], None] | None | Path to logging config file or dictionary configuration. |
| metrics_dir | str | '-' | Directory used to share metrics across parallel workers. Equivalent to the PROMETHEUS_MULTIPROC_DIR env var in prometheus-client. Note that this won't be used if the parallel_workers flag is disabled. By default, the .metrics folder of the current working directory will be used. |
| metrics_endpoint | Optional[str] | '/metrics' | Endpoint used to expose Prometheus metrics. Alternatively, can be set to None to disable it. |
| metrics_port | int | 8082 | Port used to expose metrics endpoint. |
| metrics_rest_server_prefix | str | 'rest_server' | Metrics rest server string prefix to be exported. |
| model_repository_implementation | Optional[ImportString] | None | - |
| model_repository_implementation_args | dict | {} | - |
| model_repository_root | str | '.' | - |
| parallel_workers | int | 1 | - |
| parallel_workers_timeout | int | 5 | - |
| root_path | str | '' | - |
| server_name | str | 'mlserver' | - |
| server_version | str | '1.7.0.dev0' | - |
| tracing_server | Optional[str] | None | Server name used to export OpenTelemetry tracing to collector service. |
| use_structured_logging | bool | False | Use JSON-formatted structured logging instead of default format. |
| env_prefix | str | "MLSERVER_" | - |
| env_file | str | ".env" | - |
| protected_namespaces | tuple | () | - |
| cache_enabled | bool | False | Enable caching for the model predictions. |
| cache_size | int | 100 | Cache size to be used if caching is enabled. |
| cors_settings | Optional[CORSSettings] | None | - |
# Original source code and more details can be found in:
# https://nbviewer.jupyter.org/github/pyro-ppl/numpyro/blob/master/notebooks/source/bayesian_regression.ipynb
import numpyro
import numpy as np
import pandas as pd
from numpyro import distributions as dist
from jax import random
from numpyro.infer import MCMC, NUTS
DATASET_URL = "https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/WaffleDivorce.csv"
dset = pd.read_csv(DATASET_URL, sep=";")
standardize = lambda x: (x - x.mean()) / x.std()
dset["AgeScaled"] = dset.MedianAgeMarriage.pipe(standardize)
dset["MarriageScaled"] = dset.Marriage.pipe(standardize)
dset["DivorceScaled"] = dset.Divorce.pipe(standardize)
def model(marriage=None, age=None, divorce=None):
a = numpyro.sample("a", dist.Normal(0.0, 0.2))
M, A = 0.0, 0.0
if marriage is not None:
bM = numpyro.sample("bM", dist.Normal(0.0, 0.5))
M = bM * marriage
if age is not None:
bA = numpyro.sample("bA", dist.Normal(0.0, 0.5))
A = bA * age
sigma = numpyro.sample("sigma", dist.Exponential(1.0))
mu = a + M + A
numpyro.sample("obs", dist.Normal(mu, sigma), obs=divorce)
# Start from this source of randomness. We will split keys for subsequent operations.
rng_key = random.PRNGKey(0)
rng_key, rng_key_ = random.split(rng_key)
num_warmup, num_samples = 1000, 2000
# Run NUTS.
kernel = NUTS(model)
mcmc = MCMC(kernel, num_warmup=num_warmup, num_samples=num_samples)
mcmc.run(
rng_key_, marriage=dset.MarriageScaled.values, divorce=dset.DivorceScaled.values
)
mcmc.print_summary()
import json
samples = mcmc.get_samples()
serialisable = {}
for k, v in samples.items():
serialisable[k] = np.asarray(v).tolist()
model_file_name = "numpyro-divorce.json"
with open(model_file_name, "w") as model_file:
    json.dump(serialisable, model_file)
# %load models.py
import json
import numpyro
import numpy as np
from jax import random
from mlserver import MLModel
from mlserver.codecs import decode_args
from mlserver.utils import get_model_uri
from numpyro.infer import Predictive
from numpyro import distributions as dist
from typing import Optional
class NumpyroModel(MLModel):
async def load(self) -> bool:
model_uri = await get_model_uri(self._settings)
with open(model_uri) as model_file:
raw_samples = json.load(model_file)
self._samples = {}
for k, v in raw_samples.items():
self._samples[k] = np.array(v)
self._predictive = Predictive(self._model, self._samples)
return True
@decode_args
async def predict(
self,
marriage: Optional[np.ndarray] = None,
age: Optional[np.ndarray] = None,
divorce: Optional[np.ndarray] = None,
) -> np.ndarray:
predictions = self._predictive(
rng_key=random.PRNGKey(0), marriage=marriage, age=age, divorce=divorce
)
obs = predictions["obs"]
obs_mean = obs.mean()
return np.asarray(obs_mean)
def _model(self, marriage=None, age=None, divorce=None):
a = numpyro.sample("a", dist.Normal(0.0, 0.2))
M, A = 0.0, 0.0
if marriage is not None:
bM = numpyro.sample("bM", dist.Normal(0.0, 0.5))
M = bM * marriage
if age is not None:
bA = numpyro.sample("bA", dist.Normal(0.0, 0.5))
A = bA * age
sigma = numpyro.sample("sigma", dist.Exponential(1.0))
mu = a + M + A
numpyro.sample("obs", dist.Normal(mu, sigma), obs=divorce)
# %load settings.json
{
"debug": "true"
}
# %load model-settings.json
{
"name": "numpyro-divorce",
"implementation": "models.NumpyroModel",
"parameters": {
"uri": "./numpyro-divorce.json"
}
}
mlserver start .
import requests
import numpy as np
from mlserver.types import InferenceRequest
from mlserver.codecs import NumpyCodec
x_0 = np.array([28.0])
inference_request = InferenceRequest(
inputs=[
NumpyCodec.encode_input(name="marriage", payload=x_0)
]
)
endpoint = "http://localhost:8080/v2/models/numpyro-divorce/infer"
response = requests.post(endpoint, json=inference_request.model_dump())
response.json()
# %load requirements.txt
numpy==1.22.4
numpyro==0.8.0
jax==0.2.24
jaxlib==0.3.7
This section expects that Docker is available and running in the background.
%%bash
mlserver build . -t 'my-custom-numpyro-server:0.1.0'
docker run -it --rm -p 8080:8080 my-custom-numpyro-server:0.1.0
import requests
import numpy as np
from mlserver.types import InferenceRequest
from mlserver.codecs import NumpyCodec
x_0 = np.array([28.0])
inference_request = InferenceRequest(
inputs=[
NumpyCodec.encode_input(name="marriage", payload=x_0)
]
)
endpoint = "http://localhost:8080/v2/models/numpyro-divorce/infer"
response = requests.post(endpoint, json=inference_request.model_dump())
response.json()
This section expects access to a functional Kubernetes cluster with Seldon Core installed and some familiarity with `kubectl`.
Also consider that, depending on your Kubernetes installation, Seldon Core might expect to get the container image from a public container registry like [Docker hub](https://hub.docker.com/) or [Google Container Registry](https://cloud.google.com/container-registry). For that, you need to do an extra step of pushing the container to the registry using `docker tag <image name> <container registry>/<image name>` and `docker push <container registry>/<image name>`, and also updating the `image` section of the yaml file to `<container registry>/<image name>`.
%%writefile seldondeployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
name: numpyro-model
spec:
protocol: v2
predictors:
- name: default
graph:
name: numpyro-divorce
type: MODEL
componentSpecs:
- spec:
containers:
- name: numpyro-divorce
          image: my-custom-numpyro-server:0.1.0
load() <mlserver.MLModel.load>: Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
unload() <mlserver.MLModel.unload>: Responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).
predict() <mlserver.MLModel.predict>: Responsible for using a model to perform inference on an incoming data point.
Therefore, the "one-line version" of how to write a custom runtime is to write a custom class extending from MLModel <mlserver.MLModel>, and then overriding those methods with your custom logic.
MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your predict() method with the mlserver.codecs.decode_args decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.
Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer's codecs and content types system.
As an example of the above, let's assume a model which
Takes two lists of strings as inputs:
questions, containing multiple questions to ask our model.
context, containing multiple contexts for each of the questions.
Returns a Numpy array with some predictions as the output.
Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:
Note that the method signature of our predict method now specifies:
The input names that we should be looking for in the request payload (i.e. questions and context).
The expected content type for each of the request inputs (i.e. List[str] in both cases).
The expected content type of the response outputs (i.e. np.ndarray).
There are occasions where custom logic must be made conditional to extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the headers field of the parameters object within the InferenceRequest instance.
Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the headers field within the parameters object of the returned InferenceResponse instance.
MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your model-settings.json configuration file.
For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:
Note that, from the example above, we are assuming that:
Your custom runtime code lives in the models.py file.
The implementation field of your model-settings.json configuration file contains the import path of your custom runtime (e.g. models.MyCustomRuntime).
More often than not, your custom runtimes will depend on external third-party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.
It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your model-settings.json file.
To load a custom environment, parallel inference must be enabled.
The main MLServer process communicates with workers in custom environments via multiprocessing.Queue using pickled objects. Custom environments therefore must use the same version of MLServer and a compatible version of Python with the same default pickle protocol as the main process. Consult the tables below for environment compatibility.
Legend: 🔴 Unsupported · 🟢 Supported · 🔵 Untested
3.9: 🟢 🟢 🔵
3.10: 🟢 🟢 🔵
3.11: 🔵 🔵
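As a quick sanity check before packaging an environment tarball, you can compare these values across the main MLServer environment and the custom one. The snippet below is a minimal sketch (it only assumes that mlserver is importable in the environment being inspected):

```python
# Run this in both the main MLServer environment and the custom (tarball) environment,
# then compare the outputs: they should report the same MLServer version, a compatible
# Python version and the same default pickle protocol.
import pickle
import sys

import mlserver

print("Python version:", ".".join(map(str, sys.version_info[:3])))
print("Default pickle protocol:", pickle.DEFAULT_PROTOCOL)
print("MLServer version:", mlserver.__version__)
```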
If we take the previous example above as a reference, we could extend it to include our custom environment as:
Note that, in the folder layout above, we are assuming that:
The environment.tar.gz tarball contains a pre-packaged version of your custom environment.
The environment_tarball field of your model-settings.json configuration file points to your pre-packaged custom environment (i.e. ./environment.tar.gz).
MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a requirements.txt file.
To leverage these, we can use the mlserver build command. Assuming that we're currently on the folder containing our custom inference runtime, we should be able to just run:
The output will be a Docker image named my-custom-server, ready to be used.
The mlserver build subcommand will search for any Conda environment file (i.e. named either as environment.yaml or conda.yaml) and / or any requirements.txt present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.
The mlserver build subcommand will treat any settings.json or model-settings.json files present on your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or to even include embedded models that will always be present within your custom image.
Out of the box, the mlserver build subcommand leverages a default Dockerfile which takes into account a number of requirements, such as:
Supporting arbitrary user IDs.
Building your base custom environment on the fly.
Configuring a set of default setting values.
However, there may be occasions where you need to customise your Dockerfile even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" tools, like Kaniko or Buildah).
To account for these cases, MLServer also includes a mlserver dockerfile subcommand which will just generate a Dockerfile (and optionally a .dockerignore file) exactly like the one used by the mlserver build command. This Dockerfile can then be customised according to your needs.
| Framework | Supported |
| --- | --- |
| CatBoost | ✅ |
| Tempo | ✅ |
| MLflow | ✅ |
| Alibi-Detect | ✅ |
| Alibi-Explain | ✅ |
| HuggingFace | ✅ |
3.13: 🔴
This tutorial walks through the steps required to take a python ML model from your machine to a production deployment on Kubernetes. More specifically we'll cover:
Running the model locally
Turning the ML model into an API
Containerizing the model
Storing the container in a registry
Deploying the model to Kubernetes (with Seldon Core)
Scaling the model
The tutorial comes with an accompanying video which you might find useful as you work through the steps:
The slides used in the video can be found .
For this tutorial, we're going to use the available from the Tensorflow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either "healthy" or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).
We won't go through the steps of training the classifier. Instead, we'll be using a pre-trained one available on TensorFlow Hub. You can find the .
The easiest way to run this example is to clone the repository located :
If you've already cloned the MLServer repository, you can also find it in docs/examples/cassava.
Once you've done that, you can just run:
And it'll set you up with all the libraries required to run the code.
The starting point for this tutorial is the Python script app.py. This is typical of the kind of Python code we'd run standalone or in a Jupyter notebook. Let's familiarise ourselves with the code:
First up, we're importing a couple of functions from our helpers.py file:
plot provides the visualisation of the samples, labels and predictions.
preprocess is used to resize images to 224x224 pixels and normalize the RGB values.
The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.
Try it yourself by running:
Here's what our setup currently looks like:
The problem with running our code like we did earlier is that it's not accessible to anyone who doesn't have the Python script (and all of its dependencies). A good way to solve this is to turn our model into an API.
Typically people turn to popular Python web servers like or . This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we're going to use the open source framework.
MLServer supports a bunch of out of the box, but it also supports which is what we'll use for our Tensorflow model.
In order to get our model ready to run on MLServer we need to wrap it in a single python class with two methods, load() and predict(). Let's take a look at the code (found in model/serve-model.py):
The load() method is used to define any logic required to set up our model for inference. In our case, we're loading the model weights into self._model. The predict() method is where we include all of our prediction logic.
You may notice that we've slightly modified our code from earlier (in app.py). The biggest change is that it is now wrapped in a single class CassavaModel.
The only other task we need to do to run our model on MLServer is to specify a model-settings.json file:
This is a simple configuration file that tells MLServer how to handle our model. In our case, we've provided a name for our model and told MLServer where to look for our model class (serve-model.CassavaModel).
We're now ready to serve our model with MLServer. To do that we can simply run:
MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.
Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send predictions to our API using the test.py file by running:
Our setup has now evolved and looks like this:
are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.
Note: you will need installed to run this section of the tutorial. You'll also need a account or another container registry.
Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully MLServer removes this complexity and provides us with a simple build command.
Before we run this command, we need to provide our dependencies in either a requirements.txt or a conda.env file. The requirements file we'll use for this example is stored in model/requirements.txt:
Notice that we didn't need to include mlserver in our requirements? That's because the builder image has mlserver included already.
We're now ready to build our container image using:
Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".
MLServer will now build the model into a container image for us. We can check the output of this by running:
Finally, we want to send this container image to be stored in our container registry. We can do this by running:
Our setup now looks like this. Where our model has been packaged and sent to a container registry:
Now that we've turned our model into a production-ready API, containerized it and pushed it to a registry, it's time to deploy our model.
We're going to use a popular open source framework called to deploy our model. Seldon Core is great because it combines all of the awesome cloud-native features we get from but it also adds machine-learning specific features.
This tutorial assumes you already have a Seldon Core cluster up and running. If that's not the case, head over to the and get set up first. You'll also need to install the kubectl command line interface.
To create our deployment with Seldon Core we need to create a small configuration file that looks like this:
You can find this file named deployment.yaml in the base folder of this tutorial's repository.
Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".
We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:
To check our deployment is up and running we can run:
We should see STATUS = Running once our deployment has finalized.
Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it's working.
To do this, we simply run the test.py file in the following way:
This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.
A note on running this yourself: This example is set up to connect to a kubernetes cluster running locally on your machine. If yours is local too, you'll need to make sure you before sending requests. If your cluster is remote, you'll need to change the inference_url variable on line 21 of test.py.
Having deployed our model to kubernetes and tested it, our setup now looks like this:
Our model is now running in a production environment and able to handle requests from external sources. This is awesome but what happens as the number of requests being sent to our model starts to increase? Eventually, we'll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model .
Kubernetes and Seldon Core make this really easy to do by simply running:
We can replace the --replicas=3 with any number we want to scale to.
To watch the servers scaling out we can run:
Once the new replicas have finished rolling out, our setup now looks like this:
In this tutorial we've scaled the model out manually to show how it works. In a real environment we'd want to set up to make sure our prediction API is always online and performing as expected.
Out of the box, MLServer supports the deployment and serving of HuggingFace Transformer models with the following features:
Loading of Transformer Model artifacts from the Hugging Face Hub.
Model quantization & optimization using the Hugging Face Optimum library
Request batching for GPU optimization (via adaptive batching and request batching)
In this example, we will showcase some of these features using an example model.
{
"model": "sum-model",
"implementation": "models.MyCustomRuntime"
}
{
"model": "sum-model",
"implementation": "models.MyCustomRuntime",
"parameters": {
"environment_tarball": "./environment.tar.gz"
}
}
DOCKER_BUILDKIT=1 docker build . -t my-custom-runtime:0.1.0
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse
class MyCustomRuntime(MLModel):
async def load(self) -> bool:
# TODO: Replace for custom logic to load a model artifact
self._model = load_my_custom_model()
return True
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
# TODO: Replace for custom logic to run inference
        return self._model.predict(payload)
from mlserver import MLModel
from mlserver.codecs import decode_args
from typing import List
import numpy as np
class MyCustomRuntime(MLModel):
async def load(self) -> bool:
# TODO: Replace for custom logic to load a model artifact
self._model = load_my_custom_model()
return True
@decode_args
async def predict(self, questions: List[str], context: List[str]) -> np.ndarray:
# TODO: Replace for custom logic to run inference
        return self._model.predict(questions, context)
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse
class CustomHeadersRuntime(MLModel):
...
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if payload.parameters and payload.parameters.headers:
# These are all the incoming HTTP headers / gRPC metadata
print(payload.parameters.headers)
        ...
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, Parameters
class CustomHeadersRuntime(MLModel):
...
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
...
return InferenceResponse(
# Include any actual outputs from inference
outputs=[],
parameters=Parameters(headers={"foo": "bar"})
        )
.
└── models
    └── sum-model
        ├── model-settings.json
        ├── models.py
.
└── models
    └── sum-model
        ├── environment.tar.gz
        ├── model-settings.json
        ├── models.py
mlserver build . -t my-custom-server

Since we're using a pretrained model, we can skip straight to serving.
Now that we have our config in place, we can start the server by running mlserver start .. This needs to be run either from the same directory where our config files are, or by pointing to the folder where they are.
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background on a separate terminal.
We can also leverage the Optimum library that allows us to access quantized and optimized models.
We can download pretrained optimized models from the hub if available by enabling the optimum_model flag:
Once again, you are able to run the model using the MLServer CLI. As before, this needs to be run either from the same directory where our config files are, or by pointing to the folder where they are.
The request can now be sent using the same request structure but using optimized models for better performance.
We can support multiple other Transformers tasks beyond just text generation; below are examples for a few of the other supported tasks.
Once again, you are able to run the model using the MLServer CLI.
Once again, you are able to run the model using the MLServer CLI.
We can also evaluate GPU acceleration by testing the speed on CPU vs GPU using the following parameters.
We first measure the time taken with device=-1, which configures the runtime to use the CPU by default.
Once again, you are able to run the model using the MLServer CLI.
We can see that the CPU run takes over a minute, roughly six times longer than the GPU example below.
IMPORTANT: Running the code below requires having a machine with a GPU configured correctly to work for Tensorflow/Pytorch.
Now we'll run the benchmark with GPU configured, which we can do by setting device=0
We can see that the elapsed time is roughly six times less than the CPU version!
We can also see how the adaptive batching capabilities can allow for GPU acceleration by grouping multiple incoming requests so they get processed in GPU batch.
We can enable adaptive batching with the max_batch_size parameter, which in our case we will set to 128.
We will also configure max_batch_time, which specifies the maximum amount of time the MLServer orchestrator will wait before sending the batch for inference.
In order to achieve the throughput required of 50 requests per second, we will use the tool vegeta which performs load testing.
We can now see that the requests are batched and that we receive a 100% success rate, even though the requests are sent one by one.
The first step will be to train and serialise an MLflow model. For that, we will use the linear regression example from the MLflow docs.
The training script will also serialise our trained model, leveraging the MLflow Model format. By default, we should be able to find the saved artifact under the mlruns folder.
Now that we have trained and serialised our model, we are ready to start serving it. For that, the initial step will be to set up a model-settings.json that instructs MLServer to load our artifact using the MLflow Inference Runtime.
Now that we have our config in place, we can start the server by running mlserver start .. This needs to be run either from the same directory where our config files are, or by pointing to the folder where they are.
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Note that, the request specifies the value pd as its content type, whereas every input specifies the content type np. These parameters will instruct MLServer to:
Convert every input value to a NumPy array, using the data type and shape information provided.
Group all the different inputs into a Pandas DataFrame, using their names as the column names.
To learn more about how MLServer uses content type parameters, you can check this worked out example.
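As a hedged illustration of those parameters (the exact payload used by this example may differ slightly), a hand-written request combining the request-level pd content type with np inputs could look like this:

```python
# Sketch of a V2 payload using the `pd` and `np` content types described above.
inference_request = {
    # Decode the request as a whole into a Pandas DataFrame
    "parameters": {"content_type": "pd"},
    "inputs": [
        {
            "name": "fixed acidity",
            "shape": [1],
            "datatype": "FP32",
            "data": [7.4],
            # Decode this individual input as a NumPy array
            "parameters": {"content_type": "np"},
        },
        # ... the remaining wine features follow the same structure
    ],
}
```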
As we can see in the output above, the predicted quality score for our input wine was 5.57.
MLflow currently ships with a scoring server with its own protocol. In order to provide a drop-in replacement, the MLflow runtime in MLServer also exposes a custom endpoint which matches the signature of MLflow's /invocations endpoint.
As an example, we can try to send the same request that we sent previously, but using MLflow's protocol. Note that, in both cases, the request will be handled by the same MLServer instance.
As we can see above, the predicted quality for our input is 5.57, matching the prediction we obtained above.
MLflow lets users define a model signature, where they can specify what types of inputs the model accepts and what types of outputs it returns. Similarly, the V2 inference protocol employed by MLServer defines a metadata endpoint which can be used to query which inputs and outputs the model accepts. However, even though they serve similar functions, the data schemas used by each of them are not compatible with each other.
To solve this, if your model defines an MLflow model signature, MLServer will convert this signature on the fly to a metadata schema compatible with the V2 Inference Protocol. This will also include specifying any extra content type that is required to correctly decode / encode your data.
As an example, we can first have a look at the model signature saved for our MLflow model. This can be seen directly on the MLModel file saved by our model.
We can then query the metadata endpoint, to see the model metadata inferred by MLServer from our test model's signature. For this, we will use the /v2/models/wine-classifier/ endpoint.
As we should be able to see, the model metadata now matches the information contained in our model signature, including any extra content types necessary to decode our data correctly.
git clone https://github.com/SeldonIO/cassava-example.git
cd cassava-example/
pip install -r requirements.txt
from helpers import plot, preprocess
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub
# Fixes an issue with Jax and TF competing for GPU
tf.config.experimental.set_visible_devices([], 'GPU')
# Load the model
model_path = './model'
classifier = hub.KerasLayer(model_path)
# Load the dataset and store the class names
dataset, info = tfds.load('cassava', with_info=True)
class_names = info.features['label'].names + ['unknown']
# Select a batch of examples and plot them
batch_size = 9
batch = dataset['validation'].map(preprocess).batch(batch_size).as_numpy_iterator()
examples = next(batch)
plot(examples, class_names)
# Generate predictions for the batch and plot them against their labels
predictions = classifier(examples['image'])
predictions_max = tf.argmax(predictions, axis=-1)
print(predictions_max)
plot(examples, class_names, predictions_max)
python app.py
from mlserver import MLModel
from mlserver.codecs import decode_args
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
# Define a class for our Model, inheriting the MLModel class from MLServer
class CassavaModel(MLModel):
# Load the model into memory
async def load(self) -> bool:
tf.config.experimental.set_visible_devices([], 'GPU')
model_path = '.'
self._model = hub.KerasLayer(model_path)
self.ready = True
return self.ready
# Logic for making predictions against our model
@decode_args
async def predict(self, payload: np.ndarray) -> np.ndarray:
# convert payload to tf.tensor
payload_tensor = tf.constant(payload)
# Make predictions
predictions = self._model(payload_tensor)
predictions_max = tf.argmax(predictions, axis=-1)
# convert predictions to np.ndarray
response_data = np.array(predictions_max)
        return response_data
{
"name": "cassava",
"implementation": "serve-model.CassavaModel"
}
mlserver start model/
python test.py --local
tensorflow==2.12.0
tensorflow-hub==0.13.0
mlserver build model/ -t [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]
docker images
docker push [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
name: cassava
spec:
protocol: v2
predictors:
- componentSpecs:
- spec:
containers:
- image: YOUR_CONTAINER_REGISTRY/IMAGE_NAME
name: cassava
imagePullPolicy: Always
graph:
name: cassava
type: MODEL
    name: cassava
kubectl create -f deployment.yaml
kubectl get pods
python test.py --remote
kubectl scale sdep cassava --replicas=3
kubectl get pods --watch
# Import required dependencies
import requests
%%writefile ./model-settings.json
{
"name": "transformer",
"implementation": "mlserver_huggingface.HuggingFaceRuntime",
"parameters": {
"extra": {
"task": "text-generation",
"pretrained_model": "distilgpt2"
}
}
}
Overwriting ./model-settings.json
mlserver start .
inference_request = {
"inputs": [
{
"name": "args",
"shape": [1],
"datatype": "BYTES",
"data": ["this is a test"],
}
]
}
requests.post(
"http://localhost:8080/v2/models/transformer/infer", json=inference_request
).json()
{'model_name': 'transformer',
'id': 'eb160c6b-8223-4342-ad92-6ac301a9fa5d',
'parameters': {},
'outputs': [{'name': 'output',
'shape': [1, 1],
'datatype': 'BYTES',
'parameters': {'content_type': 'hg_jsonlist'},
 'data': ['{"generated_text": "this is a testnet with 1-3,000-bit nodes as nodes."}']}]}
%%writefile ./model-settings.json
{
"name": "transformer",
"implementation": "mlserver_huggingface.HuggingFaceRuntime",
"parameters": {
"extra": {
"task": "text-generation",
"pretrained_model": "distilgpt2",
"optimum_model": true
}
}
}
Overwriting ./model-settings.json
mlserver start .
inference_request = {
"inputs": [
{
"name": "args",
"shape": [1],
"datatype": "BYTES",
"data": ["this is a test"],
}
]
}
requests.post(
"http://localhost:8080/v2/models/transformer/infer", json=inference_request
).json()
{'model_name': 'transformer',
'id': '9c482c8d-b21e-44b1-8a42-7650a9dc01ef',
'parameters': {},
'outputs': [{'name': 'output',
'shape': [1, 1],
'datatype': 'BYTES',
'parameters': {'content_type': 'hg_jsonlist'},
 'data': ['{"generated_text": "this is a test of the \\"safe-code-safe-code-safe-code\\" approach. The method only accepts two parameters as parameters: the code. The parameter \'unsafe-code-safe-code-safe-code\' should"}']}]}
%%writefile ./model-settings.json
{
"name": "transformer",
"implementation": "mlserver_huggingface.HuggingFaceRuntime",
"parameters": {
"extra": {
"task": "question-answering"
}
}
}
Overwriting ./model-settings.json
mlserver start .
inference_request = {
"inputs": [
{
"name": "question",
"shape": [1],
"datatype": "BYTES",
"data": ["what is your name?"],
},
{
"name": "context",
"shape": [1],
"datatype": "BYTES",
"data": ["Hello, I am Seldon, how is it going"],
},
]
}
requests.post(
"http://localhost:8080/v2/models/transformer/infer", json=inference_request
).json()
{'model_name': 'transformer',
'id': '4efac938-86d8-41a1-b78f-7690b2dcf197',
'parameters': {},
'outputs': [{'name': 'output',
'shape': [1, 1],
'datatype': 'BYTES',
'parameters': {'content_type': 'hg_jsonlist'},
 'data': ['{"score": 0.9869915843009949, "start": 12, "end": 18, "answer": "Seldon"}']}]}
%%writefile ./model-settings.json
{
"name": "transformer",
"implementation": "mlserver_huggingface.HuggingFaceRuntime",
"parameters": {
"extra": {
"task": "text-classification"
}
}
}
Overwriting ./model-settings.json
mlserver start .
inference_request = {
"inputs": [
{
"name": "args",
"shape": [1],
"datatype": "BYTES",
"data": ["This is terrible!"],
}
]
}
requests.post(
"http://localhost:8080/v2/models/transformer/infer", json=inference_request
).json()
{'model_name': 'transformer',
'id': '835eabbd-daeb-4423-a64f-a7c4d7c60a9b',
'parameters': {},
'outputs': [{'name': 'output',
'shape': [1, 1],
'datatype': 'BYTES',
'parameters': {'content_type': 'hg_jsonlist'},
 'data': ['{"label": "NEGATIVE", "score": 0.9996137022972107}']}]}
%%writefile ./model-settings.json
{
"name": "transformer",
"implementation": "mlserver_huggingface.HuggingFaceRuntime",
"max_batch_size": 128,
"max_batch_time": 1,
"parameters": {
"extra": {
"task": "text-generation",
"device": -1
}
}
}
Overwriting ./model-settings.json
mlserver start .
inference_request = {
"inputs": [
{
"name": "text_inputs",
"shape": [1],
"datatype": "BYTES",
"data": ["This is a generation for the work" for i in range(512)],
}
]
}
# Benchmark time
import time
start_time = time.monotonic()
requests.post(
"http://localhost:8080/v2/models/transformer/infer", json=inference_request
)
print(f"Elapsed time: {time.monotonic() - start_time}")
Elapsed time: 66.42268538899953
%%writefile ./model-settings.json
{
"name": "transformer",
"implementation": "mlserver_huggingface.HuggingFaceRuntime",
"parameters": {
"extra": {
"task": "text-generation",
"device": 0
}
}
}
Overwriting ./model-settings.json
inference_request = {
"inputs": [
{
"name": "text_inputs",
"shape": [1],
"datatype": "BYTES",
"data": ["This is a generation for the work" for i in range(512)],
}
]
}
# Benchmark time
import time
start_time = time.monotonic()
requests.post(
"http://localhost:8080/v2/models/transformer/infer", json=inference_request
)
print(f"Elapsed time: {time.monotonic() - start_time}")
Elapsed time: 11.27933280000434
%%writefile ./model-settings.json
{
"name": "transformer",
"implementation": "mlserver_huggingface.HuggingFaceRuntime",
"max_batch_size": 128,
"max_batch_time": 1,
"parameters": {
"extra": {
"task": "text-generation",
"pretrained_model": "distilgpt2",
"device": 0
}
}
}
Overwriting ./model-settings.json
%%bash
jq -ncM '{"method": "POST", "header": {"Content-Type": ["application/json"] }, "url": "http://localhost:8080/v2/models/transformer/infer", "body": "{\"inputs\":[{\"name\":\"text_inputs\",\"shape\":[1],\"datatype\":\"BYTES\",\"data\":[\"test\"]}]}" | @base64 }' \
| vegeta \
-cpus="2" \
attack \
-duration="3s" \
-rate="50" \
-format=json \
| vegeta \
report \
-type=text
Requests [total, rate, throughput] 150, 50.34, 22.28
Duration [total, attack, wait] 6.732s, 2.98s, 3.753s
Latencies [min, mean, 50, 90, 95, 99, max] 1.975s, 3.168s, 3.22s, 4.065s, 4.183s, 4.299s, 4.318s
Bytes In [total, mean] 60978, 406.52
Bytes Out [total, mean] 12300, 82.00
Success [ratio] 100.00%
Status Codes [code:count] 200:150
Error Set:
from IPython.core.magic import register_line_cell_magic
@register_line_cell_magic
def writetemplate(line, cell):
with open(line, 'w') as f:
        f.write(cell.format(**globals()))
# %load src/train.py
# Original source code and more details can be found in:
# https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
# The data set used in this example is from
# http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties.
# In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
import warnings
import sys
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
import logging
logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)
def eval_metrics(actual, pred):
rmse = np.sqrt(mean_squared_error(actual, pred))
mae = mean_absolute_error(actual, pred)
r2 = r2_score(actual, pred)
return rmse, mae, r2
if __name__ == "__main__":
warnings.filterwarnings("ignore")
np.random.seed(40)
# Read the wine-quality csv file from the URL
csv_url = (
"http://archive.ics.uci.edu/ml"
"/machine-learning-databases/wine-quality/winequality-red.csv"
)
try:
data = pd.read_csv(csv_url, sep=";")
except Exception as e:
logger.exception(
"Unable to download training & test CSV, "
"check your internet connection. Error: %s",
e,
)
# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)
# The predicted column is "quality" which is a scalar from [3, 9]
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]
alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5
with mlflow.start_run():
lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
lr.fit(train_x, train_y)
predicted_qualities = lr.predict(test_x)
(rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
print(" RMSE: %s" % rmse)
print(" MAE: %s" % mae)
print(" R2: %s" % r2)
mlflow.log_param("alpha", alpha)
mlflow.log_param("l1_ratio", l1_ratio)
mlflow.log_metric("rmse", rmse)
mlflow.log_metric("r2", r2)
mlflow.log_metric("mae", mae)
tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
model_signature = infer_signature(train_x, train_y)
# Model registry does not work with file store
if tracking_url_type_store != "file":
# Register the model
# There are other ways to use the Model Registry,
# which depends on the use case,
# please refer to the doc for more information:
# https://mlflow.org/docs/latest/model-registry.html#api-workflow
mlflow.sklearn.log_model(
lr,
"model",
registered_model_name="ElasticnetWineModel",
signature=model_signature,
)
else:
mlflow.sklearn.log_model(lr, "model", signature=model_signature)
!python src/train.py
import os
[experiment_file_path] = !ls -td ./mlruns/0/* | head -1
model_path = os.path.join(experiment_file_path, "artifacts", "model")
print(model_path)
!ls {model_path}
%%writetemplate ./model-settings.json
{{
"name": "wine-classifier",
"implementation": "mlserver_mlflow.MLflowRuntime",
"parameters": {{
"uri": "{model_path}"
}}
}}
mlserver start .
import requests
inference_request = {
"inputs": [
{
"name": "fixed acidity",
"shape": [1],
"datatype": "FP32",
"data": [7.4],
},
{
"name": "volatile acidity",
"shape": [1],
"datatype": "FP32",
"data": [0.7000],
},
{
"name": "citric acid",
"shape": [1],
"datatype": "FP32",
"data": [0],
},
{
"name": "residual sugar",
"shape": [1],
"datatype": "FP32",
"data": [1.9],
},
{
"name": "chlorides",
"shape": [1],
"datatype": "FP32",
"data": [0.076],
},
{
"name": "free sulfur dioxide",
"shape": [1],
"datatype": "FP32",
"data": [11],
},
{
"name": "total sulfur dioxide",
"shape": [1],
"datatype": "FP32",
"data": [34],
},
{
"name": "density",
"shape": [1],
"datatype": "FP32",
"data": [0.9978],
},
{
"name": "pH",
"shape": [1],
"datatype": "FP32",
"data": [3.51],
},
{
"name": "sulphates",
"shape": [1],
"datatype": "FP32",
"data": [0.56],
},
{
"name": "alcohol",
"shape": [1],
"datatype": "FP32",
"data": [9.4],
},
]
}
endpoint = "http://localhost:8080/v2/models/wine-classifier/infer"
response = requests.post(endpoint, json=inference_request)
response.json()
import requests
inference_request = {
"dataframe_split": {
"columns": [
"fixed acidity",
"volatile acidity",
"citric acid",
"residual sugar",
"chlorides",
"free sulfur dioxide",
"total sulfur dioxide",
"density",
"pH",
"sulphates",
"alcohol",
],
"data": [[7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4]]
}
}
endpoint = "http://localhost:8080/invocations"
response = requests.post(endpoint, json=inference_request)
response.json()
!cat {model_path}/MLmodel
import requests
endpoint = "http://localhost:8080/v2/models/wine-classifier"
response = requests.get(endpoint)
response.json()
This guide will help you get started creating machine learning microservices with MLServer in less than 30 minutes. Our use case will be to create a service that helps us compare the similarity between two documents. Think about when you are choosing a book, news article, blog post, or tutorial to read next: wouldn't it be great to have a way to compare it with similar ones that you have already read and liked (without having to rely on a recommendation system)? That's what we'll focus on in this guide: creating a document similarity service. 📜 + 📃 = 😎👌🔥
The code is showcased as if it were cells inside a notebook but you can run each of the steps inside Python files with minimal effort.
MLServer is an open-source Python library for building production-ready asynchronous APIs for machine learning models.
The first step is to install mlserver, the spacy library, and the language model spacy will need for our use case. We will also download the wikipedia-api library to test our use case with a few fun summaries.
If you've never heard of spaCy before, it is an open-source Python library for advanced natural language processing that excels at large-scale information extraction and retrieval tasks, among many others. The model we'll use is a pre-trained model on English text from the web. This model will help us get started with our use case faster than if we had to train a model from scratch for our use case.
Let's first install these libraries.
We will also need to download the language model separately once we have spaCy inside our virtual environment.
If you're going over this guide inside a notebook, don't forget to add an exclamation mark ! in front of the two commands above. If you are in VSCode, you can keep them as they are and change the cell type to bash.
At its core, MLServer requires that users give it 3 things: a model-settings.json file with information about the model, an (optional) settings.json file with information related to the server you are about to set up, and a .py file with the load-predict recipe for your model (as shown in the picture above).
Let's create a directory for our model.
Before we create a service that allows us to compare the similarity between two documents, it is good practice to first test that our solution works, especially if we're using a pre-trained model and/or a pipeline.
Now that we have our model loaded, let's look at the similarity of the abstracts of Barbieheimer using the wikipedia-api Python library. The main requirement of the API is that we pass into the main class, Wikipedia(), a project name, an email and the language we want information to be returned in. After that, we can search for the movie summaries we want by passing the title of the movie to the .page() method and accessing its summary with the .summary attribute.
Feel free to change the movies for other topics you might be interested in.
You can run the following lines inside a notebook or, conversely, add them to a app.py file.
If you created an app.py file with the code above, make sure you run python app.py from the terminal.
Now that we have our two summaries, let's compare them using spacy.
Notice that both summaries have information about the other movie, about "films" in general, and about the dates each aired on (which is the same). The reality is that, the model hasn't seen any of these movies so it might be generalizing to the context of each article, "movies," rather than their content, "dolls as humans and the atomic bomb."
You should, of course, play around with different pages and see if what you get back is coherent with what you would expect.
Time to create a machine learning API for our use-case. 😎
MLServer allows us to wrap machine learning models into APIs and build microservices with replicas of a single model, or different models all together.
To create a service with MLServer, we will define a class with two asynchronous functions, one that loads the model and another one to run inference (or predict) with. The former will load the spacy model we tested in the last section, and the latter will take in a list with the two documents we want to compare. Lastly, our function will return a numpy array with a single value, our similarity score. We'll write the file to our similarity_model directory and call it my_model.py.
Now that we have our model file ready to go, the last piece of the puzzle is to tell MLServer a bit of info about it. In particular, it wants (or needs) to know the name of the model and how to implement it. The former can be anything you want (and it will be part of the URL of your API), and the latter will follow the recipe of name_of_py_file_with_your_model.class_with_your_model.
Let's create the model-settings.json file MLServer is expecting inside our similarity_model directory and add the name and the implementation of our model to it.
Now that everything is in place, we can start serving predictions locally to test how things would play out for our future users. We'll initiate our server via the command line, and later on we'll see how to do the same via Python files. Here's where we are at right now in the process of developing microservices with MLServer.
As you can see in the image, our server will be initialized with three entry points, one for HTTP requests, another for gRPC, and another for the metrics. To learn more about the powerful metrics feature of MLServer, please visit the relevant docs page here. To learn more about gRPC, please see this tutorial here.
To start our service, open up a terminal and run the following command.
Note: If this is a fresh terminal, make sure you activate your environment before you run the command above. If you run the command above from your notebook (e.g. !mlserver start similarity_model/), you will have to send the request below from another notebook or terminal since the cell will continue to run until you turn it off.
Time to become a client of our service and test it. For this, we'll set up the payload we'll send to our service and use the requests library to POST our request.
Please note that the request below uses the variables we created earlier with the summaries of Barbie and Oppenheimer. If you are sending this POST request from a fresh python file, make sure you move those lines of code above into your request file.
Let's decompose what just happened.
The URL for our service might seem a bit odd if you've never heard of the V2/Open Inference Protocol (OIP). This protocol is a set of specifications that allows machine learning models to be shared and deployed in a standardized way. This protocol enables the use of machine learning models on a variety of platforms and devices without requiring changes to the model or its code. The OIP is useful because it allows us to integrate machine learning into a wide range of applications in a standard way.
All URLs you create with MLServer will have the following structure.
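As a plain-text sketch of that structure (using the doc-sim-model served in this guide as the example):

```python
# General pattern for inference URLs under the Open Inference Protocol:
#   http://<host>:<port>/v2/models/<model-name>/infer
# For the model in this guide, that becomes:
endpoint = "http://0.0.0.0:8080/v2/models/doc-sim-model/infer"
```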
This kind of protocol is a standard adopted by different companies like NVIDIA, Tensorflow Serving, KServe, and others, to keep everyone on the same page. If you think about driving cars globally, your country has to apply a standard for driving on a particular side of the road, and this ensures you and everyone else stays on the left (or the right depending on where you are at). Adopting this means that you won't have to wonder where the next driver is going to come out of when you are driving and are about to take a turn, instead, you can focus on getting to where you're going to without much worrying.
Let's describe what each of the components of our inference_request does.
name: this maps one-to-one to the name of the parameter in your predict() function.
shape: represents the shape of the elements in our data. In our case, it is a list with [2] strings.
datatype: the different data types expected by the server, e.g., str, numpy array, pandas dataframe, bytes, etc.
parameters: allows us to specify the content_type beyond the data types
data: the inputs to our predict function.
To learn more about the OIP and how MLServer content types work, please have a look at their docs page here.
Say you need to meet the demand of a high number of users and one model might not be enough, or is not using all of the resources of the virtual machine instance it was allocated to. What we can do in this case is to create multiple replicas of our model to increase the throughput of the requests that come in. This can be particularly useful at the peak times of our server. To do this, we need to tweak the settings of our server via the settings.json file. In it, we'll add the number of independent models we want to have to the parameter "parallel_workers": 3.
Let's stop our server, change the settings of it, start it again, and test it.
As you can see in the output of the terminal in the picture above, we now have 3 models running in parallel. The reason you might see 4 is that, by default, MLServer prints the name of the initialized model (if there is at least one), and it also prints one entry for each of the replicas specified in the settings.
Let's get a few more twin films examples to test our server. Get as creative as you'd like. 💡
Let's first test that the function works as intended.
Now let's map three POST requests at the same time.
We can also test it one by one.
For the last step of this guide, we are going to package our model and service into a docker image that we can reuse in another project or share with colleagues immediately. This step requires that we have docker installed and configured in our PCs, so if you need to set up docker, you can do so by following the instructions in the documentation here.
The first step is to create a requirements.txt file with all of our dependencies and add it to the directory we've been using for our service (similarity_model).
The next step is to build a docker image with our model, its dependencies and our server. If you've never heard of docker images before, here's a short description.
A Docker image is a lightweight, standalone, and executable package that includes everything needed to run a piece of software, including code, libraries, dependencies, and settings. It's like a carry-on bag for your application, containing everything it needs to travel safely and run smoothly in different environments. Just as a carry-on bag allows you to bring your essentials with you on a trip, a Docker image enables you to transport your application and its requirements across various computing environments, ensuring consistent and reliable deployment.
MLServer has a convenient function that lets us create docker images with our services. Let's use it.
We can check that our image was successfully built not only by looking at the logs of the previous command but also with the docker images command.
Let's test that our image works as intended with the following command. Make sure you have closed your previous server by using CTRL + C in your terminal.
Now that you have a packaged and fully-functioning microservice with our model, we could deploy our container to a production serving platform like Seldon Core, or via different offerings available through the many cloud providers out there (e.g. AWS Lambda, Google Cloud Run, etc.). You could also run this image on KServe, a Kubernetes native tool for model serving, or anywhere else where you can bring your docker image with you.
To learn more about MLServer and the different ways in which you can use it, head over to the examples section or the user guide. To learn about some of the deployment options available, head over to the docs here.
To keep up to date with what we are up to at Seldon, make sure you join our Slack community.
pip install mlserver spacy wikipedia-api
python -m spacy download en_core_web_lg
mkdir -p similarity_model
import spacy
nlp = spacy.load("en_core_web_lg")
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('MyMovieEval ([email protected])', 'en')
barbie = wiki_wiki.page('Barbie_(film)').summary
oppenheimer = wiki_wiki.page('Oppenheimer_(film)').summary
print(barbie)
print()
print(oppenheimer)Barbie is a 2023 American fantasy comedy film directed by Greta Gerwig and written by Gerwig and Noah Baumbach. Based on the Barbie fashion dolls by Mattel, it is the first live-action Barbie film after numerous computer-animated direct-to-video and streaming television films. The film stars Margot Robbie as Barbie and Ryan Gosling as Ken, and follows the two on a journey of self-discovery following an existential crisis. The film also features an ensemble cast that includes America Ferrera, Kate McKinnon, Issa Rae, Rhea Perlman, and Will Ferrell...
Oppenheimer is a 2023 biographical thriller film written and directed by Christopher Nolan. Based on the 2005 biography American Prometheus by Kai Bird and Martin J. Sherwin, the film chronicles the life of J. Robert Oppenheimer, a theoretical physicist who was pivotal in developing the first nuclear weapons as part of the Manhattan Project, and thereby ushering in the Atomic Age. Cillian Murphy stars as Oppenheimer, with Emily Blunt as Oppenheimer's wife Katherine "Kitty" Oppenheimer; Matt Damon as General Leslie Groves, director of the Manhattan Project; and Robert Downey Jr. as Lewis Strauss, a senior member of the United States Atomic Energy Commission. The ensemble supporting cast includes Florence Pugh, Josh Hartnett, Casey Affleck, Rami Malek, Gary Oldman and Kenneth Branagh...
doc1 = nlp(barbie)
doc2 = nlp(oppenheimer)
doc1.similarity(doc2)
0.9866910567224084
# similarity_model/my_model.py
from mlserver.codecs import decode_args
from mlserver import MLModel
from typing import List
import numpy as np
import spacy
class MyKulModel(MLModel):
async def load(self):
self.model = spacy.load("en_core_web_lg")
@decode_args
async def predict(self, docs: List[str]) -> np.ndarray:
doc1 = self.model(docs[0])
doc2 = self.model(docs[1])
        return np.array(doc1.similarity(doc2))
# similarity_model/model-settings.json
{
"name": "doc-sim-model",
"implementation": "my_model.MyKulModel"
}
mlserver start similarity_model/
from mlserver.codecs import StringCodec
import requests
inference_request = {
"inputs": [
StringCodec.encode_input(name='docs', payload=[barbie, oppenheimer], use_bytes=False).model_dump()
]
}
print(inference_request)
{'inputs': [{'name': 'docs',
'shape': [2, 1],
'datatype': 'BYTES',
'parameters': {'content_type': 'str'},
'data': [
'Barbie is a 2023 American fantasy comedy...',
'Oppenheimer is a 2023 biographical thriller...'
]
}]
}
r = requests.post('http://0.0.0.0:8080/v2/models/doc-sim-model/infer', json=inference_request)
r.json()
{'model_name': 'doc-sim-model',
'id': 'a4665ddb-1868-4523-bd00-a25902d9b124',
'parameters': {},
'outputs': [{'name': 'output-0',
'shape': [1],
'datatype': 'FP64',
'parameters': {'content_type': 'np'},
 'data': [0.9866910567224084]}]}
print(f"Our movies are {round(r.json()['outputs'][0]['data'][0] * 100, 4)}% similar!")
Our movies are 98.6691% similar
# similarity_model/settings.json
{
"parallel_workers": 3
}
mlserver start similarity_model
deep_impact = wiki_wiki.page('Deep_Impact_(film)').summary
armageddon = wiki_wiki.page('Armageddon_(1998_film)').summary
antz = wiki_wiki.page('Antz').summary
a_bugs_life = wiki_wiki.page("A_Bug's_Life").summary
the_dark_night = wiki_wiki.page('The_Dark_Knight').summary
mamma_mia = wiki_wiki.page('Mamma_Mia!_(film)').summary
def get_sim_score(movie1, movie2):
response = requests.post(
'http://0.0.0.0:8080/v2/models/doc-sim-model/infer',
json={
"inputs": [
StringCodec.encode_input(name='docs', payload=[movie1, movie2], use_bytes=False).model_dump()
]
})
    return response.json()['outputs'][0]['data'][0]
get_sim_score(deep_impact, armageddon)
0.9569279450151813
results = list(
map(get_sim_score, (deep_impact, antz, the_dark_night), (armageddon, a_bugs_life, mamma_mia))
)
results
[0.9569279450151813, 0.9725374771538605, 0.9626173937217876]
for movie1, movie2 in zip((deep_impact, antz, the_dark_night), (armageddon, a_bugs_life, mamma_mia)):
    print(get_sim_score(movie1, movie2))
0.9569279450151813
0.9725374771538605
0.9626173937217876
# similarity_model/requirements.txt
mlserver
spacy==3.6.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl
mlserver build similarity_model/ -t 'fancy_ml_service'
docker images
docker run -it --rm -p 8080:8080 fancy_ml_service
MLServer extends the V2 inference protocol by adding support for a content_type annotation. This annotation can be provided either through the model metadata parameters, or through the input parameters. By leveraging the content_type annotation, we can provide the necessary information to MLServer so that it can decode the input payload from the "wire" V2 protocol to something meaningful to the model / user (e.g. a NumPy array).
This example will walk you through some examples which illustrate how this works, and how it can be extended.
To start with, we will write a dummy runtime which just prints the input, the decoded input and returns it. This will serve as a testbed to showcase how the content_type support works.
Later on, we will extend this runtime by adding custom codecs that will decode our V2 payload to custom types.
As you can see above, this runtime will decode the incoming payloads by calling the self.decode() helper method. This method will check what's the right content type for each input in the following order:
Is there any content type defined in the inputs[].parameters.content_type field within the request payload?
Is there any content type defined in the inputs[].parameters.content_type field within the model metadata?
Is there any default content type that should be assumed? (see the sketch below for how a runtime can supply one)
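To illustrate that last check, a runtime can pass a fallback codec when calling self.decode(). The following is a minimal, hedged sketch (it is not part of the original example and assumes the default_codec keyword argument of MLModel.decode):

```python
from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse


class DefaultingRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        for request_input in payload.inputs:
            # If neither the request nor the model metadata defines a content type,
            # fall back to decoding this input as a NumPy array.
            decoded = self.decode(request_input, default_codec=NumpyCodec)
            print(decoded)
        return InferenceResponse(model_name=self.name, outputs=[])
```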
In order to enable this runtime, we will also create a model-settings.json file. This file should be present in (or accessible from) the folder where we run mlserver start ..
Our initial step will be to decide the content type based on the incoming inputs[].parameters field. For this, we will start our MLServer in the background (e.g. running mlserver start .)
As you've probably already noticed, writing request payloads compliant with the V2 Inference Protocol requires a certain knowledge of both the V2 spec and the structure expected by each content type. To account for this and simplify usage, the MLServer package exposes a set of utilities which will help you interact with your models via the V2 protocol.
These helpers are mainly shaped as "codecs". That is, abstractions which know how to "encode" and "decode" arbitrary Python datatypes to and from the V2 Inference Protocol.
Generally, we recommend using the existing set of codecs to generate your V2 payloads. This will ensure that requests and responses follow the right structure, and should provide a more seamless experience.
Following with our previous example, the same code could be rewritten using codecs as:
Note that the rewritten snippet now makes use of the built-in InferenceRequest class, which represents a V2 inference request. On top of that, it also uses the NumpyCodec and StringCodec implementations, which know how to encode a Numpy array and a list of strings into V2-compatible request inputs.
Our next step will be to define the expected content type through the model metadata. This can be done by extending the model-settings.json file, and adding a section on inputs.
After adding this metadata, we will re-start MLServer (e.g. mlserver start .) and we will send a new request without any explicit parameters.
As you should be able to see in the server logs, MLServer will cross-reference the input names against the model metadata to find the right content type.
There may be cases where a custom inference runtime may need to encode / decode to custom datatypes. As an example, we can think of computer vision models which may only operate with pillow image objects.
In these scenarios, it's possible to extend the Codec interface to write our custom encoding logic. A Codec is simply an object which defines decode() and encode() methods. To illustrate how this would work, we will extend our custom runtime to add a custom PillowCodec.
We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.
As you should be able to see in the MLServer logs, the server is now able to decode the payload into a Pillow image. This example also illustrates how Codec objects can be compatible with multiple datatype values (e.g. tensor and BYTES in this case).
So far, we've seen how you can specify codecs so that they get applied at the input level. However, it is also possible to use request-wide codecs that aggregate multiple inputs to decode the payload. This is usually relevant for cases where the models expect a multi-column input type, like a Pandas DataFrame.
To illustrate this, we will first tweak our EchoRuntime so that it prints the decoded contents at the request level.
We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.
%%writefile runtime.py
import json
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
from mlserver.codecs import DecodedParameterName
_to_exclude = {
"parameters": {DecodedParameterName, "headers"},
'inputs': {"__all__": {"parameters": {DecodedParameterName, "headers"}}}
}
class EchoRuntime(MLModel):
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
outputs = []
for request_input in payload.inputs:
decoded_input = self.decode(request_input)
print(f"------ Encoded Input ({request_input.name}) ------")
as_dict = request_input.dict(exclude=_to_exclude) # type: ignore
print(json.dumps(as_dict, indent=2))
print(f"------ Decoded input ({request_input.name}) ------")
print(decoded_input)
outputs.append(
ResponseOutput(
name=request_input.name,
datatype=request_input.datatype,
shape=request_input.shape,
data=request_input.data
)
)
return InferenceResponse(model_name=self.name, outputs=outputs)
%%writefile model-settings.json
{
"name": "content-type-example",
"implementation": "runtime.EchoRuntime"
}
import requests
payload = {
"inputs": [
{
"name": "parameters-np",
"datatype": "INT32",
"shape": [2, 2],
"data": [1, 2, 3, 4],
"parameters": {
"content_type": "np"
}
},
{
"name": "parameters-str",
"datatype": "BYTES",
"shape": [1],
"data": "hello world 😁",
"parameters": {
"content_type": "str"
}
}
]
}
response = requests.post(
"http://localhost:8080/v2/models/content-type-example/infer",
json=payload
)
import requests
import numpy as np
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.codecs import NumpyCodec, StringCodec
parameters_np = np.array([[1, 2], [3, 4]])
parameters_str = ["hello world 😁"]
payload = InferenceRequest(
inputs=[
NumpyCodec.encode_input("parameters-np", parameters_np),
# The `use_bytes=False` flag will ensure that the encoded payload is JSON-compatible
StringCodec.encode_input("parameters-str", parameters_str, use_bytes=False),
]
)
response = requests.post(
"http://localhost:8080/v2/models/content-type-example/infer",
json=payload.model_dump()
)
response_payload = InferenceResponse.parse_raw(response.text)
print(NumpyCodec.decode_output(response_payload.outputs[0]))
print(StringCodec.decode_output(response_payload.outputs[1]))
%%writefile model-settings.json
{
"name": "content-type-example",
"implementation": "runtime.EchoRuntime",
"inputs": [
{
"name": "metadata-np",
"datatype": "INT32",
"shape": [2, 2],
"parameters": {
"content_type": "np"
}
},
{
"name": "metadata-str",
"datatype": "BYTES",
"shape": [11],
"parameters": {
"content_type": "str"
}
}
]
}
import requests
payload = {
"inputs": [
{
"name": "metadata-np",
"datatype": "INT32",
"shape": [2, 2],
"data": [1, 2, 3, 4],
},
{
"name": "metadata-str",
"datatype": "BYTES",
"shape": [11],
"data": "hello world 😁",
}
]
}
response = requests.post(
"http://localhost:8080/v2/models/content-type-example/infer",
json=payload
)
%%writefile runtime.py
import io
import json
from PIL import Image
from mlserver import MLModel
from mlserver.types import (
InferenceRequest,
InferenceResponse,
RequestInput,
ResponseOutput,
)
from mlserver.codecs import NumpyCodec, register_input_codec, DecodedParameterName
from mlserver.codecs.utils import InputOrOutput
_to_exclude = {
"parameters": {DecodedParameterName},
"inputs": {"__all__": {"parameters": {DecodedParameterName}}},
}
@register_input_codec
class PillowCodec(NumpyCodec):
ContentType = "img"
DefaultMode = "L"
@classmethod
def can_encode(cls, payload: Image) -> bool:
        # `Image` here is the PIL.Image module, so check against the Image.Image class
        return isinstance(payload, Image.Image)
@classmethod
def _decode(cls, input_or_output: InputOrOutput) -> Image:
if input_or_output.datatype != "BYTES":
# If not bytes, assume it's an array
image_array = super().decode_input(input_or_output) # type: ignore
return Image.fromarray(image_array, mode=cls.DefaultMode)
encoded = input_or_output.data
if isinstance(encoded, str):
encoded = encoded.encode()
return Image.frombytes(
mode=cls.DefaultMode, size=input_or_output.shape, data=encoded
)
@classmethod
def encode_output(cls, name: str, payload: Image) -> ResponseOutput: # type: ignore
byte_array = io.BytesIO()
payload.save(byte_array, mode=cls.DefaultMode)
return ResponseOutput(
name=name, shape=payload.size, datatype="BYTES", data=byte_array.getvalue()
)
@classmethod
def decode_output(cls, response_output: ResponseOutput) -> Image:
return cls._decode(response_output)
@classmethod
def encode_input(cls, name: str, payload: Image) -> RequestInput: # type: ignore
output = cls.encode_output(name, payload)
return RequestInput(
name=output.name,
shape=output.shape,
datatype=output.datatype,
data=output.data,
)
@classmethod
def decode_input(cls, request_input: RequestInput) -> Image:
return cls._decode(request_input)
class EchoRuntime(MLModel):
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
outputs = []
for request_input in payload.inputs:
decoded_input = self.decode(request_input)
print(f"------ Encoded Input ({request_input.name}) ------")
as_dict = request_input.dict(exclude=_to_exclude) # type: ignore
print(json.dumps(as_dict, indent=2))
print(f"------ Decoded input ({request_input.name}) ------")
print(decoded_input)
outputs.append(
ResponseOutput(
name=request_input.name,
datatype=request_input.datatype,
shape=request_input.shape,
data=request_input.data,
)
)
        return InferenceResponse(model_name=self.name, outputs=outputs)
import requests
payload = {
"inputs": [
{
"name": "image-int32",
"datatype": "INT32",
"shape": [8, 8],
"data": [
1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0
],
"parameters": {
"content_type": "img"
}
},
{
"name": "image-bytes",
"datatype": "BYTES",
"shape": [8, 8],
"data": (
"10101010"
"10101010"
"10101010"
"10101010"
"10101010"
"10101010"
"10101010"
"10101010"
),
"parameters": {
"content_type": "img"
}
}
]
}
response = requests.post(
"http://localhost:8080/v2/models/content-type-example/infer",
json=payload
)
%%writefile runtime.py
import json
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
from mlserver.codecs import DecodedParameterName
_to_exclude = {
"parameters": {DecodedParameterName},
'inputs': {"__all__": {"parameters": {DecodedParameterName}}}
}
class EchoRuntime(MLModel):
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
print("------ Encoded Input (request) ------")
as_dict = payload.dict(exclude=_to_exclude) # type: ignore
print(json.dumps(as_dict, indent=2))
print("------ Decoded input (request) ------")
decoded_request = None
if payload.parameters:
decoded_request = getattr(payload.parameters, DecodedParameterName)
print(decoded_request)
outputs = []
for request_input in payload.inputs:
outputs.append(
ResponseOutput(
name=request_input.name,
datatype=request_input.datatype,
shape=request_input.shape,
data=request_input.data
)
)
return InferenceResponse(model_name=self.name, outputs=outputs)
import requests
payload = {
"inputs": [
{
"name": "parameters-np",
"datatype": "INT32",
"shape": [2, 2],
"data": [1, 2, 3, 4],
"parameters": {
"content_type": "np"
}
},
{
"name": "parameters-str",
"datatype": "BYTES",
"shape": [2, 11],
"data": ["hello world 😁", "bye bye 😁"],
"parameters": {
"content_type": "str"
}
}
],
"parameters": {
"content_type": "pd"
}
}
response = requests.post(
"http://localhost:8080/v2/models/content-type-example/infer",
json=payload
)
Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like datetime objects, Pillow images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider, "lower level" spec, which only defines what fields a payload should have.
To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.
To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy array. Without the use of content types, the V2 payload itself would probably lack information about how it should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify what "higher level" information is encoded within the V2 protocol payloads.
To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy Array, Pandas DataFrame, etc.), you can specify it through the content_type field of the parameters section of your request.
As an example, we can consider the following dataframe, containing two columns: Age and First Name.
This table could be specified in the V2 protocol as the following payload, where we declare that:
The whole set of inputs should be decoded as a Pandas Dataframe (i.e. setting the content type as pd).
The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as str).
To learn more about the available content types and how to use them, see the full list in the section below.
Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are abstractions which know how to encode and decode high-level Python types to and from the V2 Inference Protocol.
Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas Dataframe would need to aggregate all of the input-/output-heads present in a V2 Inference Protocol response.
However, a NumPy array or a list of strings could be encoded directly as an input head within a larger request.
To account for this, codecs can work at either the request- / response-level (known as request codecs) or the input- / output-level (known as input codecs). Each of these codecs exposes the following public interface, where Any represents a high-level Python datatype (e.g. a Pandas DataFrame, a NumPy array, etc.):
Request Codecs
encode_request() <mlserver.codecs.RequestCodec.encode_request>
decode_request() <mlserver.codecs.RequestCodec.decode_request>
encode_response() <mlserver.codecs.RequestCodec.encode_response>
decode_response() <mlserver.codecs.RequestCodec.decode_response>
Input Codecs
encode_input() <mlserver.codecs.InputCodec.encode_input>
decode_input() <mlserver.codecs.InputCodec.decode_input>
encode_output() <mlserver.codecs.InputCodec.encode_output>
decode_output() <mlserver.codecs.InputCodec.decode_output>
Note that these methods can also be used as helpers to encode requests and decode responses on the client side. This can help to abstract away from the user most of the details about the underlying structure of V2-compatible payloads.
For example, we could use codecs to encode the DataFrame from the example above into a V2-compatible request simply as:
For a full end-to-end example of how content types and codecs work under the hood, feel free to check out this worked example.
When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the mlserver.types package (i.e. InferenceRequest <mlserver.types.InferenceRequest> or InferenceResponse <mlserver.types.InferenceResponse>). Therefore, if you want to use them with requests (or any other package outside of MLServer) you will need to convert them to a Python dict or a JSON string.
Luckily, these classes leverage Pydantic under the hood. Therefore, you can just call the .model_dump() or .model_dump_json() methods to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class' constructor (or use any of the helper methods available within Pydantic).
For example, if we want to send an inference request to model foo, we could do something along the following lines:
The NaN (Not a Number) value is used in Numpy and other scientific libraries to describe an invalid or missing value (e.g. a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful sometimes with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.
In order to send / receive NaN values, you must ensure that:
You are using the REST interface.
The input / output entry containing NaN values uses either the FP16, FP32 or FP64 datatypes.
You are either using the Pandas codec or the NumPy codec.
Assuming those conditions are satisfied, any null value within your tensor payload will be converted to NaN.
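On the decoding side, the sketch below (which assumes the NumPy codec is used to decode the input) illustrates the behaviour described above; the null entries of the payload are expected to come out as NaN values:

```python
from mlserver.codecs import NumpyCodec
from mlserver.types import RequestInput

# `null` values in the JSON payload arrive as `None` in Python
request_input = RequestInput(
    name="foo",
    datatype="FP64",
    shape=[2, 2],
    data=[1.2, 2.3, None, 4.5],
)

decoded = NumpyCodec.decode_input(request_input)
print(decoded)  # the `null` entry is expected to show up as NaN
```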
For example, if you take the following Numpy array:
We could encode it as:
Content types can also be defined as part of the model's metadata. This lets the user pre-configure what content types a model should use by default to decode / encode its requests / responses, without the need to specify it on each request.
For example, to configure the content type values of the dataframe example above, one could create a model-settings.json file like the one below:
It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.
Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.
The np content type will decode / encode V2 payloads to a NumPy Array, taking into account the following:
The datatype field will be matched to the closest NumPy dtype.
The shape field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.
For example, if we think of the following NumPy Array:
We could encode it as the input foo in a V2 protocol request as:
When using the NumPy Array content type at the request-level, it will decode the entire request by considering only the first input element. This can be used as a helper for models which only expect a single tensor.
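For instance, the sketch below (assuming the request only carries a single tensor) shows how request-level decoding with NumpyRequestCodec returns the array of that first input head:

```python
from mlserver.codecs import NumpyRequestCodec
from mlserver.types import InferenceRequest, RequestInput

inference_request = InferenceRequest(
    inputs=[
        RequestInput(name="foo", datatype="INT32", shape=[2, 2], data=[1, 2, 3, 4])
    ]
)

# Decodes the whole request into a single NumPy array,
# considering only the first input element
foo = NumpyRequestCodec.decode_request(inference_request)
print(foo.shape)  # (2, 2)
```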
The pd content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect that the DataFrame is shaped in a columnar way. That is,
Each entry of the inputs list (or outputs, in the case of responses), will represent a column of the DataFrame.
Each of these entries will contain all the row elements for that particular column.
The shape field of each input (or output) will contain (at least) the number of rows included in the DataFrame.
For example, if we consider the following dataframe:
We could encode it to the V2 Inference Protocol as:
The str content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:
The expected datatype is BYTES.
The shape field represents the number of "strings" that are encoded in the payload (e.g. the ["hello world", "one more time"] payload will have a shape of 2 elements).
For example, if we consider the following list of strings:
We could encode it to the V2 Inference Protocol as:
When using the str content type at the request-level, it will decode the entire request by considering only the first input element. This can be used as a helper for models which only expect a single string or a set of strings.
The base64 content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:
The expected datatype is BYTES.
The data field should contain the base64-encoded binary strings.
The shape field represents the number of binary strings that are encoded in the payload.
For example, if we think of the following "bytes array":
We could encode it as the input foo of a V2 request as:
The datetime content type will decode a V2 input into a Python datetime.datetime object, taking into account the following:
The expected datatype is BYTES.
The data field should contain the dates serialised following the ISO 8601 standard.
The shape field represents the number of datetimes that are encoded in the payload.
For example, if we think of the following datetime object:
We could encode it as the input foo of a V2 request as:
Available content types and their codecs:

| Content Type | Request Level | Request Codec | Input Level | Input Codec |
| --- | --- | --- | --- | --- |
| np | ✅ | mlserver.codecs.NumpyRequestCodec | ✅ | mlserver.codecs.NumpyCodec |
| pd | ✅ | mlserver.codecs.PandasCodec | ❌ | |
| str | ✅ | mlserver.codecs.string.StringRequestCodec | ✅ | mlserver.codecs.StringCodec |
| base64 | ❌ | | ✅ | mlserver.codecs.Base64Codec |
| datetime | ❌ | | ✅ | mlserver.codecs.DatetimeCodec |

Example dataframe with columns First Name and Age:

| First Name | Age |
| --- | --- |
| Joanne | 34 |
| Michael | 22 |

Example dataframe with columns A, B and C:

| A | B | C |
| --- | --- | --- |
| a1 | b1 | c1 |
| a2 | b2 | c2 |
| a3 | b3 | c3 |
| a4 | b4 | c4 |
{
"parameters": {
"content_type": "pd"
},
"inputs": [
{
"name": "First Name",
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"shape": [2],
"data": ["Joanne", "Michael"]
},
{
"name": "Age",
"datatype": "INT32",
"shape": [2],
"data": [34, 22]
    }
]
}
import pandas as pd
from mlserver.codecs import PandasCodec
dataframe = pd.DataFrame({'First Name': ["Joanne", "Michael"], 'Age': [34, 22]})
inference_request = PandasCodec.encode_request(dataframe)
print(inference_request)
import pandas as pd
import requests
from mlserver.codecs import PandasCodec
from mlserver.types import InferenceResponse
dataframe = pd.DataFrame({'First Name': ["Joanne", "Michael"], 'Age': [34, 22]})
inference_request = PandasCodec.encode_request(dataframe)
# raw_request will be a Python dictionary compatible with `requests`'s `json` kwarg
raw_request = inference_request.model_dump()
response = requests.post("localhost:8080/v2/models/foo/infer", json=raw_request)
# raw_response will be a dictionary (loaded from the response's JSON),
# therefore we can pass it as the InferenceResponse constructors' kwargs
raw_response = response.json()
inference_response = InferenceResponse(**raw_response)
import numpy as np
foo = np.array([[1.2, 2.3], [np.nan, 4.5]])
{
"inputs": [
{
"name": "foo",
"parameters": {
"content_type": "np"
},
"data": [1.2, 2.3, null, 4.5]
"datatype": "FP64",
"shape": [2, 2],
}
]
}
{
"parameters": {
"content_type": "pd"
},
"inputs": [
{
"name": "First Name",
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"shape": [-1],
},
{
"name": "Age",
"datatype": "INT32",
"shape": [-1],
},
]
}import numpy as np
foo = np.array([[1, 2], [3, 4]])
{
"inputs": [
{
"name": "foo",
"parameters": {
"content_type": "np"
},
"data": [1, 2, 3, 4]
"datatype": "INT32",
"shape": [2, 2],
}
]
}
from mlserver.codecs import NumpyRequestCodec
# Encode an entire V2 request
inference_request = NumpyRequestCodec.encode_request(foo)
from mlserver.types import InferenceRequest
from mlserver.codecs import NumpyCodec
# We can use the `NumpyCodec` to encode a single input head with name `foo`
# within a larger request
inference_request = InferenceRequest(
inputs=[
NumpyCodec.encode_input("foo", foo)
]
)
{
"parameters": {
"content_type": "pd"
},
"inputs": [
{
"name": "A",
"data": ["a1", "a2", "a3", "a4"]
"datatype": "BYTES",
"shape": [4],
},
{
"name": "B",
"data": ["b1", "b2", "b3", "b4"]
"datatype": "BYTES",
"shape": [4],
},
{
"name": "C",
"data": ["c1", "c2", "c3", "c4"]
"datatype": "BYTES",
"shape": [4],
},
]
}
import pandas as pd
from mlserver.codecs import PandasCodec
foo = pd.DataFrame({
"A": ["a1", "a2", "a3", "a4"],
"B": ["b1", "b2", "b3", "b4"],
"C": ["c1", "c2", "c3", "c4"]
})
inference_request = PandasCodec.encode_request(foo)
foo = ["bar", "bar2"]
{
"parameters": {
"content_type": "str"
},
"inputs": [
{
"name": "foo",
"data": ["bar", "bar2"]
"datatype": "BYTES",
"shape": [2],
}
]
}
from mlserver.codecs.string import StringRequestCodec
# Encode an entire V2 request
inference_request = StringRequestCodec.encode_request(foo, use_bytes=False)
from mlserver.types import InferenceRequest
from mlserver.codecs import StringCodec
# We can use the `StringCodec` to encode a single input head with name `foo`
# within a larger request
inference_request = InferenceRequest(
inputs=[
StringCodec.encode_input("foo", foo, use_bytes=False)
]
)
foo = b"Python is fun"
{
"inputs": [
{
"name": "foo",
"parameters": {
"content_type": "base64"
},
"data": ["UHl0aG9uIGlzIGZ1bg=="]
"datatype": "BYTES",
"shape": [1],
}
]
}
from mlserver.types import InferenceRequest
from mlserver.codecs import Base64Codec
# We can use the `Base64Codec` to encode a single input head with name `foo`
# within a larger request
inference_request = InferenceRequest(
inputs=[
Base64Codec.encode_input("foo", foo, use_bytes=False)
]
)
import datetime
foo = datetime.datetime(2022, 1, 11, 11, 0, 0)
{
"inputs": [
{
"name": "foo",
"parameters": {
"content_type": "datetime"
},
"data": ["2022-01-11T11:00:00"]
"datatype": "BYTES",
"shape": [1],
}
]
}
from mlserver.types import InferenceRequest
from mlserver.codecs import DatetimeCodec
# We can use the `DatetimeCodec` to encode a single input head with name `foo`
# within a larger request
inference_request = InferenceRequest(
inputs=[
DatetimeCodec.encode_input("foo", foo, use_bytes=False)
]
)
parameters
Optional[Parameters]
None
-
outputs
List[ResponseOutput]
-
-
parameters
Optional[Parameters]
None
-
parameters
Optional[Parameters]
None
-
platform
str
-
-
versions
Optional[List[str]]
None
-
shape
List[int]
-
-
version
Optional[str]
None
-
parameters
Optional[Parameters]
None
-
shape
List[int]
-
-
parameters
Optional[Parameters]
None
-
shape
List[int]
-
-
error
Optional[str]
None
-
id
Optional[str]
None
-
inputs
List[RequestInput]
-
-
outputs
Optional[List[RequestOutput]]
None
id
Optional[str]
None
-
model_name
str
-
-
model_version
Optional[str]
None
error
str
-
-
inputs
Optional[List[MetadataTensor]]
None
-
name
str
-
-
outputs
Optional[List[MetadataTensor]]
None
error
str
-
-
extensions
List[str]
-
-
name
str
-
-
version
str
-
datatype
Datatype
-
-
name
str
-
-
parameters
Optional[Parameters]
None
content_type
Optional[str]
None
-
headers
Optional[Dict[str, Any]]
None
-
ready
Optional[bool]
None
-
root
List[RepositoryIndexResponseItem]
-
-
name
str
-
-
reason
str
-
-
state
State
-
error
Optional[str]
None
-
error
Optional[str]
None
-
data
TensorData
-
-
datatype
Datatype
-
-
name
str
-
name
str
-
-
parameters
Optional[Parameters]
None
-
data
TensorData
-
-
datatype
Datatype
-
-
name
str
-
root
Union[List[Any], Any]
-
-
{
"properties": {
"error": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Error"
}
},
"title": "InferenceErrorResponse",
"type": "object"
}
{
"$defs": {
"Datatype": {
"enum": [
"BOOL",
"UINT8",
"UINT16",
"UINT32",
"UINT64",
"INT8",
"INT16",
"INT32",
"INT64",
"FP16",
"FP32",
"FP64",
"BYTES"
],
"title": "Datatype",
"type": "string"
},
"Parameters": {
"additionalProperties": true,
"properties": {
"content_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Content Type"
},
"headers": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"title": "Headers"
}
},
"title": "Parameters",
"type": "object"
},
"RequestInput": {
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"shape": {
"items": {
"type": "integer"
},
"title": "Shape",
"type": "array"
},
"datatype": {
"$ref": "#/$defs/Datatype"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
},
"data": {
"$ref": "#/$defs/TensorData"
}
},
"required": [
"name",
"shape",
"datatype",
"data"
],
"title": "RequestInput",
"type": "object"
},
"RequestOutput": {
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
}
},
"required": [
"name"
],
"title": "RequestOutput",
"type": "object"
},
"TensorData": {
"anyOf": [
{
"items": {},
"type": "array"
},
{}
],
"title": "TensorData"
}
},
"properties": {
"id": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Id"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
},
"inputs": {
"items": {
"$ref": "#/$defs/RequestInput"
},
"title": "Inputs",
"type": "array"
},
"outputs": {
"anyOf": [
{
"items": {
"$ref": "#/$defs/RequestOutput"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"title": "Outputs"
}
},
"required": [
"inputs"
],
"title": "InferenceRequest",
"type": "object"
}
{
"$defs": {
"Datatype": {
"enum": [
"BOOL",
"UINT8",
"UINT16",
"UINT32",
"UINT64",
"INT8",
"INT16",
"INT32",
"INT64",
"FP16",
"FP32",
"FP64",
"BYTES"
],
"title": "Datatype",
"type": "string"
},
"Parameters": {
"additionalProperties": true,
"properties": {
"content_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Content Type"
},
"headers": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"title": "Headers"
}
},
"title": "Parameters",
"type": "object"
},
"ResponseOutput": {
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"shape": {
"items": {
"type": "integer"
},
"title": "Shape",
"type": "array"
},
"datatype": {
"$ref": "#/$defs/Datatype"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
},
"data": {
"$ref": "#/$defs/TensorData"
}
},
"required": [
"name",
"shape",
"datatype",
"data"
],
"title": "ResponseOutput",
"type": "object"
},
"TensorData": {
"anyOf": [
{
"items": {},
"type": "array"
},
{}
],
"title": "TensorData"
}
},
"properties": {
"model_name": {
"title": "Model Name",
"type": "string"
},
"model_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Model Version"
},
"id": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Id"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
},
"outputs": {
"items": {
"$ref": "#/$defs/ResponseOutput"
},
"title": "Outputs",
"type": "array"
}
},
"required": [
"model_name",
"outputs"
],
"title": "InferenceResponse",
"type": "object"
}
{
"properties": {
"error": {
"title": "Error",
"type": "string"
}
},
"required": [
"error"
],
"title": "MetadataModelErrorResponse",
"type": "object"
}
{
"$defs": {
"Datatype": {
"enum": [
"BOOL",
"UINT8",
"UINT16",
"UINT32",
"UINT64",
"INT8",
"INT16",
"INT32",
"INT64",
"FP16",
"FP32",
"FP64",
"BYTES"
],
"title": "Datatype",
"type": "string"
},
"MetadataTensor": {
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"datatype": {
"$ref": "#/$defs/Datatype"
},
"shape": {
"items": {
"type": "integer"
},
"title": "Shape",
"type": "array"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
}
},
"required": [
"name",
"datatype",
"shape"
],
"title": "MetadataTensor",
"type": "object"
},
"Parameters": {
"additionalProperties": true,
"properties": {
"content_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Content Type"
},
"headers": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"title": "Headers"
}
},
"title": "Parameters",
"type": "object"
}
},
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"versions": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"title": "Versions"
},
"platform": {
"title": "Platform",
"type": "string"
},
"inputs": {
"anyOf": [
{
"items": {
"$ref": "#/$defs/MetadataTensor"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"title": "Inputs"
},
"outputs": {
"anyOf": [
{
"items": {
"$ref": "#/$defs/MetadataTensor"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"title": "Outputs"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
}
},
"required": [
"name",
"platform"
],
"title": "MetadataModelResponse",
"type": "object"
}
{
"properties": {
"error": {
"title": "Error",
"type": "string"
}
},
"required": [
"error"
],
"title": "MetadataServerErrorResponse",
"type": "object"
}
{
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"version": {
"title": "Version",
"type": "string"
},
"extensions": {
"items": {
"type": "string"
},
"title": "Extensions",
"type": "array"
}
},
"required": [
"name",
"version",
"extensions"
],
"title": "MetadataServerResponse",
"type": "object"
}
{
"$defs": {
"Datatype": {
"enum": [
"BOOL",
"UINT8",
"UINT16",
"UINT32",
"UINT64",
"INT8",
"INT16",
"INT32",
"INT64",
"FP16",
"FP32",
"FP64",
"BYTES"
],
"title": "Datatype",
"type": "string"
},
"Parameters": {
"additionalProperties": true,
"properties": {
"content_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Content Type"
},
"headers": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"title": "Headers"
}
},
"title": "Parameters",
"type": "object"
}
},
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"datatype": {
"$ref": "#/$defs/Datatype"
},
"shape": {
"items": {
"type": "integer"
},
"title": "Shape",
"type": "array"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
}
},
"required": [
"name",
"datatype",
"shape"
],
"title": "MetadataTensor",
"type": "object"
}
{
"additionalProperties": true,
"properties": {
"content_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Content Type"
},
"headers": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"title": "Headers"
}
},
"title": "Parameters",
"type": "object"
}
{
"properties": {
"ready": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"title": "Ready"
}
},
"title": "RepositoryIndexRequest",
"type": "object"
}
{
"$defs": {
"RepositoryIndexResponseItem": {
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Version"
},
"state": {
"$ref": "#/$defs/State"
},
"reason": {
"title": "Reason",
"type": "string"
}
},
"required": [
"name",
"state",
"reason"
],
"title": "RepositoryIndexResponseItem",
"type": "object"
},
"State": {
"enum": [
"UNKNOWN",
"READY",
"UNAVAILABLE",
"LOADING",
"UNLOADING"
],
"title": "State",
"type": "string"
}
},
"items": {
"$ref": "#/$defs/RepositoryIndexResponseItem"
},
"title": "RepositoryIndexResponse",
"type": "array"
}
{
"$defs": {
"State": {
"enum": [
"UNKNOWN",
"READY",
"UNAVAILABLE",
"LOADING",
"UNLOADING"
],
"title": "State",
"type": "string"
}
},
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Version"
},
"state": {
"$ref": "#/$defs/State"
},
"reason": {
"title": "Reason",
"type": "string"
}
},
"required": [
"name",
"state",
"reason"
],
"title": "RepositoryIndexResponseItem",
"type": "object"
}
{
"properties": {
"error": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Error"
}
},
"title": "RepositoryLoadErrorResponse",
"type": "object"
}
{
"properties": {
"error": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Error"
}
},
"title": "RepositoryUnloadErrorResponse",
"type": "object"
}
{
"$defs": {
"Datatype": {
"enum": [
"BOOL",
"UINT8",
"UINT16",
"UINT32",
"UINT64",
"INT8",
"INT16",
"INT32",
"INT64",
"FP16",
"FP32",
"FP64",
"BYTES"
],
"title": "Datatype",
"type": "string"
},
"Parameters": {
"additionalProperties": true,
"properties": {
"content_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Content Type"
},
"headers": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"title": "Headers"
}
},
"title": "Parameters",
"type": "object"
},
"TensorData": {
"anyOf": [
{
"items": {},
"type": "array"
},
{}
],
"title": "TensorData"
}
},
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"shape": {
"items": {
"type": "integer"
},
"title": "Shape",
"type": "array"
},
"datatype": {
"$ref": "#/$defs/Datatype"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
},
"data": {
"$ref": "#/$defs/TensorData"
}
},
"required": [
"name",
"shape",
"datatype",
"data"
],
"title": "RequestInput",
"type": "object"
}
{
"$defs": {
"Parameters": {
"additionalProperties": true,
"properties": {
"content_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Content Type"
},
"headers": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"title": "Headers"
}
},
"title": "Parameters",
"type": "object"
}
},
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
}
},
"required": [
"name"
],
"title": "RequestOutput",
"type": "object"
}
{
"$defs": {
"Datatype": {
"enum": [
"BOOL",
"UINT8",
"UINT16",
"UINT32",
"UINT64",
"INT8",
"INT16",
"INT32",
"INT64",
"FP16",
"FP32",
"FP64",
"BYTES"
],
"title": "Datatype",
"type": "string"
},
"Parameters": {
"additionalProperties": true,
"properties": {
"content_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Content Type"
},
"headers": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"title": "Headers"
}
},
"title": "Parameters",
"type": "object"
},
"TensorData": {
"anyOf": [
{
"items": {},
"type": "array"
},
{}
],
"title": "TensorData"
}
},
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"shape": {
"items": {
"type": "integer"
},
"title": "Shape",
"type": "array"
},
"datatype": {
"$ref": "#/$defs/Datatype"
},
"parameters": {
"anyOf": [
{
"$ref": "#/$defs/Parameters"
},
{
"type": "null"
}
],
"default": null
},
"data": {
"$ref": "#/$defs/TensorData"
}
},
"required": [
"name",
"shape",
"datatype",
"data"
],
"title": "ResponseOutput",
"type": "object"
}
{
"anyOf": [
{
"items": {},
"type": "array"
},
{}
],
"title": "TensorData"
}