Advanced Use Cases
In this tutorial, we demonstrate some advanced use cases of the Local Embeddings MLServer runtime. Before proceeding, please ensure you have the Local Embeddings server up and running - see our installation guidelines on how to do so. We also recommend familiarizing yourself with the basic use case of the runtime by first following the example here.
Before deploying the actual models, we define a helper function to retrieve the correct IP address. The IP address will be used to construct the endpoints to which the inference requests are sent.
import subprocess

def get_mesh_ip():
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
We are now ready to dive into some more advanced functionalities.
Prompt templating
Some retrieval models achieve optimal performance when the input text is prefixed with a specific prompt. For example, queries for [BAAI/bge-large-en-v1.5](link here) must be prefixed with "Represent this sentence for searching relevant passages: ", while intfloat/multilingual-e5-large expects the prefix "query: " for all queries and "passage: " for all passages.
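The runtime delegates prompt handling to SentenceTransformers. As a rough standalone illustration (not going through the runtime; the prompt names and prefixes below simply mirror the e5 example above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "intfloat/multilingual-e5-large",
    prompts={"query": "query: ", "passage": "passage: "},
    default_prompt_name="query",
)

# The default prompt ("query: ") is prepended automatically ...
query_emb = model.encode(["How to bake a cake"])

# ... or a prompt can be selected explicitly per call.
passage_emb = model.encode(["Preheat the oven to 180C ..."], prompt_name="passage")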
For the SentenceTransformers backend, you can specify the available prompts and the default prompt in the model-settings.json file:
!cat ./models/embed-prompts/model-settings.json
{
"name": "embed-prompts",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"model_settings": {
"model_name_or_path": "intfloat/multilingual-e5-large",
"device": "cuda",
"prompts": {
"classification": "Classify the following text: ",
"retrieval": "Retrieve semantically similar text: ",
"clustering": "Identify the topic or theme based on the text: "
},
"default_prompt_name": "retrieval"
}
}
}
}
}
The associated manifest file is:
!cat manifests/embed-prompts.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: embed-prompts
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-prompts"
requirements:
- local-embeddings
To load the model in Seldon Core 2 (SC2), run the following command:
!kubectl apply -f manifests/embed-prompts.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-prompts created
model.mlops.seldon.io/embed-prompts condition met
Now, we can query the model as before:
import pprint
import requests
inference_request = {
"inputs": [
{
"name": "input",
"shape": [1],
"datatype": "BYTES",
"data": ["How to bake a cake"],
}
]
}
endpoint = f"http://{get_mesh_ip()}/v2/models/embed-prompts/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=3)
{'id': '0cbe36df-ffc6-4527-92b4-858b8bfe2293',
'model_name': 'embed-prompts_1',
'model_version': '1',
'outputs': [{'data': [...],
'datatype': 'FP32',
'name': 'embedding',
'parameters': {...},
'shape': [...]}],
'parameters': {}}
You can also send a prompt as a parameter of the inference request:
import pprint
import requests
inference_request = {
"inputs": [
{
"name": "input",
"shape": [1],
"datatype": "BYTES",
"data": ["How to bake a cake"],
}
],
"parameters": {
"embedding_parameters": {
"prompt": "Query: "
}
}
}
endpoint = f"http://{get_mesh_ip()}/v2/models/embed-prompts/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=3)
{'id': '52bcd476-75d2-4135-a8b6-a0ab43733dec',
'model_name': 'embed-prompts_1',
'model_version': '1',
'outputs': [{'data': [...],
'datatype': 'FP32',
'name': 'embedding',
'parameters': {...},
'shape': [...]}],
'parameters': {}}
Note that the embeddings differ because we changed the prompt.
Input sequence length
Transformer models like BERT, RoBERTa, and DistilBERT have runtime and memory requirements that grow quadratically with the input length. BERT-based models usually limit the input text to 512 tokens, which is roughly 300-400 English words. The SentenceTransformers backend allows you to cap the maximum input length at a specific value to reduce this cost.
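Under the hood this corresponds to SentenceTransformers' max_seq_length attribute. A minimal local sketch, assuming the same all-MiniLM-L6-v2 model used below:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # the model's default limit (256 for this model)

# Cap the limit: longer inputs are truncated to 200 tokens before encoding.
model.max_seq_length = 200
embedding = model.encode(["A potentially very long document ..."])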
We begin by deploying an embedding model with the default sequence length, so we can compare its embeddings against a model that uses a smaller sequence length. The model-settings.json file for the default setup is the following:
!cat models/embed-gpu/model-settings.json
{
"name": "embed-gpu",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"model_settings": {
"model_name_or_path": "all-MiniLM-L6-v2",
"device": "cuda"
}
}
}
}
}
The associated manifest file is:
!cat manifests/embed-gpu.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: embed-gpu
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-gpu"
requirements:
- local-embeddings
We can now deploy the model by running the following command:
!kubectl apply -f manifests/embed-gpu.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-gpu created
model.mlops.seldon.io/embed-gpu condition met
model.mlops.seldon.io/embed-prompts condition met
Now we can deploy the embedding model with the shorter sequence length. The following model-settings.json file configures the deployment to truncate the input to 200 tokens:
!cat models/embed-max-seq-len/model-settings.json
{
"name": "embed-max-seq-len",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"max_seq_length": 200,
"model_settings": {
"model_name_or_path": "all-MiniLM-L6-v2",
"device": "cuda"
}
}
}
}
}
The associated manifest file is:
!cat ./manifests/embed-max-seq-len.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: embed-max-seq-len
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-max-seq-len"
requirements:
- local-embeddings
We now deploy the model as before, by running the following command:
!kubectl apply -f manifests/embed-max-seq-len.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-max-seq-len created
model.mlops.seldon.io/embed-gpu condition met
model.mlops.seldon.io/embed-max-seq-len condition met
model.mlops.seldon.io/embed-prompts condition met
We can now send requests to the embed-gpu and embed-max-seq-len models and compare the results.
We begin by sending a small input that does not exceed 200 tokens. Both models should return the same embeddings.
import requests
import numpy as np
inference_request = {
"inputs": [
{
"name": "input",
"shape": [1],
"datatype": "BYTES",
"data": ["Seldon Core 2 is the best!"],
}
]
}
endpoint1 = f"http://{get_mesh_ip()}/v2/models/embed-gpu/infer"
response1 = requests.post(endpoint1, json=inference_request)
endpoint2 = f"http://{get_mesh_ip()}/v2/models/embed-max-seq-len/infer"
response2 = requests.post(endpoint2, json=inference_request)
print(
f"Are the outputs the same? {np.allclose(response1.json()['outputs'][0]['data'], response2.json()['outputs'][0]['data'])}"
)
Are the outputs the same? True
We now send a longer input that exceeds 200 tokens. We expect the results to differ since the second model will truncate the input.
import requests
import numpy as np
inference_request = {
"inputs": [
{
"name": "input",
"shape": [1],
"datatype": "BYTES",
"data": [
"Kubernetes, commonly abbreviated as K8s, is an open-source platform developed to automate the "
"deployment, scaling, and management of containerized applications. Initially created by Google "
"and now maintained by the Cloud Native Computing Foundation (CNCF), Kubernetes provides a robust "
"system for managing distributed applications across different environments, including on-premises "
"servers and cloud infrastructures. The platform abstracts the underlying infrastructure, allowing "
"developers to concentrate on application development rather than hardware details. Kubernetes "
"organizes its architecture around clusters, which consist of worker machines known as nodes. These "
"nodes run containerized applications, while the Kubernetes control plane, which includes components "
"like the API server, scheduler, and controller manager, handles the overall cluster management. "
"Kubernetes offers several key features that make it indispensable for modern application management. "
"One of its primary functions is container orchestration, which allows it to manage and distribute "
"numerous containers across nodes to ensure high availability and resilience. It automates rollouts "
"and rollbacks, enabling smooth application updates and quick recovery to previous states if issues "
"arise. Self-healing is another significant feature, where Kubernetes monitors the cluster’s state and "
"can restart failed containers, replace non-responsive nodes, and terminate containers failing health "
"checks. It also provides load balancing and service discovery, ensuring even distribution of network "
"traffic across containers and facilitating communication between services. Additionally, Kubernetes "
"supports storage orchestration, allowing for the automatic mounting of storage systems from various "
"sources, such as local storage or cloud providers. For managing sensitive information, Kubernetes "
"offers secure secret and configuration management, handling sensitive data like passwords and application "
"configurations effectively. These features allow Kubernetes to support a microservices architecture, "
"where different parts of an application can be developed, deployed, and scaled independently. This "
"architecture improves development speed, scalability, and the overall reliability of applications. "
"Kubernetes has become the standard for container orchestration due to its flexibility and the powerful "
"tools it provides for managing complex applications at scale. By enabling consistent and repeatable "
"deployment processes, Kubernetes helps organizations maintain stability across different environments, "
"contributing to improved operational efficiency and development agility. Its active community and "
"continuous enhancements have solidified its place as a cornerstone in cloud-native application development "
"and modern DevOps practices, making it an essential tool for any organization looking to efficiently manage "
"containerized applications at scale."
],
}
]
}
endpoint1 = f"http://{get_mesh_ip()}/v2/models/embed-gpu/infer"
response1 = requests.post(endpoint1, json=inference_request)
endpoint2 = f"http://{get_mesh_ip()}/v2/models/embed-max-seq-len/infer"
response2 = requests.post(endpoint2, json=inference_request)
print(
f"Are the outputs the same? {np.allclose(response1.json()['outputs'][0]['data'], response2.json()['outputs'][0]['data'])}"
)
Are the outputs the same? False
To unload the models, run the following commands:
!kubectl delete -f manifests/embed-gpu.yaml -n seldon
!kubectl delete -f manifests/embed-max-seq-len.yaml -n seldon
model.mlops.seldon.io "embed-gpu" deleted
model.mlops.seldon.io "embed-max-seq-len" deleted
Multi-Process / Multi-GPU encoding
One can encode the input text using more than one CPU or more than one GPU. In this example, we demonstrate multi-CPU encoding, but you can switch to GPUs by simply changing the device to "cuda".
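Under the hood this relies on SentenceTransformers' multi-process encoding API. A rough standalone sketch, not going through the runtime; the device list below is an assumption for a four-worker CPU setup:

from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
    sentences = ["Seldon Core 2 is the best!"] * 1_000

    # One worker process per entry; use e.g. ["cuda:0", "cuda:1"] for multi-GPU.
    pool = model.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)
    print(embeddings.shape)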
The model-settings.json for this use case is the following:
!cat ./models/embed-multi-proc/model-settings.json
{
"name": "embed-gpu",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"multi_process": true,
"model_settings": {
"model_name_or_path": "all-MiniLM-L6-v2",
"device": "cpu"
}
}
}
}
}
The associated manifest file is:
!cat ./manifests/embed-multi-proc.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: embed-multi-proc
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-multi-proc"
requirements:
- local-embeddings
To deploy the model, run the following command:
!kubectl apply -f manifests/embed-multi-proc.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-multi-proc created
model.mlops.seldon.io/embed-multi-proc condition met
model.mlops.seldon.io/embed-prompts condition met
We can now send a larger batch of inputs to be embedded. You can monitor your CPU usage to confirm that the backend is actually using multiple CPUs. In the following example, we send a batch of 10,000 inputs in a single request.
import requests
batch_size = 10_000
inference_request = {
"inputs": [
{
"name": "input",
"shape": [batch_size],
"datatype": "BYTES",
"data": [
"Kubernetes, commonly abbreviated as K8s, is an open-source platform developed to automate the "
"deployment, scaling, and management of containerized applications."
] * batch_size
}
]
}
endpoint = f"http://{get_mesh_ip()}/v2/models/embed-multi-proc/infer"
response = requests.post(endpoint, json=inference_request)
print("Status: ", response.status_code)
Status: 200
To unload the model, run the following command:
!kubectl delete -f manifests/embed-multi-proc.yaml -n seldon
model.mlops.seldon.io "embed-multi-proc" deleted
Binary Quantization
SentenceTransformers has an option to quantize the embeddings into a binary representation. For example, a 1024-dimensional embedding is binarized into 1024 bits. Behind the scenes, SentenceTransformers packs the bits into bytes using np.packbits, which results in 128 bytes.
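The size arithmetic can be illustrated with a small NumPy snippet, using a toy thresholded vector rather than a real embedding:

import numpy as np

# 1024 binarized dimensions -> 1024 bits, packed 8 per byte -> 1024 / 8 = 128 bytes.
bits = (np.random.randn(1, 1024) > 0).astype(np.uint8)
packed = np.packbits(bits, axis=-1)
print(packed.shape)  # (1, 128)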
We will query the same embed-prompts model from above, passing a parameter in the inference request that specifies the desired precision.
import requests
inference_request = {
"inputs": [
{
"name": "input",
"shape": [1],
"datatype": "BYTES",
"data": ["How to bake a cake"],
}
],
"parameters": {
"embedding_parameters": {
"precision": "binary"
}
}
}
endpoint = f"http://{get_mesh_ip()}/v2/models/embed-prompts/infer"
response = requests.post(endpoint, json=inference_request)
print("Output shape:", response.json()["outputs"][0]["shape"])
print("Output datatype:", response.json()["outputs"][0]["datatype"])
Output shape: [1, 128]
Output datatype: INT8
As we can see, the embedding is represented as integers, and the output has 128 dimensions because the 1024 bits are packed into 128 bytes.
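If you need the individual bits back on the client side, one option is to view the returned INT8 values as unsigned bytes and unpack them. A small sketch reusing the response from the cell above:

import numpy as np

packed = np.array(response.json()["outputs"][0]["data"], dtype=np.int8)
bits = np.unpackbits(packed.view(np.uint8).reshape(1, 128), axis=-1)
print(bits.shape)  # (1, 1024)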
If the desired default behavior of the model is to quantize the embeddings using binary precision, you can specify that in the model-settings.json file as:
!cat ./models/embed-binary/model-settings.json
{
"name": "embed-binary",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"model_settings": {
"model_name_or_path": "intfloat/multilingual-e5-large",
"device": "cuda",
"prompts": {
"classification": "Classify the following text: ",
"retrieval": "Retrieve semantically similar text: ",
"clustering": "Identify the topic or theme based on the text: "
},
"default_prompt_name": "retrieval",
"embedding_parameters": {
"precision": "binary"
}
}
}
}
}
}
Unload the model by running the following command:
!kubectl delete -f manifests/embed-prompts.yaml -n seldon
model.mlops.seldon.io "embed-prompts" deleted
Scalar (int8) Quantization
SentenceTransformers also supports scalar quantization, which transforms the float32 representation into int8. This process maps the continuous range of float32 values onto a discrete set of int8 values, which can take 256 distinct values from -128 to 127. To guide this discretization, you must provide a large calibration dataset.
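Locally, this corresponds to SentenceTransformers' quantize_embeddings helper. A minimal sketch, with a few placeholder calibration sentences standing in for a large calibration dataset such as nq_open:

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Calibration embeddings define the float32 range mapped onto the 256 int8 buckets.
calibration_sentences = ["What is Kubernetes?", "How do containers work?", "Where is the Eiffel Tower?"]
calibration_embeddings = model.encode(calibration_sentences)

embeddings = model.encode(["How to bake a cake"])
int8_embeddings = quantize_embeddings(
    embeddings, precision="int8", calibration_embeddings=calibration_embeddings
)
print(int8_embeddings.dtype, int8_embeddings.shape)  # int8, (1, 1024)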
The model-settings.json file for a scalar quantization model looks like this:
!cat ./models/embed-scalar-quant/model-settings.json
{
"name": "embed-scalar-quant",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"model_settings": {
"model_name_or_path": "mixedbread-ai/mxbai-embed-large-v1",
"device": "cuda"
},
"calibration_dataset": {
"name": "nq_open",
"size": 1000,
"column": "question"
},
"quantization_precision": "int8"
}
}
}
}
The corresponding manifest file is:
!cat ./manifests/embed-scalar-quant.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: embed-scalar-quant
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-scalar-quant"
requirements:
- local-embeddings
To deploy the model on SC2, run the following command:
!kubectl apply -f manifests/embed-scalar-quant.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-scalar-quant created
model.mlops.seldon.io/embed-scalar-quant condition met
Once the model is ready, we can send a request:
import requests
inference_request = {
"inputs": [
{
"name": "input",
"shape": [1],
"datatype": "BYTES",
"data": ["How to bake a cake"],
}
],
"parameters": {
"embedding_parameters": {
"quantize_embeddings": True
}
}
}
endpoint = f"http://{get_mesh_ip()}/v2/models/embed-scalar-quant/infer"
response = requests.post(endpoint, json=inference_request)
print("Output shape:", response.json()["outputs"][0]["shape"])
print("Output datatype:", response.json()["outputs"][0]["datatype"])
Output shape: [1, 1024]
Output datatype: INT8
We can observe that the output is represented as integers, while the embedding keeps its dimensionality of 1024.
To unload the model, run the following command:
!kubectl delete -f manifests/embed-scalar-quant.yaml -n seldon
model.mlops.seldon.io "embed-scalar-quant" deleted
Speeding up Inference
SentenceTransformers supports three compute backends, each with its own optimizations. Before describing them briefly, we define a utility function to send a request, which we will reuse for all backends.
import pprint
import requests

def send_request(endpoint: str):
    inference_request = {
        "inputs": [
            {
                "name": "input",
                "shape": [2],
                "datatype": "BYTES",
                "data": [
                    "This is an example sentence",
                    "Each sentence is converted"
                ],
            }
        ],
    }
    response = requests.post(endpoint, json=inference_request)
    pprint.pprint(response.json(), depth=3)
PyTorch
By default, SentenceTransformers uses the PyTorch backend, which loads the model on the best available device ("cuda", "mps", or "cpu") and performs all operations in float32 precision. To speed up inference and reduce the memory footprint, you can use the float16 representation, which usually leads to minimal loss in quality.
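For reference, the equivalent direct SentenceTransformers call would look roughly like this (the runtime performs this step for you based on the config below):

import torch
from sentence_transformers import SentenceTransformer

# model_kwargs is forwarded to Hugging Face transformers; torch_dtype=float16
# loads the weights in half precision.
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    device="cuda",
    model_kwargs={"torch_dtype": torch.float16},
)
embeddings = model.encode(["This is an example sentence"])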
The model-settings.json for a model loaded in float16 looks like this:
!cat ./models/embed-half/model-settings.json
{
"name": "embed-half",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"model_settings": {
"model_name_or_path": "all-MiniLM-L6-v2",
"device": "cuda",
"model_kwargs": {
"torch_dtype": "float16"
}
}
}
}
}
}
The corresponding manifest file is:
!cat ./manifests/embed-half.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: embed-half
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-half"
requirements:
- local-embeddings
We deploy the model as before by running the command:
!kubectl apply -f manifests/embed-half.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-half created
model.mlops.seldon.io/embed-half condition met
Once the model is deployed, we can perform inference by calling the send_request function:
send_request(endpoint=f"http://{get_mesh_ip()}/v2/models/embed-half/infer")
{'id': '81e86fe1-e849-4f6c-bcad-49d2314bb009',
'model_name': 'embed-half_1',
'model_version': '1',
'outputs': [{'data': [...],
'datatype': 'FP16',
'name': 'embedding',
'parameters': {...},
'shape': [...]}],
'parameters': {}}
ONNX backend
SentenceTransformers supports inference through ONNX by converting the model into the ONNX format and running it with ONNX Runtime. Inference can run on either CPU or GPU.
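For reference, loading the ONNX backend directly with SentenceTransformers looks roughly like this (the runtime handles this based on the config below):

from sentence_transformers import SentenceTransformer

# Exports (or downloads) an ONNX version of the model and runs it via ONNX Runtime.
model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
embeddings = model.encode(["This is an example sentence"])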
The model-settings.json file for the ONNX backend looks like this:
!cat ./models/embed-onnx/model-settings.json
{
"name": "embed-onnx",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"model_settings": {
"model_name_or_path": "all-MiniLM-L6-v2",
"backend": "onnx"
}
}
}
}
}
The corresponding manifest file is:
!cat ./manifests/embed-onnx.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: embed-onnx
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-onnx"
requirements:
- local-embeddings
To deploy the model, run the following command:
!kubectl apply -f manifests/embed-onnx.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-onnx created
model.mlops.seldon.io/embed-half condition met
model.mlops.seldon.io/embed-onnx condition met
We can test the model by sending an inference request:
send_request(endpoint = f"http://{get_mesh_ip()}/v2/models/embed-onnx/infer")
{'id': '25f8172c-9b10-4b82-be11-c9bd58ee2ca5',
'model_name': 'embed-onnx_1',
'model_version': '1',
'outputs': [{'data': [...],
'datatype': 'FP32',
'name': 'embedding',
'parameters': {...},
'shape': [...]}],
'parameters': {}}
OpenVINO
Finally, we will showcase how to use OpenVINO to accelerate inference on CPU by exporting the model to the OpenVINO format.
The model-settings.json for OpenVINO looks like this:
!cat ./models/embed-openvino/model-settings.json
{
"name": "embed-openvino",
"implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
"parameters": {
"extra": {
"backend": "sentence-transformers",
"config": {
"model_settings": {
"model_name_or_path": "all-MiniLM-L6-v2",
"backend": "openvino"
}
}
}
}
}
The associated manifest file is:
!cat ./manifests/embed-openvino.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: embed-openvino
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-openvino"
requirements:
- local-embeddings
We deploy the model as before:
!kubectl apply -f manifests/embed-openvino.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-openvino created
model.mlops.seldon.io/embed-half condition met
model.mlops.seldon.io/embed-onnx condition met
model.mlops.seldon.io/embed-openvino condition met
To test the model, we send an inference request:
send_request(endpoint=f"http://{get_mesh_ip()}/v2/models/embed-openvino/infer")
{'id': '405e1bb0-d4ec-45be-a0be-83c7d00fd77d',
'model_name': 'embed-openvino_1',
'model_version': '1',
'outputs': [{'data': [...],
'datatype': 'FP32',
'name': 'embedding',
'parameters': {...},
'shape': [...]}],
'parameters': {}}
For more details on the backends, please check the SentenceTransformers page on Speeding up Inference.
To unload all the models, run the following commands:
!kubectl delete -f manifests/embed-half.yaml -n seldon
!kubectl delete -f manifests/embed-onnx.yaml -n seldon
!kubectl delete -f manifests/embed-openvino.yaml -n seldon
model.mlops.seldon.io "embed-half" deleted
model.mlops.seldon.io "embed-onnx" deleted
model.mlops.seldon.io "embed-openvino" deleted