Advanced Use Cases

In this tutorial, we demonstrate some advanced use cases of the Local Embeddings MLServer runtime. Before proceeding, please ensure you have the Local Embeddings server up and running - see our installation guidelines on how to do so. We also recommend being familiar with the basic use case of the runtime, so consider following the example here first.

Before deploying the actual models, we define a helper function to retrieve the correct IP address. This IP will be used to construct the endpoints to which the inference requests are sent.

import subprocess

def get_mesh_ip():
    # Return the external IP of the seldon-mesh LoadBalancer service in the seldon namespace.
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')

We are now ready to dive into some more advanced functionalities.

Prompt templating

Some retrieval models achieve optimal performance when the input text is prefixed with a specific prompt. For example, queries for BAAI/bge-large-en-v1.5 must be prefixed with "Represent this sentence for searching relevant passages: ", while intfloat/multilingual-e5-large expects the prefix "query: " for all queries and "passage: " for all passages.
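
To make the effect concrete, here is a minimal local sketch (outside the runtime) of how prompt prefixing works with the plain SentenceTransformers API, assuming a recent sentence-transformers version that supports the prompt argument to encode; the model name and prefixes simply mirror the example above:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# the prompt string is prepended to the input text before tokenization
query_emb = model.encode("How to bake a cake", prompt="query: ")
passage_emb = model.encode("Preheat the oven and mix the dry ingredients.", prompt="passage: ")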

For the SentenceTransformers backend, one can specify the prompts and the default prompt to be used in the model-settings.json file:

!cat ./models/embed-prompts/model-settings.json
{
    "name": "embed-prompts",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "intfloat/multilingual-e5-large",
                    "device": "cuda",
                    "prompts": {
                        "classification": "Classify the following text: ",
                        "retrieval": "Retrieve semantically similar text: ",
                        "clustering": "Identify the topic or theme based on the text: "
                    },
                    "default_prompt_name": "retrieval"
                }
            }
        }
    }
}

The associated manifest file is:

!cat manifests/embed-prompts.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-prompts
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-prompts"
  requirements:
  - local-embeddings

To load the model on SC2, run the following command:

!kubectl apply -f manifests/embed-prompts.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-prompts created
model.mlops.seldon.io/embed-prompts condition met

Now, we can query the model as before:

import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["How to bake a cake"],
        }
    ]
}

endpoint = f"http://{get_mesh_ip()}/v2/models/embed-prompts/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=3)
{'id': '0cbe36df-ffc6-4527-92b4-858b8bfe2293',
 'model_name': 'embed-prompts_1',
 'model_version': '1',
 'outputs': [{'data': [...],
              'datatype': 'FP32',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}

You can also send a prompt as a parameter of the inference request:

import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["How to bake a cake"],
        }
    ],
    "parameters": {
        "embedding_parameters": {
            "prompt": "Query: "
        }
    }
}

endpoint = f"http://{get_mesh_ip()}/v2/models/embed-prompts/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=3)
{'id': '52bcd476-75d2-4135-a8b6-a0ab43733dec',
 'model_name': 'embed-prompts_1',
 'model_version': '1',
 'outputs': [{'data': [...],
              'datatype': 'FP32',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}

Note that the embeddings differ because we changed the prompt.
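
To verify this programmatically, you can re-send both requests and compare the outputs, mirroring the comparison used later in this tutorial:

import numpy as np
import requests

endpoint = f"http://{get_mesh_ip()}/v2/models/embed-prompts/infer"

base_request = {
    "inputs": [
        {"name": "input", "shape": [1], "datatype": "BYTES", "data": ["How to bake a cake"]}
    ]
}
prompted_request = {**base_request, "parameters": {"embedding_parameters": {"prompt": "Query: "}}}

default_emb = requests.post(endpoint, json=base_request).json()["outputs"][0]["data"]
prompted_emb = requests.post(endpoint, json=prompted_request).json()["outputs"][0]["data"]
print("Are the outputs the same?", np.allclose(default_emb, prompted_emb))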

Input sequence length

Transformer models like BERT, RoBERTa, and DistilBERT have runtime and memory complexity that grows quadratically with the input length. BERT-based models usually limit the input text to 512 tokens, which is roughly equivalent to 300-400 words in English. The SentenceTransformers backend allows you to cap the maximum input length at a smaller value to reduce the cost of the operation.
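
The max_seq_length config key shown below presumably maps onto the max_seq_length attribute of the SentenceTransformer model; a minimal local sketch of that attribute (using the plain sentence-transformers API) looks like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)   # the model's default maximum sequence length

model.max_seq_length = 200    # inputs longer than 200 tokens are truncated
embedding = model.encode("Seldon Core 2 is the best!")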

We begin by deploying an embedding model with the default sequence length so we can compare its embeddings against a model that uses a smaller sequence length. The model-settings.json file for the default setup is the following:

!cat models/embed-gpu/model-settings.json
{
    "name": "embed-gpu",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cuda"
                }
            }
        }
    }
}

The associated manifest file is:

!cat manifests/embed-gpu.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-gpu
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-gpu"
  requirements:
  - local-embeddings

We can now deploy the model by running the following command:

!kubectl apply -f manifests/embed-gpu.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-gpu created
model.mlops.seldon.io/embed-gpu condition met
model.mlops.seldon.io/embed-prompts condition met

Now we can deploy the embedding model with the shorter sequence length. The following model-settings.json file configures the deployment to truncate the input to 200 tokens:

!cat models/embed-max-seq-len/model-settings.json
{
    "name": "embed-max-seq-len",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "max_seq_length": 200,
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cuda"
                }
            }
        }
    }
}

The associated manifest file is:

!cat ./manifests/embed-max-seq-len.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-max-seq-len
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-max-seq-len"
  requirements:
  - local-embeddings

We now deploy the model as before, by running the following command:

!kubectl apply -f manifests/embed-max-seq-len.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-max-seq-len created
model.mlops.seldon.io/embed-gpu condition met
model.mlops.seldon.io/embed-max-seq-len condition met
model.mlops.seldon.io/embed-prompts condition met

We can now send requests to both the embed-gpu and embed-max-seq-len models and compare the results.

We begin by sending a short input that does not exceed 200 tokens. Both models should return the same embeddings.

import requests
import numpy as np

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Seldon Core 2 is the best!"],
        }
    ]
}

endpoint1 = f"http://{get_mesh_ip()}/v2/models/embed-gpu/infer"
response1 = requests.post(endpoint1, json=inference_request)

endpoint2 = f"http://{get_mesh_ip()}/v2/models/embed-max-seq-len/infer"
response2 = requests.post(endpoint2, json=inference_request)

print(
    f"Are the outputs the same? {np.allclose(response1.json()['outputs'][0]['data'], response2.json()['outputs'][0]['data'])}"
)
Are the outputs the same? True

We now send a longer input that exceeds 200 tokens. We expect the results to differ since the second model will truncate the input.

import requests
import numpy as np

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [1],
            "datatype": "BYTES",
            "data": [
                "Kubernetes, commonly abbreviated as K8s, is an open-source platform developed to automate the "
                "deployment, scaling, and management of containerized applications. Initially created by Google "
                "and now maintained by the Cloud Native Computing Foundation (CNCF), Kubernetes provides a robust "
                "system for managing distributed applications across different environments, including on-premises "
                "servers and cloud infrastructures. The platform abstracts the underlying infrastructure, allowing "
                "developers to concentrate on application development rather than hardware details. Kubernetes "
                "organizes its architecture around clusters, which consist of worker machines known as nodes. These "
                "nodes run containerized applications, while the Kubernetes control plane, which includes components "
                "like the API server, scheduler, and controller manager, handles the overall cluster management. "
                "Kubernetes offers several key features that make it indispensable for modern application management. "
                "One of its primary functions is container orchestration, which allows it to manage and distribute "
                "numerous containers across nodes to ensure high availability and resilience. It automates rollouts "
                "and rollbacks, enabling smooth application updates and quick recovery to previous states if issues "
                "arise. Self-healing is another significant feature, where Kubernetes monitors the cluster’s state and "
                "can restart failed containers, replace non-responsive nodes, and terminate containers failing health "
                "checks. It also provides load balancing and service discovery, ensuring even distribution of network "
                "traffic across containers and facilitating communication between services. Additionally, Kubernetes "
                "supports storage orchestration, allowing for the automatic mounting of storage systems from various "
                "sources, such as local storage or cloud providers. For managing sensitive information, Kubernetes "
                "offers secure secret and configuration management, handling sensitive data like passwords and application "
                "configurations effectively. These features allow Kubernetes to support a microservices architecture, "
                "where different parts of an application can be developed, deployed, and scaled independently. This "
                "architecture improves development speed, scalability, and the overall reliability of applications. "
                "Kubernetes has become the standard for container orchestration due to its flexibility and the powerful "
                "tools it provides for managing complex applications at scale. By enabling consistent and repeatable "
                "deployment processes, Kubernetes helps organizations maintain stability across different environments, "
                "contributing to improved operational efficiency and development agility. Its active community and "
                "continuous enhancements have solidified its place as a cornerstone in cloud-native application development "
                "and modern DevOps practices, making it an essential tool for any organization looking to efficiently manage "
                "containerized applications at scale."
            ],
        }
    ]
}

endpoint1 = f"http://{get_mesh_ip()}/v2/models/embed-gpu/infer"
response1 = requests.post(endpoint1, json=inference_request)

endpoint2 = f"http://{get_mesh_ip()}/v2/models/embed-max-seq-len/infer"
response2 = requests.post(endpoint2, json=inference_request)

print(
    f"Are the outputs the same? {np.allclose(response1.json()['outputs'][0]['data'], response2.json()['outputs'][0]['data'])}"
)
Are the outputs the same? False

To unload the models, run the following commands:

!kubectl delete -f manifests/embed-gpu.yaml -n seldon
!kubectl delete -f manifests/embed-max-seq-len.yaml -n seldon
model.mlops.seldon.io "embed-gpu" deleted
model.mlops.seldon.io "embed-max-seq-len" deleted

Multi-Process / Multi-GPU encoding

One can choose to encode the input text using more than one CPU or more than one GPU. In this example, we demonstrate it for CPUs, but you can switch to GPUs by simply changing the device to "cuda".
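
For reference, a minimal local sketch of SentenceTransformers multi-process encoding, which is roughly what enabling multi_process delegates to (the sentence list here is just an illustrative assumption):

from sentence_transformers import SentenceTransformer

sentences = ["Kubernetes is an open-source platform."] * 10_000

model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
pool = model.start_multi_process_pool()           # spawns one worker process per target device
embeddings = model.encode_multi_process(sentences, pool)
model.stop_multi_process_pool(pool)
print(embeddings.shape)
# note: on some platforms multi-process encoding must run under an `if __name__ == "__main__":` guard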

The model-settings.json for this use case is the following:

!cat ./models/embed-multi-proc/model-settings.json
{
    "name": "embed-gpu",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "multi_process": true,
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cpu"
                }
            }
        }
    }
}

The associated manifest file is:

!cat ./manifests/embed-multi-proc.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-multi-proc
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-multi-proc"
  requirements:
  - local-embeddings

To deploy the model, run the following command:

!kubectl apply -f manifests/embed-multi-proc.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-multi-proc created
model.mlops.seldon.io/embed-multi-proc condition met
model.mlops.seldon.io/embed-prompts condition met

We can now send a larger batch of inputs to be embedded. You can monitor your CPU usage to confirm that the backend indeed uses multiple CPUs. In the following example, we send a single request containing a batch of 10,000 inputs.

import requests

batch_size = 10_000
inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [batch_size],
            "datatype": "BYTES",
            "data": [
                "Kubernetes, commonly abbreviated as K8s, is an open-source platform developed to automate the "
                "deployment, scaling, and management of containerized applications."
            ] * batch_size
        }
    ]
}

endpoint = f"http://{get_mesh_ip()}/v2/models/embed-multi-proc/infer"
response = requests.post(endpoint, json=inference_request)
print("Status: ", response.status_code)
Status:  200

To unload the model, run the following command:

!kubectl delete -f manifests/embed-multi-proc.yaml -n seldon
model.mlops.seldon.io "embed-multi-proc" deleted

Binary Quantization

SentenceTransformers has an option to quantize embeddings into a binary representation. For example, a 1024-dimensional embedding is binarized into 1024 bits. Behind the scenes, SentenceTransformers packs the bits into bytes using np.packbits, which results in 128 bytes.
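
A minimal NumPy sketch of what this packing looks like (the thresholding at zero mirrors how binary quantization is typically implemented and is an assumption here):

import numpy as np

embedding = np.random.randn(1024).astype(np.float32)   # a stand-in float32 embedding

bits = (embedding > 0).astype(np.uint8)   # 1024 bits: 1 where the value is positive
packed = np.packbits(bits)                # 8 bits per byte -> 128 bytes
print(packed.shape, packed.dtype)         # (128,) uint8
# viewed as signed int8, these 128 bytes correspond to the INT8 output returned by the runtime below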

We will query the same embed-prompts model from above by passing a parameter in the inference request that specifies the desired precision.

import requests

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["How to bake a cake"],
        }
    ],
    "parameters": {
        "embedding_parameters": {
            "precision": "binary"
        }
    }
}

endpoint = f"http://{get_mesh_ip()}/v2/models/embed-prompts/infer"
response = requests.post(endpoint, json=inference_request)
print("Output shape:", response.json()["outputs"][0]["shape"])
print("Output datatype:", response.json()["outputs"][0]["datatype"])
Output shape: [1, 128]
Output datatype: INT8

As we can see, the output is represented as integers and contains 128 values: the 1024 bits packed into 128 bytes.

If the desired default behavior of the model is to quantize the embeddings using binary precision, one can specify that in the model-settings.json file as:

!cat ./models/embed-binary/model-settings.json
{
    "name": "embed-binary",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "intfloat/multilingual-e5-large",
                    "device": "cuda",
                    "prompts": {
                        "classification": "Classify the following text: ",
                        "retrieval": "Retrieve semantically similar text: ",
                        "clustering": "Identify the topic or theme based on the text: "
                    },
                    "default_prompt_name": "retrieval",
                    "embedding_parameters": {
                        "precision": "binary"
                    }
                }
            }
        }
    }
}

Unload the model by running the following command:

!kubectl delete -f manifests/embed-prompts.yaml -n seldon
model.mlops.seldon.io "embed-prompts" deleted

Scalar (int8) Quantization

SentenceTransformers also supports scalar quantization, which transforms the float32 representation into int8. This process maps the continuous range of float32 values onto a discrete set of int8 values, which can take 256 distinct values, from -128 to 127. For this, one has to provide a large calibration dataset to guide the discretization process.
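
For reference, a minimal local sketch of scalar quantization with the sentence-transformers quantize_embeddings helper, using a calibration set that mirrors the nq_open configuration below (the exact split selection and helper usage here are assumptions):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# calibration embeddings computed from 1000 questions of the nq_open dataset
calibration = load_dataset("nq_open", split="train[:1000]")["question"]
calibration_embeddings = model.encode(calibration)

embeddings = model.encode(["How to bake a cake"])
int8_embeddings = quantize_embeddings(
    embeddings, precision="int8", calibration_embeddings=calibration_embeddings
)
print(int8_embeddings.shape, int8_embeddings.dtype)   # dimensionality is preserved, dtype becomes int8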

The model-settings.json file for a scalar quantization model looks like this:

!cat ./models/embed-scalar-quant/model-settings.json
{
    "name": "embed-scalar-quant",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "mixedbread-ai/mxbai-embed-large-v1",
                    "device": "cuda"
                },
                "calibration_dataset": {
                    "name": "nq_open",
                    "size": 1000,
                    "column": "question"
                },
                "quantization_precision": "int8"
            }
        }
    }
}

The corresponding manifest file is:

!cat ./manifests/embed-scalar-quant.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-scalar-quant
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-scalar-quant"
  requirements:
  - local-embeddings

To deploy the model on SC2, run the following command:

!kubectl apply -f manifests/embed-scalar-quant.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-scalar-quant created
model.mlops.seldon.io/embed-scalar-quant condition met

Once the model is ready, we can send a request:

import requests

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["How to bake a cake"],
        }
    ],
    "parameters": {
        "embedding_parameters": {
            "quantize_embeddings": True
        }
    }
}

endpoint = f"http://{get_mesh_ip()}/v2/models/embed-scalar-quant/infer"
response = requests.post(endpoint, json=inference_request)
print("Output shape:", response.json()["outputs"][0]["shape"])
print("Output datatype:", response.json()["outputs"][0]["datatype"])
Output shape: [1, 1024]
Output datatype: INT8

We can observe that the output is represented as integers, while the dimensionality of the embedding is kept at 1024.

To unload the model, run the following command:

!kubectl delete -f manifests/embed-scalar-quant.yaml -n seldon
model.mlops.seldon.io "embed-scalar-quant" deleted

Speeding up Inference

SentenceTransformers supports three backends for computing embeddings, each coming with its own optimizations. Before describing them briefly, we write a utility function for sending a request, which will be used with all the backends.

import pprint
import requests

def send_request(endpoint: str):
    inference_request = {
        "inputs": [
            {
                "name": "input",
                "shape": [2],
                "datatype": "BYTES",
                "data": [
                    "This is an example sentence", 
                    "Each sentence is converted"
                ],
            }
        ],
    }
    response = requests.post(endpoint, json=inference_request)
    pprint.pprint(response.json(), depth=3)

PyTorch

By default, SentenceTransformers uses the PyTorch backend, which loads the model on the strongest available device (i.e., "cuda", then "mps", then "cpu") and performs all operations in float32 precision. To speed up inference and reduce the memory footprint, one can use the float16 representation, which usually leads to minimal loss in quality. The model-settings.json for a model loaded in float16 looks like this:

!cat ./models/embed-half/model-settings.json
{
    "name": "embed-half",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cuda",
                    "model_kwargs": {
                        "torch_dtype": "float16"
                    }
                }
            }
        }
    }
}
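
Equivalently, outside the runtime, a minimal sketch of loading the model in half precision with the plain sentence-transformers API (passing the dtype through model_kwargs, as in the config above) would be:

import torch
from sentence_transformers import SentenceTransformer

# load the model weights in float16 on the GPU
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    device="cuda",
    model_kwargs={"torch_dtype": torch.float16},
)
embeddings = model.encode(["This is an example sentence"])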

The corresponding manifest file is:

!cat ./manifests/embed-half.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-half
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-half"
  requirements:
  - local-embeddings

We deploy the model as before by running the command:

!kubectl apply -f manifests/embed-half.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-half created
model.mlops.seldon.io/embed-half condition met

Once the model is deployed, we can perform inference by calling the send_request function:

send_request(endpoint=f"http://{get_mesh_ip()}/v2/models/embed-half/infer")
{'id': '81e86fe1-e849-4f6c-bcad-49d2314bb009',
 'model_name': 'embed-half_1',
 'model_version': '1',
 'outputs': [{'data': [...],
              'datatype': 'FP16',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}

ONNX backend

SentenceTransformers supports inference through ONNX by converting the model into the ONNX format and running it with ONNX Runtime. One can choose to run inference on either CPU or GPU.
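
Outside the runtime, the same behavior can be sketched directly with the sentence-transformers API (assuming a version that supports the backend argument):

from sentence_transformers import SentenceTransformer

# the model is exported to ONNX (if needed) and executed with ONNX Runtime
model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
print(embeddings.shape)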

The model-settings.json file for the ONNX backend looks like this:

!cat ./models/embed-onnx/model-settings.json
{
    "name": "embed-onnx",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "backend": "onnx"
                }
            }
        }
    }
}

The corresponding manifest file is:

!cat ./manifests/embed-onnx.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-onnx
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-onnx"
  requirements:
  - local-embeddings

To deploy the model, run the following command:

!kubectl apply -f manifests/embed-onnx.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-onnx created
model.mlops.seldon.io/embed-half condition met
model.mlops.seldon.io/embed-onnx condition met

We can test the model by sending an inference request:

send_request(endpoint=f"http://{get_mesh_ip()}/v2/models/embed-onnx/infer")
{'id': '25f8172c-9b10-4b82-be11-c9bd58ee2ca5',
 'model_name': 'embed-onnx_1',
 'model_version': '1',
 'outputs': [{'data': [...],
              'datatype': 'FP32',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}

OpenVINO

Finally, we will showcase how to use OpenVINO to accelerate inference on CPU by exporting the model to the OpenVINO format.

The model-settings.json for OpenVINO looks like this:

!cat ./models/embed-openvino/model-settings.json
{
    "name": "embed-openvino",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "backend": "openvino"
                }
            }
        }
    }
}

The associated manifest file is:

!cat ./manifests/embed-openvino.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-openvino
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-openvino"
  requirements:
  - local-embeddings

We deploy the model as before:

!kubectl apply -f manifests/embed-openvino.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-openvino created
model.mlops.seldon.io/embed-half condition met
model.mlops.seldon.io/embed-onnx condition met
model.mlops.seldon.io/embed-openvino condition met

To test the model, we send an inference request:

send_request(endpoint=f"http://{get_mesh_ip()}/v2/models/embed-openvino/infer")
{'id': '405e1bb0-d4ec-45be-a0be-83c7d00fd77d',
 'model_name': 'embed-openvino_1',
 'model_version': '1',
 'outputs': [{'data': [...],
              'datatype': 'FP32',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}

For more details on the backends, please check the SentenceTransformers page on Speeding up Inference.

To unload all the models, run the following commands:

!kubectl delete -f manifests/embed-half.yaml -n seldon
!kubectl delete -f manifests/embed-onnx.yaml -n seldon
!kubectl delete -f manifests/embed-openvino.yaml -n seldon
model.mlops.seldon.io "embed-half" deleted
model.mlops.seldon.io "embed-onnx" deleted
model.mlops.seldon.io "embed-openvino" deleted
