Example

In this tutorial, we demonstrate how to run the Local Embeddings MLServer runtime locally to deploy an embedding model on your own infrastructure. We also showcase different configurations in which the runtime can be used.

To begin with, we pull the runtime Docker image. Pulling the image requires authentication; check our installation tutorial to see how to authenticate with the Docker CLI.

docker pull \
 europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-local-embeddings:0.7.0

To serve a local embedding model, we first need to write the associated model-settings.json file:

!cat models/embed/model-settings.json
{
    "name": "embed",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cpu"
                }
            }
        }
    }
}

As we can see in the model-settings.json, we use the SentenceTransformers backend and load the "all-MiniLM-L6-v2" model on the CPU. This is one of the simplest ways to configure an embedding model through our runtime.
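
Under the hood, these settings are passed to SentenceTransformers. If you want a quick local sanity check of what the backend will load, a minimal sketch, assuming the sentence-transformers package is installed in your environment, looks like this:

# Sketch: load the same model directly with sentence-transformers to preview
# the embeddings the runtime's backend will produce.
# Assumes `pip install sentence-transformers` has been run locally.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
embeddings = model.encode(["The weather is lovely today."])
print(embeddings.shape)  # (1, 384) for this model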

Starting the Runtime

Once we have our model-settings.json file defined, we can start serving the model by running the following command:

docker run -it --rm -p 8080:8080 \
  -v ${PWD}/models:/models \
  -e HF_TOKEN=<your_hf_token> \
  --shm-size=1g \
  europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-local-embeddings:0.7.0 \
  mlserver start /models/embed
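
Once the container is up, you can verify that both the server and the model are ready before sending inference requests. MLServer exposes the standard V2 inference protocol health endpoints; a quick check, assuming the default port mapping above, is:

import requests

# Server readiness (V2 inference protocol)
print(requests.get("http://localhost:8080/v2/health/ready").status_code)        # 200 once the server is up

# Model readiness for the "embed" model
print(requests.get("http://localhost:8080/v2/models/embed/ready").status_code)  # 200 once the model is loaded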

Sending Requests

To send a request to our model, use the following script:

import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [3],
            "datatype": "BYTES",
            "data": [
                "The weather is lovely today.",
                "It's so sunny outside!",
                "He drove to the stadium.",
            ],
        }
    ]
}


endpoint = "http://localhost:8080/v2/models/embed/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=3)
{'id': 'c08a27b4-b1ed-46af-8be8-74e8058f6877',
 'model_name': 'embed',
 'outputs': [{'data': [...],
              'datatype': 'FP32',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}

Note that the returned embedding output has shape 3 x 384, which means we successfully embedded the three sentences sent in the inference request.
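
The data field in the response is flattened; to work with the embeddings as a matrix, you can reshape it using the shape reported alongside it. A small sketch, reusing the response object from the script above and assuming numpy is installed:

import numpy as np

# Rebuild the 3 x 384 embedding matrix from the flattened response data.
output = response.json()["outputs"][0]
embeddings = np.array(output["data"], dtype=np.float32).reshape(output["shape"])
print(embeddings.shape)  # (3, 384)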

Deploying on Seldon Core 2

We continue the tutorial by demonstrating how to deploy local embedding models with Seldon Core 2 on your Kubernetes cluster. Before starting this section, please ensure you have the local-embeddings server up and running; check the installation guidelines on how to set it up.

We begin by defining a model-settings.json file which describes the same model as above, with the only difference that the model is now loaded on GPU instead of CPU:

!cat ./models/embed-gpu/model-settings.json
{
    "name": "embed-gpu",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cuda"
                }
            }
        }
    }
}
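
Since this configuration loads the model on "cuda", it is worth confirming that a GPU is actually visible from the serving environment. A minimal sketch, assuming PyTorch (which sentence-transformers depends on) is available:

import torch

# The "cuda" device in model-settings.json only works if a GPU is visible here.
print(torch.cuda.is_available())  # True if a CUDA device is visible
print(torch.cuda.device_count())  # number of visible GPUs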

Besides the model-settings.json file, we will need to define an associated manifest file:

!cat manifests/embed-gpu.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-gpu
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-gpu"
  requirements:
  - local-embeddings

To load the model on Seldon Core 2 (SC2), run the following command:

!kubectl apply -f manifests/embed-gpu.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/embed-gpu created

Before sending the actual request, we need to get the mesh IP. The following utility function will help you retrieve the correct IP:

import subprocess

def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
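
Note that the jsonpath above reads the load balancer's ip field; on clusters where the seldon-mesh service is exposed through a hostname instead, the same jsonpath with .hostname applies. A quick sanity check before sending requests:

# Print the resolved mesh address; it should be non-empty.
print(get_mesh_ip())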

As before, we can now send a request to the model to compute the embeddings for the given sentences:

import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [3],
            "datatype": "BYTES",
            "data": [
                "The weather is lovely today.",
                "It's so sunny outside!",
                "He drove to the stadium.",
            ],
        }
    ]
}

endpoint = f"http://{get_mesh_ip()}/v2/models/embed-gpu/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=3)
{'id': '7a00ca5e-0f4d-4453-8451-73d0a417eb13',
 'model_name': 'embed-gpu_1',
 'model_version': '1',
 'outputs': [{'data': [...],
              'datatype': 'FP32',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}

To unload the model, run the following command:

!kubectl delete -f manifests/embed-gpu.yaml -n seldon
model.mlops.seldon.io "embed-half" deleted
model.mlops.seldon.io "embed-onnx" deleted
model.mlops.seldon.io "embed-openvino" deleted

Congrats, you've just deployed a local text embedding model!
