
Example

In this tutorial, we demonstrate how to run the Local Embeddings MLServer runtime locally to deploy an embedding model on your own infrastructure. We will also showcase different configurations in which the runtime can be used.

To begin with, we pull the runtime Docker image. Pulling the image requires authentication; check our installation tutorial to see how to authenticate with the Docker CLI.

docker pull \
 europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-local-embeddings:0.7.0

To serve a local embedding model, we first need to write the associated model-settings.json file:

!cat models/embed/model-settings.json
{
    "name": "embed",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cpu"
                }
            }
        }
    }
}

As we can see in model-settings.json, we will use the SentenceTransformers backend and load the "all-MiniLM-L6-v2" model on the CPU. This is one of the simplest ways to configure an embedding model through our runtime.

Starting the Runtime

Once we have our model-settings.json file defined, we can start serving the model by running the following command:
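One way to start the runtime is to run the pulled image with the model folder mounted inside the container. The host path, mount point, and the `mlserver start` command passed to the container are assumptions about your local layout and the image's entrypoint; adjust them to match your setup:

```shell
# Mount the local `models` folder into the container and serve the `embed` model
# on the default MLServer HTTP port (8080). Paths are assumptions about your layout.
docker run --rm -p 8080:8080 \
  -v "${PWD}/models:/mnt/models" \
  europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-local-embeddings:0.7.0 \
  mlserver start /mnt/models/embed
```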

Sending Requests

To send a request to our model, use the following script:
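A minimal sketch of such a script is shown below, using only the standard library to POST a V2 inference request to the locally served model. The input name `"text"` and the `BYTES` datatype are assumptions about what the runtime expects; the endpoint follows the V2 inference protocol served by MLServer on port 8080:

```python
import json
import urllib.request

# Hypothetical example sentences to embed.
sentences = [
    "Seldon Core 2 serves models at scale.",
    "MLServer implements the V2 inference protocol.",
    "Sentence embeddings map text to vectors.",
]

def build_v2_request(texts):
    """Build a V2 inference request carrying a batch of strings.

    The input name "text" and BYTES datatype are assumptions about the
    runtime's expected input format.
    """
    return {
        "inputs": [
            {
                "name": "text",
                "shape": [len(texts)],
                "datatype": "BYTES",
                "data": texts,
            }
        ]
    }

def embed(texts, url="http://localhost:8080/v2/models/embed/infer"):
    """POST the request to the locally served `embed` model and return the JSON response."""
    body = json.dumps(build_v2_request(texts)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the runtime running locally:
# response = embed(sentences)
```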

Note that the inference response is a numpy array of shape 3 x 384, which means that we successfully embedded the three sentences sent in the inference request.

Deploying on Seldon Core 2

We will continue our tutorial by demonstrating how to deploy the local embedding models with Seldon Core 2 on your Kubernetes cluster. Before starting this section, please ensure you have the local-embedding server up and running. Check the installation guidelines on how to set up the local-embedding server.

We begin by defining a model-settings.json file which will describe the same model as above, with the only difference that now we will load the model on GPU instead of CPU:
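The GPU variant of the configuration could look as follows; only the `device` field changes ("cuda" is the device string sentence-transformers uses for GPUs):

```json
{
    "name": "embed",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cuda"
                }
            }
        }
    }
}
```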

Besides the model-settings.json file, we will need to define an associated manifest file:
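A sketch of such a manifest is shown below, using the Seldon Core 2 `Model` custom resource. The storage URI and requirement name are hypothetical placeholders; point `storageUri` at the bucket or path that holds the model-settings.json above, and set `requirements` to match your local-embeddings server capabilities:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed
spec:
  # Hypothetical bucket containing the model-settings.json defined above.
  storageUri: "gs://my-bucket/models/embed"
  requirements:
    - local-embeddings
```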

To load the model on Seldon Core 2 (SC2), run the following command:
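Assuming the manifest above was saved as `embed.yaml` and your models live in the `seldon` namespace (both assumptions about your setup), loading the model is a standard `kubectl apply`:

```shell
kubectl apply -f embed.yaml -n seldon
```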

Before sending the actual request, we need to get the mesh IP. The following utility function will help you retrieve the correct IP:
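A minimal sketch of such a utility is shown below. It assumes Seldon Core 2 exposes its mesh as a LoadBalancer service named `seldon-mesh` in the `seldon-mesh` namespace; adjust both names to your installation:

```python
import subprocess

def get_mesh_ip(namespace: str = "seldon-mesh") -> str:
    """Return the external IP of the `seldon-mesh` LoadBalancer service.

    Assumes Seldon Core 2 is installed with its mesh service exposed as a
    LoadBalancer in `namespace` (an assumption about your cluster setup).
    """
    cmd = [
        "kubectl", "get", "svc", "seldon-mesh",
        "-n", namespace,
        "-o", "jsonpath={.status.loadBalancer.ingress[0].ip}",
    ]
    return subprocess.check_output(cmd, text=True).strip()

# With a running cluster:
# mesh_ip = get_mesh_ip()
```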

As before, we can now send a request to the model to compute the embeddings for the given sentences:
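A sketch of the request, mirroring the local example but routed through the Seldon mesh, is shown below. The `Seldon-Model` header directs the mesh to the right model; the input name `"text"` and `BYTES` datatype remain assumptions about the runtime's expected input:

```python
import json
import urllib.request

def build_v2_request(texts):
    """Build a V2 inference request for a batch of strings (input name is an assumption)."""
    return {
        "inputs": [
            {
                "name": "text",
                "shape": [len(texts)],
                "datatype": "BYTES",
                "data": texts,
            }
        ]
    }

def embed_via_mesh(texts, mesh_ip, model="embed"):
    """Send a V2 inference request through the Seldon mesh and return the JSON response."""
    body = json.dumps(build_v2_request(texts)).encode("utf-8")
    req = urllib.request.Request(
        f"http://{mesh_ip}/v2/models/{model}/infer",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Routes the request to the `embed` model behind the mesh.
            "Seldon-Model": model,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the model loaded and the mesh IP retrieved:
# response = embed_via_mesh(["A first sentence.", "A second one.", "And a third."], "<MESH_IP>")
```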

To unload the model, run the following command:
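Unloading mirrors the load step: deleting the `Model` resource removes it from the cluster (the `embed.yaml` filename and `seldon` namespace are the same assumptions as above):

```shell
kubectl delete -f embed.yaml -n seldon
```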

Congrats, you've just deployed a local text embedding model!
