# Example

In this tutorial, we demonstrate how to run the Local Embeddings MLServer runtime locally to serve an embedding model on your own infrastructure. We also show the runtime in two configurations: a CPU model served from Docker and a GPU model deployed on Seldon Core 2.

To begin, we pull the runtime Docker image. Pulling the image requires authentication; check our [installation tutorial](https://github.com/SeldonIO/llm-runtimes/blob/master/getting-started/installation.md) to see how to authenticate with the Docker CLI.

```bash
docker pull \
 europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-local-embeddings:0.7.0
```

To serve a local embedding model, we first need to write the associated `model-settings.json` file:

```python
!cat models/embed/model-settings.json
```

```
{
    "name": "embed",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cpu"
                }
            }
        }
    }
}
```

As the `model-settings.json` file shows, we use the [SentenceTransformers](https://sbert.net/) backend and load the `all-MiniLM-L6-v2` model on the CPU. This is one of the simplest ways to configure an embedding model through our runtime.
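If you prefer to generate the settings file from code, for example to switch the device or model name between environments, the following is a minimal sketch. The helper names `build_model_settings` and `write_model_settings` are hypothetical and not part of the runtime; the field layout simply mirrors the file shown above.

```python
import json
from pathlib import Path


def build_model_settings(name, model_name_or_path, device="cpu"):
    """Return a settings dict with the same layout as the file shown above."""
    return {
        "name": name,
        "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
        "parameters": {
            "extra": {
                "backend": "sentence-transformers",
                "config": {
                    "model_settings": {
                        "model_name_or_path": model_name_or_path,
                        "device": device,
                    }
                },
            }
        },
    }


def write_model_settings(settings, models_dir="models"):
    """Write settings to <models_dir>/<name>/model-settings.json and return the path."""
    path = Path(models_dir) / settings["name"] / "model-settings.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=4) + "\n")
    return path


# Build the same configuration as models/embed/model-settings.json and print it.
settings = build_model_settings("embed", "all-MiniLM-L6-v2", device="cpu")
print(json.dumps(settings, indent=4))
```

Calling `write_model_settings(settings)` would then place the file under `models/embed/`, which is the directory mounted into the container in the next section.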

## Starting the Runtime

Once we have our `model-settings.json` file defined, we can start serving the model by running the following command:

```bash
docker run -it --rm -p 8080:8080 \
  -v ${PWD}/models:/models \
  -e HF_TOKEN=<your_hf_token> \
  --shm-size=1g \
  europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-local-embeddings:0.7.0 \
  mlserver start /models/embed
```
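The model can take a while to download and load, so requests sent immediately after `docker run` may fail. The sketch below polls a readiness URL until it answers with HTTP 200. It assumes the runtime exposes the standard Open Inference Protocol (V2) readiness route `/v2/models/embed/ready` on the same port as the inference route used in this tutorial; the helper names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_ready(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx answers all count as "not ready".
        return False


def wait_until_ready(url, attempts=30, delay=2.0, probe=is_ready):
    """Poll `url` until `probe` succeeds or `attempts` probes have failed."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

For the container started above, a call such as `wait_until_ready("http://localhost:8080/v2/models/embed/ready")` would block until the model is loaded or the attempts run out. The `probe` parameter exists only so the polling loop can be exercised without a running server.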

## Sending Requests

To send a request to our model, use the following script:

```python
import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [3],
            "datatype": "BYTES",
            "data": [
                "The weather is lovely today.",
                "It's so sunny outside!",
                "He drove to the stadium.",
            ],
        }
    ]
}


endpoint = "http://localhost:8080/v2/models/embed/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=3)
```

```
{'id': 'c08a27b4-b1ed-46af-8be8-74e8058f6877',
 'model_name': 'embed',
 'outputs': [{'data': [...],
              'datatype': 'FP32',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}
```

Note that the `embedding` output in the inference response has shape `3 x 384`: one 384-dimensional vector for each of the three sentences we sent in the inference request, which confirms that all of them were embedded.
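To work with the embeddings, the output tensor has to be turned back into one vector per sentence. The following sketch does this in plain Python and compares two vectors with cosine similarity. It assumes the `data` field is a flat, row-major list, which is how the V2 protocol usually serializes tensors; the helper names and the tiny `sample_output` are illustrative stand-ins, not real model output.

```python
import math


def rows_from_output(output):
    """Reshape the flat `data` list of a 2-D V2 output tensor into rows."""
    n_rows, n_cols = output["shape"]
    data = output["data"]
    if len(data) != n_rows * n_cols:
        raise ValueError("data length does not match shape")
    return [data[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in for one entry of response.json()["outputs"]; values are made up.
sample_output = {
    "name": "embedding",
    "datatype": "FP32",
    "shape": [3, 2],
    "data": [1.0, 0.0, 0.0, 1.0, 1.0, 1.0],
}

rows = rows_from_output(sample_output)
print(len(rows), len(rows[0]))
print(cosine_similarity(rows[0], rows[2]))
```

With the real response, `rows_from_output(response.json()["outputs"][0])` would yield three lists of 384 floats, and the similarity between the two weather sentences could be compared against the stadium sentence.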

## Deploying on Seldon Core 2

We continue the tutorial by demonstrating how to deploy local embedding models with Seldon Core 2 on your Kubernetes cluster. Before starting this section, please ensure you have the `local-embedding` server up and running. Check the [installation guidelines](https://github.com/SeldonIO/llm-runtimes/blob/master/getting-started/installation.md) on how to set up the `local-embedding` server.

We begin by defining a `model-settings.json` file that describes the same model as above, with the only difference being that we now load the model on GPU instead of CPU:

```python
!cat ./models/embed-gpu/model-settings.json
```

```
{
    "name": "embed-gpu",
    "implementation": "mlserver_local_embeddings.LocalEmbeddingsRuntime",
    "parameters": {
        "extra": {
            "backend": "sentence-transformers",
            "config": {
                "model_settings": {
                    "model_name_or_path": "all-MiniLM-L6-v2",
                    "device": "cuda"
                }
            }
        }
    }
}
```

Besides the `model-settings.json` file, we need to define an associated manifest file:

```python
!cat manifests/embed-gpu.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: embed-gpu
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/embeddings/models/embed-gpu"
  requirements:
  - local-embeddings
```

To load the model on Seldon Core 2 (SC2) and wait for it to become ready, run the following commands:

```python
!kubectl apply -f manifests/embed-gpu.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/embed-gpu created
^C
```

Before sending the request, we need the IP address of the Seldon mesh. The following utility function retrieves it:

```python
import subprocess

def get_mesh_ip():
    # Read the external IP of the seldon-mesh LoadBalancer service.
    cmd = [
        "kubectl", "get", "svc", "seldon-mesh", "-n", "seldon",
        "-o", "jsonpath={.status.loadBalancer.ingress[0].ip}",
    ]
    return subprocess.check_output(cmd).decode("utf-8").strip()
```
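The helper above reads only the `ip` field of the load balancer ingress. Some cloud load balancers publish a `hostname` instead, and the field is empty while the address is still pending. The following is a sketch of a more tolerant variant; the names `address_from_service` and `get_mesh_address` are hypothetical, and the service name and namespace are the ones used in this tutorial.

```python
import json
import subprocess


def address_from_service(service):
    """Pick the external address from a parsed Service object, or None if pending."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        # A LoadBalancer ingress entry carries either an IP or a hostname.
        address = entry.get("ip") or entry.get("hostname")
        if address:
            return address
    return None


def get_mesh_address(name="seldon-mesh", namespace="seldon"):
    """Query kubectl for the service and return its external IP or hostname."""
    raw = subprocess.check_output(
        ["kubectl", "get", "svc", name, "-n", namespace, "-o", "json"]
    )
    return address_from_service(json.loads(raw))
```

Keeping the parsing in `address_from_service` separate from the `kubectl` call makes it possible to check the logic against a saved Service object without cluster access.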

As before, we can now send a request to the model to compute the embeddings for the given sentences:

```python
import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [3],
            "datatype": "BYTES",
            "data": [
                "The weather is lovely today.",
                "It's so sunny outside!",
                "He drove to the stadium.",
            ],
        }
    ]
}

endpoint = f"http://{get_mesh_ip()}/v2/models/embed-gpu/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=3)
```

```
{'id': '7a00ca5e-0f4d-4453-8451-73d0a417eb13',
 'model_name': 'embed-gpu_1',
 'model_version': '1',
 'outputs': [{'data': [...],
              'datatype': 'FP32',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}
```
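Both requests in this tutorial use the same payload layout, so it can be convenient to build it from a list of sentences. A minimal sketch follows; `build_inference_request` is a hypothetical helper, and the tensor name `input` and datatype `BYTES` are taken from the requests above.

```python
def build_inference_request(sentences, input_name="input"):
    """Wrap a list of strings in a V2 inference request like the ones above."""
    sentences = list(sentences)
    return {
        "inputs": [
            {
                "name": input_name,
                # One-dimensional tensor with one element per sentence.
                "shape": [len(sentences)],
                "datatype": "BYTES",
                "data": sentences,
            }
        ]
    }


request = build_inference_request(
    [
        "The weather is lovely today.",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]
)
print(request["inputs"][0]["shape"])
```

The resulting dictionary can be passed as the `json` argument of `requests.post`, exactly as in the scripts above, and the `shape` field stays in sync with the number of sentences.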

To unload the model, run the following command:

```python
!kubectl delete -f manifests/embed-gpu.yaml -n seldon
```

```
model.mlops.seldon.io "embed-gpu" deleted
```

Congrats, you've just deployed a local text embedding model!


