# Monitoring

In this example, we demonstrate how to use the monitoring functionality of the Local runtime. Metrics are logged through the Prometheus/Grafana stack. To run this tutorial, you need the following:

* Seldon Core 2 installed with Prometheus and Grafana
* The Local Runtime up and running
* The `grafana-llmis.json` dashboard file, downloaded from the [GitHub repository](https://github.com/SeldonIO/seldon-core/tree/v2/prometheus/dashboards)

Please check our [installation tutorial](/llm-module/introduction/installation.md) to get the Local Runtime server up and running.

We will deploy a `phi-3.5` model using the Local Runtime, send multiple inference requests, and then inspect various metrics in the Grafana dashboard, such as end-to-end (E2E) request latency, token throughput, time per output token, time to first token, cache utilization, and scheduler state.
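To make the latency metrics listed above concrete, here is a minimal sketch of how they are commonly computed from per-request timestamps. The `RequestTiming` class and the function names are hypothetical and exist only for illustration; the dashboard derives its values from server-side Prometheus metrics, so its exact definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Hypothetical timestamps (in seconds) recorded for a single request."""
    start: float        # request sent
    first_token: float  # first output token produced
    end: float          # last output token produced
    output_tokens: int  # number of generated tokens


def e2e_latency(t: RequestTiming) -> float:
    # Total time from sending the request to receiving the last token.
    return t.end - t.start


def time_to_first_token(t: RequestTiming) -> float:
    # Time spent before the first token appears (queueing plus prefill).
    return t.first_token - t.start


def time_per_output_token(t: RequestTiming) -> float:
    # Average gap between consecutive tokens after the first one.
    if t.output_tokens < 2:
        return 0.0
    return (t.end - t.first_token) / (t.output_tokens - 1)


def token_throughput(timings: List[RequestTiming]) -> float:
    # Generated tokens per second over the window spanned by all requests.
    window = max(t.end for t in timings) - min(t.start for t in timings)
    return sum(t.output_tokens for t in timings) / window
```

For example, a request that starts at `0.0`, produces its first token at `0.5`, and finishes at `2.5` after 101 tokens has an E2E latency of 2.5 seconds, a time to first token of 0.5 seconds, and a time per output token of 0.02 seconds.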

{% hint style="info" %}
All backends of the Local runtime are supported. Metrics are labeled per backend, so you can switch between backends in the Grafana dashboard.
{% endhint %}

We will use the following `model-settings.json` file:

```python
!cat models/local-chat-completions/model-settings.json
```

```
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 4096,
                    "default_generate_kwargs": {
                        "max_tokens": 1024
                    }
                }
            }
        }
    }
}
```
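As a quick sanity check before deploying, the settings above can be parsed and summarized. The sketch below is illustrative only: the `summarize` and `check_token_budget` helpers are hypothetical and not part of the runtime, and the JSON is inlined so that the example is self-contained. With the vLLM backend, `tensor_parallel_size` is the number of GPUs the model is sharded across, so this configuration assumes 4 GPUs are available.

```python
import json

# Inlined copy of the model-settings.json shown above.
SETTINGS = """
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 4096,
                    "default_generate_kwargs": {"max_tokens": 1024}
                }
            }
        }
    }
}
"""


def summarize(settings: dict) -> dict:
    """Collect the fields that matter most for capacity planning."""
    extra = settings["parameters"]["extra"]
    model_settings = extra["config"]["model_settings"]
    return {
        "backend": extra["backend"],
        "model": model_settings["model"],
        "gpus": model_settings["tensor_parallel_size"],
        "max_model_len": model_settings["max_model_len"],
        "default_max_tokens": model_settings["default_generate_kwargs"]["max_tokens"],
    }


def check_token_budget(summary: dict) -> bool:
    """The default generation budget should fit in the context window."""
    return summary["default_max_tokens"] <= summary["max_model_len"]


summary = summarize(json.loads(SETTINGS))
print(summary)
print("token budget fits:", check_token_budget(summary))
```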

The associated `Model` custom resource is the following:

```python
!cat manifests/local-chat-completions.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/monitoring/local/models/local-chat-completions"
  requirements:
  - llm-local
```

To load the model on our server, run the following command:

```python
!kubectl apply -f manifests/local-chat-completions.yaml -n seldon
```

```
model.mlops.seldon.io/local-chat-completions created
```

At this point, the model should be loaded.
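Loading a large model can take a while, so it may help to wait explicitly until the `Model` resource reports ready. The following is a minimal sketch that wraps `kubectl wait`; the helper names and the 300-second default timeout are assumptions made for illustration.

```python
import subprocess
from typing import List


def wait_command(model_name: str, namespace: str = "seldon", timeout_s: int = 300) -> List[str]:
    """Build a `kubectl wait` command that blocks until the model is ready."""
    return [
        "kubectl", "wait",
        "--for=condition=ready",
        f"--timeout={timeout_s}s",
        f"model/{model_name}",
        "-n", namespace,
    ]


def wait_until_ready(model_name: str, namespace: str = "seldon", timeout_s: int = 300) -> bool:
    """Return True if the model became ready before the timeout expired."""
    result = subprocess.run(
        wait_command(model_name, namespace, timeout_s),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `wait_until_ready("local-chat-completions")` requires `kubectl` to be on your `PATH` and configured for the cluster.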

Before sending the inference requests, let us see how to connect to Grafana. When SC2 is installed through Ansible, it should automatically create the `grafana` deployment and the `prometheus` stateful set under the `seldon-monitoring` namespace. Thus, to connect to `grafana`, all we have to do is forward `localhost:3000` to port `3000` of the Grafana container.

To do so, open a new terminal and run the following command:

```bash
kubectl port-forward grafana-7cf6c5c665-cs9g7 3000:3000 -n seldon-monitoring
```

Alternatively, set up the port-forwarding directly from a tool like k9s.

{% hint style="info" %}
Note that the suffix of your pod name will differ from `7cf6c5c665-cs9g7`. Please modify the command above so that it matches your local configuration.
{% endhint %}
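Because the pod suffix changes between installations, the pod name can also be looked up programmatically. The sketch below is one possible approach, assuming the Grafana pod name starts with `grafana-`, as in the command above. It parses the output of `kubectl get pods -n seldon-monitoring -o name`, which prints one `pod/<name>` entry per line. The helper names are hypothetical.

```python
import subprocess
from typing import List


def find_grafana_pod(kubectl_output: str) -> str:
    """Return the first `pod/grafana-...` entry from `kubectl get pods -o name`."""
    for line in kubectl_output.splitlines():
        name = line.strip()
        if name.startswith("pod/grafana-"):
            return name
    raise LookupError("no pod starting with 'grafana-' was found")


def port_forward_command(pod: str, namespace: str = "seldon-monitoring", port: int = 3000) -> List[str]:
    """Build the port-forward command for the given pod."""
    return ["kubectl", "port-forward", pod, f"{port}:{port}", "-n", namespace]


def list_pods(namespace: str = "seldon-monitoring") -> str:
    """List the pods in the namespace (requires kubectl and cluster access)."""
    return subprocess.check_output(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"], text=True
    )
```

With cluster access, `port_forward_command(find_grafana_pod(list_pods()))` yields the command to run in a separate terminal.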

Now, open a browser and access `http://localhost:3000`. This should open `grafana` for you.

{% hint style="info" %}
If you are working on a remote machine, you might not be able to access `grafana` directly from your browser. In that case, you can set up an SSH tunnel by running the following command:

```bash
ssh -L 3000:localhost:3000 user@remote_server
```

Now you should be able to access `grafana` from your browser.
{% endhint %}

At this point, you are asked for credentials to log in to Grafana. The credentials can be found under the `grafana` secret in the `seldon-monitoring` namespace. To get the user and the password, run the following commands:

```python
!kubectl get secret --namespace seldon-monitoring grafana -o jsonpath="{.data.admin-user}" | base64 --decode ; echo
```

```
admin
```

```python
!kubectl get secret --namespace seldon-monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
```

```
r-296523
```
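Kubernetes stores secret values base64-encoded, which is why both commands pipe the result through `base64 --decode`. As a small illustration, the same decoding can be done in Python on the `data` field of a secret. The `decode_secret_data` helper and the sample value are made up for this example.

```python
import base64
from typing import Dict


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Decode every base64-encoded value in a secret's `data` mapping."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Illustrative input: "YWRtaW4=" is the base64 encoding of "admin".
example = {"admin-user": "YWRtaW4="}
print(decode_secret_data(example))
```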

In this case, the user is `admin` and the password is the value printed by the second command (`r-296523` in the output above). After logging in, you should see something like:

![](/files/JIhf5WeQBKhVagX9Dkjc)

Go to `Dashboards -> New -> Import`.

![](/files/n2plXeeFT6KV9LKMGPPc)

This should open a form to upload the `grafana-llmis.json` file that you downloaded.

![](/files/lhoeHZMfO69i1MgvhGor)

You can name your dashboard anything you want. For the sake of this tutorial we will call it `Seldon Core 2 LLM`. Once loaded, you should see something like:

![](/files/zIPWlideYoi8qYv4HbbS)

Note that all the plots are empty since we haven't sent any requests yet. We are now ready to send multiple requests. The following script simulates 100 requests spaced about 0.1 seconds apart on average, with the maximum number of generated tokens sampled uniformly between 50 and 250 for each request.

```python
!pip install aiohttp -q
```

```python
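import subprocess

def get_mesh_ip():
    # Look up the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    # Strip surrounding whitespace so the value can be embedded in a URL.
    return subprocess.check_output(cmd, shell=True).decode('utf-8').strip()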

```

```python
import asyncio
import aiohttp
import random


async def send_request(endpoint, inference_request, session, request_num):
    try:
        async with session.post(endpoint, json=inference_request) as response:
            _ = await response.text()
            print(f"Request {request_num}: Status {response.status}")
    except Exception as e:
        print(f"Request {request_num} failed: {e}")
        return None


async def main(endpoint, inference_request, n=100):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(n):
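            # Copy the base request and sample a per-request token budget.
            _inference_request = inference_request.copy()
            _inference_request["parameters"] = {"kwargs": {"max_tokens": random.randint(50, 250), "ignore_eos": True}}

            # Schedule the request immediately so that the sleep below actually staggers the sends.
            tasks.append(asyncio.create_task(send_request(endpoint, _inference_request, session, i)))
            await asyncio.sleep(random.uniform(0.08, 0.12))

        # Wait for all in-flight requests to complete.
        await asyncio.gather(*tasks)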


inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "content",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["What is the tallest building in the world?"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "type",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["text"],
            "parameters": {"content_type": "str"},
        }
    ]
}

# Run the event loop
endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer"
await main(endpoint, inference_request)

```

```
Request 6: Status 200
Request 2: Status 200
Request 13: Status 200
Request 4: Status 200
Request 9: Status 200
Request 15: Status 200
Request 18: Status 200
Request 8: Status 200
Request 5: Status 200
Request 14: Status 200
Request 20: Status 200
Request 17: Status 200
Request 16: Status 200
Request 1: Status 200
Request 26: Status 200
Request 0: Status 200
Request 24: Status 200
Request 19: Status 200
Request 7: Status 200
Request 27: Status 200
Request 3: Status 200
Request 23: Status 200
Request 12: Status 200
Request 21: Status 200
Request 10: Status 200
Request 11: Status 200
Request 28: Status 200
Request 30: Status 200
Request 40: Status 200
Request 42: Status 200
Request 45: Status 200
Request 32: Status 200
Request 29: Status 200
Request 25: Status 200
Request 33: Status 200
Request 31: Status 200
Request 34: Status 200
Request 43: Status 200
Request 37: Status 200
Request 38: Status 200
Request 22: Status 200
Request 53: Status 200
Request 74: Status 200
Request 85: Status 200
Request 35: Status 200
Request 41: Status 200
Request 72: Status 200
Request 49: Status 200
Request 39: Status 200
Request 88: Status 200
Request 66: Status 200
Request 58: Status 200
Request 89: Status 200
Request 57: Status 200
Request 36: Status 200
Request 46: Status 200
Request 52: Status 200
Request 44: Status 200
Request 82: Status 200
Request 61: Status 200
Request 51: Status 200
Request 77: Status 200
Request 76: Status 200
Request 81: Status 200
Request 64: Status 200
Request 65: Status 200
Request 73: Status 200
Request 71: Status 200
Request 70: Status 200
Request 56: Status 200
Request 78: Status 200
Request 86: Status 200
Request 91: Status 200
Request 84: Status 200
Request 50: Status 200
Request 69: Status 200
Request 80: Status 200
Request 62: Status 200
Request 59: Status 200
Request 54: Status 200
Request 75: Status 200
Request 79: Status 200
Request 83: Status 200
Request 48: Status 200
Request 60: Status 200
Request 55: Status 200
Request 96: Status 200
Request 87: Status 200
Request 93: Status 200
Request 47: Status 200
Request 92: Status 200
Request 90: Status 200
Request 68: Status 200
Request 63: Status 200
Request 67: Status 200
Request 98: Status 200
Request 94: Status 200
Request 99: Status 200
Request 97: Status 200
Request 95: Status 200
```
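To interpret the dashboard, it helps to estimate the load this script generates. Since `ignore_eos` is set, each request is expected to generate its full `max_tokens` budget, so the total is roughly 100 × 150 = 15,000 output tokens, submitted over about 100 × 0.1 = 10 seconds. The sketch below reproduces this arithmetic and checks it against one random draw; the `simulate_load` helper and the fixed seed are illustrative assumptions.

```python
import random

N_REQUESTS = 100
MIN_TOKENS, MAX_TOKENS = 50, 250
MIN_GAP, MAX_GAP = 0.08, 0.12

# Expected values under uniform sampling.
expected_tokens = N_REQUESTS * (MIN_TOKENS + MAX_TOKENS) / 2
expected_send_window = N_REQUESTS * (MIN_GAP + MAX_GAP) / 2


def simulate_load(seed: int = 0):
    """Draw one realization of the token budgets and inter-request gaps."""
    rng = random.Random(seed)
    budgets = [rng.randint(MIN_TOKENS, MAX_TOKENS) for _ in range(N_REQUESTS)]
    gaps = [rng.uniform(MIN_GAP, MAX_GAP) for _ in range(N_REQUESTS)]
    return sum(budgets), sum(gaps)


total_tokens, send_window = simulate_load()
print(f"expected tokens: {expected_tokens:.0f}, sampled: {total_tokens}")
print(f"expected send window: {expected_send_window:.1f}s, sampled: {send_window:.1f}s")
```

The generation itself takes longer than the send window, so the token throughput panel reflects how quickly the server works through this backlog.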

Going back to the `grafana` dashboard, you should now see something like:

![](/files/ysWMh9As0G0DRG5cCf1a)

For more information about the metrics, see the [list of metrics](https://docs.seldon.ai/seldon-core-2/user-guide/operational-monitoring/usage#list-of-metrics).
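The values plotted in the dashboard can also be read directly from Prometheus through its HTTP API, where instant queries are served under `/api/v1/query`. The sketch below assumes that Prometheus has been made reachable at `http://localhost:9090`, for example through a port-forward similar to the one used for Grafana. That address and the helper names are assumptions for illustration. The example uses the built-in `up` metric; substitute a metric name from the list linked above.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def run_instant_query(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run the query and return the list of result series."""
    with urllib.request.urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        body = json.load(resp)
    return body["data"]["result"]


# Assumed address; adjust it to wherever Prometheus is reachable.
print(instant_query_url("http://localhost:9090", "up"))
```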

To unload the model, run the following command:

```python
!kubectl delete -f manifests/local-chat-completions.yaml -n seldon
```

```
model.mlops.seldon.io "local-chat-completions" deleted
```


