Monitoring

In this example, we demonstrate how to use the monitoring functionality of the Local runtime. Metrics are collected and visualized with the Prometheus/Grafana stack. To run this tutorial, you need the following in place:

Seldon Core 2 installed with Prometheus and Grafana
The Local Runtime
The grafana-llmis.json file, downloaded from the GitHub repository

Please check our installation tutorial to get the Local Runtime server up and running.

We will deploy a Phi-3.5 model using the Local Runtime, send multiple inference requests, and then inspect various metrics in the Grafana dashboard, such as E2E request latency, token throughput, time per output token, time to first token, cache utilization, and scheduler state.

Note that monitoring is supported for all backends of the Local runtime. Metrics are labeled per backend, so you can switch between backends in the Grafana dashboard.

We will use the following model-settings.json file:

!cat models/local-chat-completions/model-settings.json

{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 4096,
                    "default_generate_kwargs": {
                        "max_tokens": 1024
                    }
                }
            }
        }
    }
}
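
As a quick sanity check before deploying, the settings can be inspected from Python. This is only an illustrative sketch: `summarize_settings` is a hypothetical helper, not part of the runtime, and it assumes the nesting shown above.

```python
import json

# Same structure as the model-settings.json above, trimmed to the
# fields that the helper reads.
SETTINGS_JSON = """
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "max_model_len": 4096,
                    "default_generate_kwargs": {"max_tokens": 1024}
                }
            }
        }
    }
}
"""


def summarize_settings(settings: dict) -> dict:
    """Hypothetical helper: extract the fields relevant to this tutorial."""
    extra = settings["parameters"]["extra"]
    model_settings = extra["config"]["model_settings"]
    generate_kwargs = model_settings.get("default_generate_kwargs", {})
    return {
        "backend": extra["backend"],
        "model": model_settings["model"],
        "tensor_parallel_size": model_settings.get("tensor_parallel_size", 1),
        "default_max_tokens": generate_kwargs.get("max_tokens"),
    }


summary = summarize_settings(json.loads(SETTINGS_JSON))
print(summary)
```

With the vLLM backend, a `tensor_parallel_size` of 4 shards the model across four GPUs, so the server needs at least that many available.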

The associated Model manifest is the following:

!cat manifests/local-chat-completions.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/monitoring/local/models/local-chat-completions"
  requirements:
  - llm-local

To load the model on our server, run the following command:

!kubectl apply -f manifests/local-chat-completions.yaml -n seldon

model.mlops.seldon.io/local-chat-completions created

At this point, the model should be loaded.
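
Rather than assuming the model is loaded, you can block until the Model resource reports a ready condition. The sketch below wraps `kubectl wait`; the helper names are hypothetical and the 600-second timeout is an arbitrary choice, not a recommended value.

```python
import subprocess


def build_wait_command(model_name: str, namespace: str = "seldon", timeout_s: int = 600) -> list:
    """Build a `kubectl wait` command for a Seldon Core 2 Model resource."""
    return [
        "kubectl", "wait",
        "--for", "condition=ready",
        f"--timeout={timeout_s}s",
        "model", model_name,
        "-n", namespace,
    ]


def wait_for_model(model_name: str, namespace: str = "seldon", timeout_s: int = 600) -> bool:
    """Run the command; return True if the model became ready before the timeout."""
    cmd = build_wait_command(model_name, namespace, timeout_s)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0


# Only the command is built here; call wait_for_model(...) against a live cluster.
cmd = build_wait_command("local-chat-completions")
print(" ".join(cmd))
```

Calling `wait_for_model("local-chat-completions")` in the notebook returns `False` if the model did not become ready in time, which is a good moment to inspect the server logs.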

Before sending the inference requests, let us see how to connect to Grafana. When SC2 is installed through Ansible, the Grafana deployment and the Prometheus stateful set should be created automatically under the seldon-monitoring namespace. To connect to Grafana, all we have to do is forward localhost:3000 to port 3000 of the Grafana container.

To do so, open a new terminal and run the following command:

kubectl port-forward grafana-7cf6c5c665-cs9g7 3000:3000 -n seldon-monitoring

or feel free to do the port-forwarding directly from a tool like k9s.

Note that the suffix of your pod name will differ from 7cf6c5c665-cs9g7. Modify the command above so that it matches the pod name in your cluster.
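
Because the pod name differs between clusters, it can be looked up rather than copied by hand. A minimal sketch, assuming the Grafana pod name starts with `grafana-` as in the command above; `pick_pod`, `list_pods`, and `port_forward_command` are hypothetical helpers, and the second pod name in the sample is made up.

```python
import subprocess


def pick_pod(pod_names: list, prefix: str = "grafana-") -> str:
    """Return the first pod whose name starts with `prefix`.

    Accepts names as printed by `kubectl get pods -o name`, i.e. "pod/<name>".
    """
    for raw in pod_names:
        name = raw.strip()
        if name.startswith("pod/"):
            name = name[len("pod/"):]
        if name.startswith(prefix):
            return name
    raise LookupError(f"no pod name starts with {prefix!r}")


def list_pods(namespace: str = "seldon-monitoring") -> list:
    """List pod names in the namespace (requires kubectl and cluster access)."""
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"], text=True
    )
    return out.splitlines()


def port_forward_command(pod: str, namespace: str = "seldon-monitoring", port: int = 3000) -> list:
    """Build the port-forward command for the given pod."""
    return ["kubectl", "port-forward", pod, f"{port}:{port}", "-n", namespace]


# Demo on sample output; on a live cluster use pick_pod(list_pods()).
sample = ["pod/grafana-7cf6c5c665-cs9g7", "pod/example-prometheus-0"]
pod = pick_pod(sample)
print(" ".join(port_forward_command(pod)))
```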

Now, open a browser and go to http://localhost:3000. This should open Grafana.

If you are working on a remote machine, you might not be able to reach Grafana from your local browser. In that case, set up an SSH tunnel to the port that the port-forward listens on (3000) by running the following command on your local machine:

ssh -L 3000:localhost:3000 user@remote_server

Now you should be able to access Grafana from your browser.
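
If the page does not load, it helps to confirm that something is actually listening on the forwarded port before debugging Grafana itself. A small sketch using only the standard library; `is_port_open` is a hypothetical helper.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


print("localhost:3000 reachable:", is_port_open("localhost", 3000))
```

A `False` result usually means that the port-forward or the SSH tunnel is not running, or that it is bound to a different port.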

At this point, Grafana asks for login credentials. They are stored in the grafana secret in the seldon-monitoring namespace. To get the username and the password, run the following commands:

!kubectl get secret --namespace seldon-monitoring grafana -o jsonpath="{.data.admin-user}" | base64 --decode ; echo

admin

!kubectl get secret --namespace seldon-monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

r-296523
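
The same lookup can be done from Python, which is convenient inside a notebook. The sketch assumes the secret is fetched with `kubectl get secret ... -o json`; `decode_secret` and `fetch_secret` are hypothetical helpers, and the sample values are made up.

```python
import base64
import json
import subprocess


def decode_secret(secret: dict) -> dict:
    """Decode every base64-encoded value under the secret's `data` field."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }


def fetch_secret(name: str = "grafana", namespace: str = "seldon-monitoring") -> dict:
    """Fetch a secret as a dict (requires kubectl and cluster access)."""
    out = subprocess.check_output(
        ["kubectl", "get", "secret", name, "-n", namespace, "-o", "json"], text=True
    )
    return json.loads(out)


# Demo with a made-up secret; on a live cluster use decode_secret(fetch_secret()).
sample = {
    "data": {
        "admin-user": base64.b64encode(b"admin").decode("ascii"),
        "admin-password": base64.b64encode(b"example-password").decode("ascii"),
    }
}
creds = decode_secret(sample)
print(creds["admin-user"])
```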

In this case, the user is admin and the password is the value printed by the second command. You should now see something like:

Go to Dashboards -> New -> Import.

This should open a form to upload the grafana-llmis.json file that you downloaded.

You can name your dashboard anything you want. For the sake of this tutorial we will call it Seldon Core 2 LLM. Once loaded, you should see something like:

Note that all the plots are empty because we have not sent any requests yet. We are now ready to send some. The following script sends 100 requests, waiting 0.1 seconds on average between consecutive requests (sampled uniformly between 0.08 and 0.12 seconds). For each request, the maximum number of tokens to generate is sampled uniformly between 50 and 250.

!pip install aiohttp -q

import subprocess

def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')

import asyncio
import aiohttp
import random


async def send_request(endpoint, inference_request, session, request_num):
    try:
        async with session.post(endpoint, json=inference_request) as response:
            _ = await response.text()
            print(f"Request {request_num}: Status {response.status}")
    except Exception as e:
        print(f"Request {request_num} failed: {e}")
        return None


async def main(endpoint, inference_request, n=100):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(n):
            _inference_request = inference_request.copy()
            _inference_request["parameters"] = {"kwargs": {"max_tokens": random.randint(50, 250), "ignore_eos": True}}
            
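            # Schedule the request right away so the sleep below actually spaces the requests out;
            # a bare coroutine would not start until asyncio.gather is called.
            tasks.append(asyncio.create_task(send_request(endpoint, _inference_request, session, i)))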
            await asyncio.sleep(random.uniform(0.08, 0.12))
            
        await asyncio.gather(*tasks)


inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "content",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["What is the tallest building in the world?"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "type",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["text"],
            "parameters": {"content_type": "str"},
        }
    ]
}

# Run the event loop
endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer"
await main(endpoint, inference_request)

Request 6: Status 200
Request 2: Status 200
Request 13: Status 200
Request 4: Status 200
Request 9: Status 200
Request 15: Status 200
Request 18: Status 200
Request 8: Status 200
Request 5: Status 200
Request 14: Status 200
Request 20: Status 200
Request 17: Status 200
Request 16: Status 200
Request 1: Status 200
Request 26: Status 200
Request 0: Status 200
Request 24: Status 200
Request 19: Status 200
Request 7: Status 200
Request 27: Status 200
Request 3: Status 200
Request 23: Status 200
Request 12: Status 200
Request 21: Status 200
Request 10: Status 200
Request 11: Status 200
Request 28: Status 200
Request 30: Status 200
Request 40: Status 200
Request 42: Status 200
Request 45: Status 200
Request 32: Status 200
Request 29: Status 200
Request 25: Status 200
Request 33: Status 200
Request 31: Status 200
Request 34: Status 200
Request 43: Status 200
Request 37: Status 200
Request 38: Status 200
Request 22: Status 200
Request 53: Status 200
Request 74: Status 200
Request 85: Status 200
Request 35: Status 200
Request 41: Status 200
Request 72: Status 200
Request 49: Status 200
Request 39: Status 200
Request 88: Status 200
Request 66: Status 200
Request 58: Status 200
Request 89: Status 200
Request 57: Status 200
Request 36: Status 200
Request 46: Status 200
Request 52: Status 200
Request 44: Status 200
Request 82: Status 200
Request 61: Status 200
Request 51: Status 200
Request 77: Status 200
Request 76: Status 200
Request 81: Status 200
Request 64: Status 200
Request 65: Status 200
Request 73: Status 200
Request 71: Status 200
Request 70: Status 200
Request 56: Status 200
Request 78: Status 200
Request 86: Status 200
Request 91: Status 200
Request 84: Status 200
Request 50: Status 200
Request 69: Status 200
Request 80: Status 200
Request 62: Status 200
Request 59: Status 200
Request 54: Status 200
Request 75: Status 200
Request 79: Status 200
Request 83: Status 200
Request 48: Status 200
Request 60: Status 200
Request 55: Status 200
Request 96: Status 200
Request 87: Status 200
Request 93: Status 200
Request 47: Status 200
Request 92: Status 200
Request 90: Status 200
Request 68: Status 200
Request 63: Status 200
Request 67: Status 200
Request 98: Status 200
Request 94: Status 200
Request 99: Status 200
Request 97: Status 200
Request 95: Status 200
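
To cross-check the E2E request latency panel, you can also time the requests on the client side. The sketch below shows one possible way to summarize such timings; `summarize_latencies` is a hypothetical helper and the sample numbers are made up, not measured.

```python
import statistics


def summarize_latencies(latencies_s: list) -> dict:
    """Summarize per-request latencies (in seconds) collected on the client."""
    if not latencies_s:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_s)
    if len(ordered) >= 2:
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
        p95 = statistics.quantiles(ordered, n=100, method="inclusive")[94]
    else:
        p95 = ordered[0]
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "p50": statistics.median(ordered),
        "p95": p95,
        "max": ordered[-1],
    }


# Made-up sample values for illustration.
stats = summarize_latencies([1.0, 2.0, 3.0, 4.0])
print(stats)
```

To collect real values, record `time.perf_counter()` before and after the `session.post` call in `send_request` and append the difference to a list. Client-side numbers include network overhead, so expect them to be somewhat higher than the server-side values in the dashboard.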

Going back to the Grafana dashboard, you should now see something like:

For more information about the metrics, see the list of metrics.
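
The dashboard panels are backed by Prometheus queries, so the same data can also be pulled programmatically through the Prometheus HTTP API. The sketch below assumes that Prometheus has been port-forwarded to localhost:9090 (a step not covered in this tutorial) and uses the built-in `up` metric as a placeholder; substitute a metric name from the list of metrics. `build_query_url` and `run_query` are hypothetical helpers.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def run_query(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run an instant query and return the result list (requires a reachable Prometheus)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Only the URL is built here; call run_query(...) once Prometheus is reachable.
url = build_query_url("http://localhost:9090", "up")
print(url)
```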

To unload the model, run the following command:

!kubectl delete -f manifests/local-chat-completions.yaml -n seldon

model.mlops.seldon.io "local-chat-completions" deleted
