
Monitoring

In this example we demonstrate how to use the monitoring functionality of the Local runtime. Metrics logging is done through the Prometheus/Grafana stack. To run this tutorial, you need the Local Runtime server up and running; please check our installation tutorial for setup instructions.

We will deploy a Phi-3.5 model using the Local Runtime, send multiple inference requests, and then inspect various metrics in the Grafana dashboard, such as E2E request latency, token throughput, time-per-output-token latency, time-to-first-token latency, cache utilization, and scheduler state.


Note that all backends are supported by the Local runtime. Metrics are labeled per backend, so in the Grafana dashboard you will be able to switch between them.

We will use the following model-settings.json file:

!cat models/local-chat-completions/model-settings.json
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 4096,
                    "default_generate_kwargs": {
                        "max_tokens": 1024
                    }
                }
            }
        }
    }
}

The associated CRD is the following:
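A minimal sketch of what such a Model CRD might look like; the storageUri and the llm-local requirement are assumptions and should match where your model-settings.json is stored and the capabilities your server advertises:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  storageUri: "gs://my-bucket/models/local-chat-completions"  # assumed artifact location
  requirements:
  - llm-local  # assumed capability name for the Local LLM runtime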

To load the model on our server, run the following command:
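As a sketch, assuming the CRD above is saved as model.yaml and your models live in the seldon namespace (both assumptions), you can apply it with kubectl:

!kubectl apply -f model.yaml -n seldon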

At this point, the model should be loaded.

Before sending the inference requests, let us see how we can connect to Grafana. When installing Seldon Core 2 (SC2) through Ansible, it should automatically create the Grafana deployment and the Prometheus StatefulSet in the seldon-monitoring namespace. Thus, to connect to Grafana, all we have to do is port-forward localhost:3000 to port 3000 on the Grafana container.

To do so, open a new terminal and run the following command:
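A sketch of the port-forward, assuming a Grafana pod named as below; look up the actual pod name with kubectl get pods -n seldon-monitoring:

kubectl port-forward grafana-7cf6c5c665-cs9g7 3000:3000 -n seldon-monitoring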

Alternatively, feel free to do the port-forwarding directly from a tool like k9s.


Note that the suffix of your pod name will differ from 7cf6c5c665-cs9g7. Please modify the command above so that it matches your local configuration.

Now, open a browser and go to http://localhost:3000. This should open Grafana.


If you are working on a remote machine, you might not be able to access Grafana from your browser directly. In that case, you can set up SSH tunneling by running the following command:
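A typical tunnel looks like the following, where <user> and <remote-host> are placeholders for your own setup:

ssh -L 3000:localhost:3000 <user>@<remote-host>

This forwards your local port 3000 to port 3000 on the remote machine, where the kubectl port-forward from above is running.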

Now you should be able to access Grafana from your browser.

At this point, you are asked for credentials to log in to Grafana. The credentials can be found in the grafana secret in the seldon-monitoring namespace. To get the user and the password, run the following commands:
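A sketch assuming the secret follows the standard Grafana chart layout, with admin-user and admin-password keys (an assumption; inspect the secret if the keys differ):

kubectl get secret grafana -n seldon-monitoring -o jsonpath='{.data.admin-user}' | base64 -d
kubectl get secret grafana -n seldon-monitoring -o jsonpath='{.data.admin-password}' | base64 -d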

In this case, the user is admin and the password is 304614. You should now see something like:

Go to Dashboards -> New -> Import.

This should open a form to upload the grafana-llmis.json file that you downloaded.

You can name your dashboard anything you want. For the sake of this tutorial we will call it Seldon Core 2 LLM. Once loaded, you should see something like:

Note that all the plots are empty since we haven't sent any requests yet. We are now ready to send multiple requests. The following script will simulate 100 requests with an average interval of 0.1 seconds between them, with the maximum number of tokens to be generated sampled uniformly between 50 and 250.
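A minimal sketch of such a script. The endpoint URL, the input names (role, content), and the way max_tokens is passed are all assumptions; adapt them to how your deployment exposes the Open Inference Protocol:

import random
import time

import requests

# Placeholder endpoint; adjust to your seldon-mesh address and model name.
URL = "http://localhost:8080/v2/models/local-chat-completions/infer"

for _ in range(100):
    payload = {
        "inputs": [
            {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
            {
                "name": "content",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["Write a short poem about observability."],
            },
        ],
        # max_tokens sampled uniformly between 50 and 250 (assumed kwarg name)
        "parameters": {"extra": {"max_tokens": random.randint(50, 250)}},
    }
    requests.post(URL, json=payload).raise_for_status()
    time.sleep(random.uniform(0.0, 0.2))  # ~0.1 s average interval between requests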

Going back to the Grafana dashboard, you should now see something like:

For more information about the metrics, see the list of metrics.

To unload the model, run the following command:
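One way to do this is to delete the CRD applied earlier; the file name and namespace below are the same assumptions as before:

!kubectl delete -f model.yaml -n seldon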
