Monitoring
In this example we demonstrate how to use the monitoring functionality of the Local runtime. Metrics logging is done through the Prometheus/Grafana stack. To run this tutorial, you need to have the following up and running:
Seldon Core 2 installed with Prometheus and Grafana
The Local Runtime
Please check our installation tutorial to get the Local Runtime server up and running.
We will deploy a Phi-3.5 model using the Local Runtime, send multiple inference requests, and then inspect various metrics in the Grafana dashboard, such as: E2E request latency, token throughput, time per output token latency, time to first token latency, cache utilization, scheduler state, etc.
We will use the following model-settings.json file:
!cat models/local-chat-completions/model-settings.json
{
  "name": "local-chat-completions",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "microsoft/Phi-3.5-mini-instruct",
          "tensor_parallel_size": 4,
          "dtype": "float16",
          "gpu_memory_utilization": 0.8,
          "max_model_len": 4096,
          "default_generate_kwargs": {
            "max_tokens": 1024
          }
        }
      }
    }
  }
}
The associated CRD is the following:
!cat manifests/local-chat-completions.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/monitoring/local/models/local-chat-completions"
  requirements:
  - llm-local
To load the model on our server, run the following command:
!kubectl apply -f manifests/local-chat-completions.yaml -n seldon
model.mlops.seldon.io/local-chat-completions created
At this point, the model should be loaded.
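Optionally, you can verify that the model was loaded successfully by checking the status of the Model resource (the exact output columns may vary with your Seldon Core 2 version):
!kubectl get model local-chat-completions -n seldon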
Before sending the inference requests, let us see how we can connect to Grafana. When installing SC2 through ansible, it should automatically create the grafana deployment and prometheus stateful-set under the seldon-monitoring namespace. Thus, to connect to Grafana, all we have to do is create a port-forward from localhost:3000 to the container's port 3000.
To do so, open a new terminal and run the following command:
kubectl port-forward grafana-7cf6c5c665-cs9g7 3000:3000 -n seldon-monitoring
or feel free to do the port-forwarding directly from a tool like k9s.
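Note that the Grafana pod name contains a random suffix and will differ in your cluster. If you prefer not to look up the pod name, you can port-forward to the deployment instead (assuming it is named grafana, as created by the Ansible install):
kubectl port-forward deploy/grafana 3000:3000 -n seldon-monitoring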
Now, open a browser and access http://localhost:3000. This should open Grafana for you.
At this point, you are asked for credentials to log in to Grafana. The credentials can be found under the grafana secret in the seldon-monitoring namespace. To get the user and the password, run the following commands:
!kubectl get secret --namespace seldon-monitoring grafana -o jsonpath="{.data.admin-user}" | base64 --decode ; echo
admin
!kubectl get secret --namespace seldon-monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
r-296523
In this case, the user is admin and the password is r-296523 (your values will differ). You should now see something like:
Download the LLMIS dashboard using the script below:
!pip install google-cloud-storage -q
import os

from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from a public bucket without credentials."""
    # Ensure the destination directory exists
    if not os.path.exists(os.path.dirname(destination_file_name)):
        os.makedirs(os.path.dirname(destination_file_name))

    # Initialize a storage client without credentials
    storage_client = storage.Client.create_anonymous_client()

    # Retrieve the bucket and blob (file)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)

    # Download the blob to a local file
    blob.download_to_filename(destination_file_name)
    print(f"Downloaded {source_blob_name} to {destination_file_name}")

# Example usage
bucket_name = "seldon-models"
source_blob_name = "llm-runtimes/examples/monitoring/local/assets/grafana-llmis.json"
destination_file_name = "assets/grafana-llmis.json"
download_blob(bucket_name, source_blob_name, destination_file_name)
Downloaded llm-runtimes/examples/monitoring/local/assets/grafana-llmis.json to assets/grafana-llmis.json
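Optionally, you can quickly verify that the downloaded file is valid JSON before importing it:
import json

# Quick sanity check that the dashboard file downloaded correctly and parses as JSON
with open("assets/grafana-llmis.json") as f:
    dashboard = json.load(f)

print(f"Dashboard JSON loaded with {len(dashboard)} top-level keys")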
Once downloaded, go to Dashboards -> New -> Import. This should open a form to upload the grafana-llmis.json file that you just downloaded. You can name your dashboard anything you want. For the sake of this tutorial we will call it Seldon Core 2 LLM. Once loaded, you should see something like:
Note that all the plots are empty since we haven't sent any requests yet. We are now ready to send multiple requests. The following script will simulate 100 requests with an average interval of 0.1 seconds between them, and with the maximum number of tokens to generate sampled uniformly between 50 and 250.
!pip install aiohttp -q
import subprocess

def get_mesh_ip():
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
import asyncio
import random

import aiohttp

async def send_request(endpoint, inference_request, session, request_num):
    try:
        async with session.post(endpoint, json=inference_request) as response:
            _ = await response.text()
            print(f"Request {request_num}: Status {response.status}")
    except Exception as e:
        print(f"Request {request_num} failed: {e}")
        return None

async def main(endpoint, inference_request, n=100):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(n):
            _inference_request = inference_request.copy()
            # Sample the maximum number of tokens to generate uniformly between 50 and 250
            _inference_request["parameters"] = {"kwargs": {"max_tokens": random.randint(50, 250), "ignore_eos": True}}
            # Schedule the request immediately so that requests are staggered by the sleep below
            tasks.append(asyncio.create_task(send_request(endpoint, _inference_request, session, i)))
            await asyncio.sleep(random.uniform(0.08, 0.12))
        await asyncio.gather(*tasks)
inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "content",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["What is the tallest building in the world?"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "type",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["text"],
            "parameters": {"content_type": "str"},
        }
    ]
}
# Run the event loop
endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer"
await main(endpoint, inference_request)
Request 6: Status 200
Request 2: Status 200
Request 13: Status 200
Request 4: Status 200
Request 9: Status 200
Request 15: Status 200
Request 18: Status 200
Request 8: Status 200
Request 5: Status 200
Request 14: Status 200
Request 20: Status 200
Request 17: Status 200
Request 16: Status 200
Request 1: Status 200
Request 26: Status 200
Request 0: Status 200
Request 24: Status 200
Request 19: Status 200
Request 7: Status 200
Request 27: Status 200
Request 3: Status 200
Request 23: Status 200
Request 12: Status 200
Request 21: Status 200
Request 10: Status 200
Request 11: Status 200
Request 28: Status 200
Request 30: Status 200
Request 40: Status 200
Request 42: Status 200
Request 45: Status 200
Request 32: Status 200
Request 29: Status 200
Request 25: Status 200
Request 33: Status 200
Request 31: Status 200
Request 34: Status 200
Request 43: Status 200
Request 37: Status 200
Request 38: Status 200
Request 22: Status 200
Request 53: Status 200
Request 74: Status 200
Request 85: Status 200
Request 35: Status 200
Request 41: Status 200
Request 72: Status 200
Request 49: Status 200
Request 39: Status 200
Request 88: Status 200
Request 66: Status 200
Request 58: Status 200
Request 89: Status 200
Request 57: Status 200
Request 36: Status 200
Request 46: Status 200
Request 52: Status 200
Request 44: Status 200
Request 82: Status 200
Request 61: Status 200
Request 51: Status 200
Request 77: Status 200
Request 76: Status 200
Request 81: Status 200
Request 64: Status 200
Request 65: Status 200
Request 73: Status 200
Request 71: Status 200
Request 70: Status 200
Request 56: Status 200
Request 78: Status 200
Request 86: Status 200
Request 91: Status 200
Request 84: Status 200
Request 50: Status 200
Request 69: Status 200
Request 80: Status 200
Request 62: Status 200
Request 59: Status 200
Request 54: Status 200
Request 75: Status 200
Request 79: Status 200
Request 83: Status 200
Request 48: Status 200
Request 60: Status 200
Request 55: Status 200
Request 96: Status 200
Request 87: Status 200
Request 93: Status 200
Request 47: Status 200
Request 92: Status 200
Request 90: Status 200
Request 68: Status 200
Request 63: Status 200
Request 67: Status 200
Request 98: Status 200
Request 94: Status 200
Request 99: Status 200
Request 97: Status 200
Request 95: Status 200
Going back to the Grafana dashboard, you should now see something like:
To unload the model, run the following command:
!kubectl delete -f manifests/local-chat-completions.yaml -n seldon
model.mlops.seldon.io "local-chat-completions" deleted