Release Notes
Version 0.1.0
Overview
The Model Performance Metrics Module provides performance evaluation for machine learning models deployed in Seldon Core 2. It captures inference responses from Kafka, stores them in PostgreSQL, and calculates model performance metrics using user-provided feedback (ground truth). Metrics are computed over configurable time windows, with results accessible via dedicated API endpoints or visualized in Grafana.
Key Features
Inference Response Processing: Captures inference responses from Seldon Core 2 models via Kafka and stores them in PostgreSQL for metric computation.
Feedback Integration: Computes metrics based on inference responses and user-provided feedback (ground truth).
Classification Metrics: Accuracy, Precision, Recall, Specificity, F1-score, and Confusion Matrix for binary and multiclass classification.
Regression Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). (A worked example covering both metric families follows this list.)
Time Windowing: Aggregates metrics over configurable time intervals, with a maximum of 100 time buckets per window.
Model Subscriptions: Models must be subscribed to enable metric computation. Subscriptions define the model’s output schema, ensuring correct parsing of inference responses and feedback (a hypothetical subscription request is sketched after this list).
API Access: Metrics are accessible via API endpoints for programmatic retrieval (a hypothetical retrieval request is sketched after this list).
Grafana Support: Metrics can be visualized using Grafana dashboards, which can be configured with the Infinity plugin.
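To make the Classification Metrics and Regression Metrics items concrete, below is a minimal self-contained sketch using the standard textbook formulas. It is illustrative only: the sample data is made up, and the module's own implementation is not shown in these notes.

```python
# Standard definitions of the metrics listed above, computed from scratch.
# Sample labels and values are illustrative only.
import math

# --- Binary classification: build a 2x2 confusion matrix ---
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # user-provided feedback (ground truth)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # captured inference responses

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # a.k.a. sensitivity
specificity = tn / (tn + fp)          # true negative rate
f1          = 2 * precision * recall / (precision + recall)

print(f"confusion matrix: [[{tn}, {fp}], [{fn}, {tp}]]")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")

# --- Regression: MAE, MSE, RMSE ---
actual    = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 3.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / len(errors)
mse  = sum(e * e for e in errors) / len(errors)
rmse = math.sqrt(mse)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
```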
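The Model Subscriptions item implies an HTTP call that registers a model together with its output schema. The sketch below is hypothetical throughout: the host, endpoint path, and every payload field are placeholders rather than the module's documented API, so consult the API reference for the real shapes.

```python
# Hypothetical sketch of subscribing a model for metric computation.
# The URL and all payload fields below are placeholders, not the
# module's documented API.
import json
import urllib.request

subscription = {
    "pipeline": "income-pipeline",          # pipeline the model runs in
    "model": "income-classifier",
    "task": "binary-classification",        # drives how outputs are parsed
    "outputSchema": {                       # maps raw outputs to labels
        "predictionField": "predictions",
        "classes": ["<=50K", ">50K"],
    },
}

req = urllib.request.Request(
    "http://metrics-module.local/v1/subscriptions",   # placeholder
    data=json.dumps(subscription).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```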
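Retrieval over the API might then look like the following; again, the URL and query parameters are hypothetical stand-ins for whatever the API reference documents. The same endpoint URL could also back a Grafana panel through the Infinity plugin mentioned above.

```python
# Hypothetical sketch of fetching windowed metrics over HTTP.
# The URL and query parameters are placeholders.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "model": "income-classifier",
    "start": "2024-01-01T00:00:00Z",   # window start
    "end": "2024-01-02T00:00:00Z",     # window end
    "bucketSize": "1h",                # 24 buckets, well under the 100 cap
})
url = f"http://metrics-module.local/v1/metrics?{params}"  # placeholder

with urllib.request.urlopen(url) as resp:
    buckets = json.loads(resp.read())

for bucket in buckets:   # one entry per time bucket
    print(bucket)
```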
Known Limitations
Only models deployed within a Seldon Core 2 Pipeline can be subscribed to and evaluated. Multi-model pipelines are not supported.
Multilabel classification is not supported; only binary and multiclass classification are available.
The number of buckets within a time window is limited to 100. If an interval results in more than 100 buckets, users must adjust the time window or bucket size (see the sketch after this list).
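A query's bucket count can be pre-checked before hitting the limit. The helper below is illustrative (it is not part of the module) and assumes the bucket count is simply the window duration divided by the bucket size, rounded up.

```python
# Illustrative pre-check of the 100-bucket limit.
import math

MAX_BUCKETS = 100  # cap stated in the limitation above

def bucket_count(window_seconds: int, bucket_seconds: int) -> int:
    """Buckets produced by a window/bucket-size combination (assumed ceil)."""
    return math.ceil(window_seconds / bucket_seconds)

# A 7-day window with 1-hour buckets -> 168 buckets: over the cap,
# so either widen the buckets or shrink the window.
window = 7 * 24 * 3600
assert bucket_count(window, 3600) == 168       # exceeds the cap
assert bucket_count(window, 2 * 3600) == 84    # within the cap
```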
Usage Notes & Best Practices
Feedback may not always reflect the true ground truth distribution, as incorrect predictions are more likely to be reported than correct ones.
A low volume of feedback produces noisy metrics with little statistical significance, so apparent trends may be unreliable.
If a model artifact is updated at the same URI, the module will not detect the change. Metrics computed before and after the change may be misleading.
If no inference responses fall into a given bucket, the API returns -1 for numerical metrics (Accuracy, Recall, Precision, Specificity, F1, MSE, MAE, RMSE) and an empty array for the confusion matrix; see the sketch after these notes for handling this sentinel.
Inference responses that lack required metadata (e.g., pipeline name, model name, timestamp) are ignored and not stored.
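Because -1 is a sentinel rather than a real metric value, clients should drop it before averaging or plotting. A minimal sketch, assuming the response is a list of per-bucket objects (the field names here are hypothetical):

```python
# Skip sentinel buckets (-1) before aggregating; field names are hypothetical.
buckets = [
    {"start": "2024-01-01T00:00:00Z", "accuracy": 0.91},
    {"start": "2024-01-01T01:00:00Z", "accuracy": -1},    # empty bucket
    {"start": "2024-01-01T02:00:00Z", "accuracy": 0.88},
]

valid = [b for b in buckets if b["accuracy"] != -1]
mean_accuracy = sum(b["accuracy"] for b in valid) / len(valid)
print(f"{len(valid)}/{len(buckets)} populated buckets, "
      f"mean accuracy={mean_accuracy:.3f}")
```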