Triton GPT-2 Example on Azure

In this notebook, we will run a text-generation example using a GPT-2 model exported from HuggingFace and deployed with Seldon's pre-packaged Triton server. The example also covers converting the model to the ONNX format. The implementation below uses a greedy approach for next-token prediction.

For more info, see the HuggingFace GPT-2 documentation.

After we have the model deployed to Kubernetes, we will run a simple load test to evaluate the model's inference performance.

Steps

Basic Requirements

  • Helm v3.0.0+

  • A Kubernetes cluster running v1.13 or above

  • kubectl v1.14+

  • Python 3.6+

First, create a requirements.txt file:
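A minimal dependency set for the steps below; the exact packages and version pins are assumptions to adapt to your environment:

```text
transformers
tensorflow
torch        # required because the model is loaded with from_pt=True
tf2onnx      # converts the TensorFlow SavedModel to ONNX
numpy
requests     # used later to call the deployed model
```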

Now, install the dependencies:
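```bash
pip install -r requirements.txt
```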

Export the pre-trained HuggingFace TFGPT2LMHeadModel and save it locally

```python
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained(
    "gpt2", from_pt=True, pad_token_id=tokenizer.eos_token_id
)
model.save_pretrained("./tfgpt2model", saved_model=True)
```

Convert the TensorFlow saved model to ONNX

```bash
python -m tf2onnx.convert --saved-model ./tfgpt2model/saved_model/1 --opset 13 --output model.onnx
```

Azure Setup

We have provided an Azure Setup Notebook that deploys an AKS cluster, creates an Azure storage account, and installs the Azure Blob CSI driver. If an AKS cluster already exists, skip to the Blob Storage creation and CSI driver installation steps.

Upon completion of the Azure setup, the following infrastructure will be created:

[Image: diagram of the Azure infrastructure created by the setup notebook]

Copy your model to Azure Blob
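Triton expects a model repository layout of `<container>/<model-name>/<version>/model.onnx`. One way to upload the exported model with the Azure CLI; the storage account and container names below are placeholders from the Azure setup:

```bash
# Upload model.onnx into the layout Triton expects: gpt2/1/model.onnx
az storage blob upload \
  --account-name <storage-account> \
  --container-name <container> \
  --name gpt2/1/model.onnx \
  --file model.onnx
```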

Add Azure PersistentVolume and Claim

For more details on creating a PersistentVolume using the CSI driver, refer to the official documentation.

  • Create a Secret holding the storage account credentials

  • Create a PersistentVolume pointing to the secret and Blob Container Name

  • Create a PersistentVolumeClaim to bind to the volume

Create a file named azure-blobfuse-pv.yaml:
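A sketch of what this manifest might contain, assuming the Azure Blob CSI driver (blob.csi.azure.com) installed in the Azure setup step; the account, key, and container names are placeholders:

```yaml
# Secret with the storage account credentials used by the CSI driver.
apiVersion: v1
kind: Secret
metadata:
  name: azure-blob-secret
  namespace: default
type: Opaque
stringData:
  azurestorageaccountname: <storage-account>
  azurestorageaccountkey: <storage-account-key>
---
# PersistentVolume pointing at the Blob container that holds the model repository.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gpt2-blob-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: blob.csi.azure.com
    volumeHandle: gpt2-blob-pv          # must be unique within the cluster
    volumeAttributes:
      containerName: <container>
    nodeStageSecretRef:
      name: azure-blob-secret
      namespace: default
---
# PersistentVolumeClaim bound explicitly to the volume above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gpt2-blob-pvc
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""                  # bind directly to the named PV
  volumeName: gpt2-blob-pv
  resources:
    requests:
      storage: 10Gi
```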

Run Seldon in your Kubernetes cluster

Follow the Seldon-Core Setup notebook to set up a cluster with Istio Ingress and install Seldon Core.

Deploy your model with Seldon pre-packaged Triton server

Create a file named gpt2-deploy.yaml:
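A sketch of a SeldonDeployment using the pre-packaged Triton server; the PVC name matches the claim created above, and the replica setting is illustrative:

```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: gpt2
  namespace: default
spec:
  protocol: kfserving              # exposes the Triton v2 REST/gRPC API
  predictors:
    - name: default
      replicas: 1
      graph:
        name: gpt2                 # must match the model directory name in the repository
        implementation: TRITON_SERVER
        modelUri: pvc://gpt2-blob-pvc/
        type: MODEL
```

Apply it with `kubectl apply -f gpt2-deploy.yaml -n default`.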

Interact with the model: get model metadata
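With the kfserving protocol, model metadata is available over the Triton v2 REST API through the Istio ingress. The host and port values below are placeholders taken from the Seldon-Core Setup notebook:

```bash
curl -s http://<ingress-host>:<ingress-port>/seldon/default/gpt2/v2/models/gpt2
```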

Run a prediction test: generate a sentence completion using the GPT-2 model (greedy approach)
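A sketch of the greedy decoding loop against the deployed endpoint. The tensor names (input_ids, attention_mask, logits) follow the HuggingFace ONNX export and should be verified against the metadata response above; the ingress host and port are placeholders:

```python
import numpy as np
import requests
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
url = "http://<ingress-host>:<ingress-port>/seldon/default/gpt2/v2/models/gpt2/infer"

input_ids = tokenizer.encode("I enjoy working in Seldon")

for _ in range(10):  # generate up to 10 new tokens
    ids = np.array([input_ids], dtype=np.int32)
    payload = {
        "inputs": [
            {"name": "input_ids", "datatype": "INT32",
             "shape": list(ids.shape), "data": ids.flatten().tolist()},
            {"name": "attention_mask", "datatype": "INT32",
             "shape": list(ids.shape), "data": [1] * ids.size},
        ]
    }
    resp = requests.post(url, json=payload).json()
    out = next(o for o in resp["outputs"] if o["name"] == "logits")
    logits = np.array(out["data"], dtype=np.float32).reshape(out["shape"])
    next_token = int(logits[0, -1].argmax())  # greedy: take the highest-scoring token
    if next_token == tokenizer.eos_token_id:
        break
    input_ids.append(next_token)

print(tokenizer.decode(input_ids))
```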

Configure Model Monitoring with Azure Monitor

Azure Monitor Container insights can collect data from any Prometheus endpoint. To turn on Container insights, follow the steps described here.

Configure Prometheus Metrics scraping

For more details on how to configure the scraping endpoints and query the collected data, refer to the MS Docs article Configure scraping of Prometheus metrics with Container insights.

Metrics for our deployed model are exposed by both the Seldon model orchestrator and the NVIDIA Triton server. To enable scraping of both endpoints, update the ConfigMap that configures omsagent (azure-metrics-cm.yaml).
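The relevant portion of azure-metrics-cm.yaml is the container-azm-ms-agentconfig ConfigMap that omsagent reads. The settings below are a sketch based on the Container insights Prometheus schema, with pod-annotation scraping enabled so both endpoints are picked up:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  prometheus-data-collection-settings: |-
    [prometheus_data_collection_settings.cluster]
        interval = "1m"
        # Scrape pods annotated with prometheus.io/scrape: "true"
        # (covers the Seldon orchestrator and Triton metrics endpoints).
        monitor_kubernetes_pods = true
```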

Query and Visualize collected data

Collected metrics are available in the Logs blade of Azure Monitor, in the InsightsMetrics table.

To get Model Inference Requests per minute from Seldon Metrics, run the following KQL query:
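A sketch of such a query; the Seldon executor counter name below is an assumption to verify against what actually lands in InsightsMetrics:

```kusto
InsightsMetrics
| where Namespace == "prometheus"
| where Name == "seldon_api_executor_server_requests_seconds_count"
| summarize Requests = max(Val) by bin(TimeGenerated, 1m)
| sort by TimeGenerated asc
| extend RequestsPerMin = Requests - prev(Requests)  // delta of the cumulative counter
| project TimeGenerated, RequestsPerMin
| render timechart
```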

To get Inference Duration from Triton Metrics:
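Similarly for Triton, using its cumulative nv_inference_request_duration_us counter (again, verify the metric name in your workspace):

```kusto
InsightsMetrics
| where Namespace == "prometheus"
| where Name == "nv_inference_request_duration_us"
| summarize CumulativeUs = max(Val) by bin(TimeGenerated, 1m)
| sort by TimeGenerated asc
| extend DurationUsPerMin = CumulativeUs - prev(CumulativeUs)  // inference time spent per minute
| project TimeGenerated, DurationUsPerMin
| render timechart
```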

Here is an example dashboard created using the queries above:

[Image: example Azure Monitor dashboard]

Run Load Test / Performance Test using vegeta

Install vegeta
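One way to install it, from a release binary; the version shown is illustrative, so check the releases page for the latest:

```bash
wget https://github.com/tsenart/vegeta/releases/download/v12.8.4/vegeta_12.8.4_linux_amd64.tar.gz
tar -xzf vegeta_12.8.4_linux_amd64.tar.gz
chmod +x vegeta
sudo mv vegeta /usr/local/bin/
```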

For more details, see the official vegeta documentation.

Generate vegeta target file
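A sketch that writes a single-request target in vegeta's JSON format, reusing the payload from the prediction step; the ingress host and port are placeholders:

```python
# Build one inference request and write it as a vegeta JSON target.
import base64
import json

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("I enjoy working in Seldon")

body = {
    "inputs": [
        {"name": "input_ids", "datatype": "INT32",
         "shape": [1, len(ids)], "data": ids},
        {"name": "attention_mask", "datatype": "INT32",
         "shape": [1, len(ids)], "data": [1] * len(ids)},
    ]
}
target = {
    "method": "POST",
    "url": "http://<ingress-host>:<ingress-port>/seldon/default/gpt2/v2/models/gpt2/infer",
    "body": base64.b64encode(json.dumps(body).encode()).decode(),  # vegeta expects base64 bodies
    "header": {"Content-Type": ["application/json"]},
}
with open("vegeta_target.json", "w") as f:
    f.write(json.dumps(target) + "\n")
```

Then run the attack and report the results, for example:

```bash
vegeta attack -targets=vegeta_target.json -format=json -rate=1 -duration=60s \
  | vegeta report -type=text
```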

Clean-up

```bash
kubectl delete -f gpt2-deploy.yaml -n default
```
