Setup for Triton GPT2 Example Azure

Setup Azure Kubernetes Infrastructure

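In this notebook we will create an Azure resource group, an AKS cluster with system, GPU, and CPU node pools, a storage account, and a Kubernetes PersistentVolume backed by Azure Blob storage through the Blob CSI driver.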


Define Variables

Set variables required for the project:

subscription_id = "<xxxx-xxxx-xxxx-xxxx>"  # fill in
resource_group = "seldon"  # feel free to replace or use this default
region = "eastus2"  # feel free to replace or use this default

storage_account_name = "modeltestsgpt"  # fill in
storage_container_name = "gpt2tf"

aks_name = "modeltests"  # feel free to replace or use this default
aks_gpupool = "gpunodes"  # feel free to replace or use this default
aks_cpupool = "cpunodes"  # feel free to replace or use this default
aks_gpu_sku = "Standard_NC6s_v3"  # feel free to replace or use this default
aks_cpu_sku = "Standard_F8s_v2"

Azure account login

If you are not already logged in to an Azure account, running az login will initiate a login. This opens a browser where you can select your account. If no web browser is available or the browser fails to open, use the device code flow with az login --use-device-code, or log in from a WSL command prompt and then return to the notebook.
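The login cell itself is not reproduced here. A minimal sketch of the commands involved, assuming the Azure CLI is installed; the az account set step is an assumption that subscription_id is meant to select the active subscription. The commands are only printed, not executed.

```python
import shlex

# Assumption: same placeholder as in "Define Variables".
subscription_id = "<xxxx-xxxx-xxxx-xxxx>"


def login_commands(subscription, device_code=False):
    """Return the az commands that log in and select a subscription."""
    login = ["az", "login"]
    if device_code:
        # Fallback for environments without a usable web browser.
        login.append("--use-device-code")
    select = ["az", "account", "set", "--subscription", subscription]
    return [login, select]


for cmd in login_commands(subscription_id, device_code=True):
    # Printed for review; pass each list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```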


Create Resource Group

Azure encourages the use of resource groups to organize the components you deploy. This makes them easier to find, and a whole set of resources can be deleted simply by deleting the group.
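A sketch of the resource group creation, assuming the resource_group and region values from "Define Variables". The command is printed rather than executed.

```python
import shlex

# Assumption: same values as in "Define Variables".
resource_group = "seldon"
region = "eastus2"


def create_group_command(name, location):
    """Build the az command that creates a resource group."""
    return ["az", "group", "create", "--name", name, "--location", location]


# Pass the list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(create_group_command(resource_group, region)))
```

Deleting the group later (az group delete --name seldon) removes everything created inside it.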


Create AKS Cluster and NodePools

We create the AKS cluster with a single system node in the resource group created earlier. One node saves time here; in production, use more nodes as per best practices. This step can take five or more minutes.
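A possible form of the cluster creation command, as a sketch. The flag set below is an assumption kept to the minimum; the original cell may have pinned a Kubernetes version or VM size that is not shown here.

```python
import shlex

# Assumption: same values as in "Define Variables".
resource_group = "seldon"
aks_name = "modeltests"
region = "eastus2"


def create_cluster_command(group, name, location, node_count=1):
    """Build the az command that creates an AKS cluster with one system pool."""
    return [
        "az", "aks", "create",
        "--resource-group", group,
        "--name", name,
        "--location", location,
        "--node-count", str(node_count),  # single system node, to save time
        "--generate-ssh-keys",            # create SSH keys if none exist yet
    ]


# Pass the list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(create_cluster_command(resource_group, aks_name, region)))
```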


Connect to AKS Cluster

To configure kubectl to connect to the Kubernetes cluster, fetch the cluster credentials with az aks get-credentials.

Then verify the connection by listing the nodes with kubectl get nodes.
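Both steps can be sketched as follows, assuming the variable values from "Define Variables". The --overwrite-existing flag is an addition for repeated runs, not something the original text states.

```python
import shlex

# Assumption: same values as in "Define Variables".
resource_group = "seldon"
aks_name = "modeltests"


def connect_commands(group, name):
    """Return the commands that fetch kubeconfig credentials and list nodes."""
    get_credentials = [
        "az", "aks", "get-credentials",
        "--resource-group", group,
        "--name", name,
        "--overwrite-existing",  # replace a stale kubeconfig entry of the same name
    ]
    list_nodes = ["kubectl", "get", "nodes"]
    return [get_credentials, list_nodes]


for cmd in connect_commands(resource_group, aks_name):
    # Pass each list to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```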

Example output:


Taint the system node with the CriticalAddonsOnly taint so that it is available only for system workloads.
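One way to apply the taint, as a sketch. Selecting nodes by the kubernetes.azure.com/mode=system label is an assumption about how AKS labels system pool nodes; tainting a node by name works as well.

```python
import shlex

# Assumption: AKS labels system node pool members with this label.
SYSTEM_POOL_SELECTOR = "kubernetes.azure.com/mode=system"


def taint_system_nodes_command(selector=SYSTEM_POOL_SELECTOR):
    """Build the kubectl command that reserves system nodes for critical add-ons."""
    return [
        "kubectl", "taint", "nodes",
        "--selector", selector,
        "CriticalAddonsOnly=true:NoSchedule",
        "--overwrite",  # allow re-running without an "already has taint" error
    ]


# Pass the list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(taint_system_nodes_command()))
```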


Create GPU enabled and CPU Node Pools

To create the GPU-enabled node pool, we will use the fully configured AKS image that contains the NVIDIA device plugin for Kubernetes; see Use the AKS specialized GPU image (preview). Creating node pools can take five or more minutes.


Create GPU NodePool with GPU taint

For more information on Azure Nodepools: https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools
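A sketch of the GPU node pool creation, assuming the values from "Define Variables" and a node count of 1. The UseGPUDedicatedVHD custom header is how the preview GPU image was selected at the time; treat it as an assumption and check the current AKS documentation before relying on it.

```python
import shlex

# Assumption: same values as in "Define Variables".
resource_group = "seldon"
aks_name = "modeltests"
aks_gpupool = "gpunodes"
aks_gpu_sku = "Standard_NC6s_v3"


def add_gpu_pool_command(group, cluster, pool, sku, node_count=1):
    """Build the az command that adds a tainted GPU node pool."""
    return [
        "az", "aks", "nodepool", "add",
        "--resource-group", group,
        "--cluster-name", cluster,
        "--name", pool,
        "--node-count", str(node_count),
        "--node-vm-size", sku,
        "--node-taints", "sku=gpu:NoSchedule",  # keep non-GPU workloads off these nodes
        # Assumption: preview header that selects the specialized GPU image.
        "--aks-custom-headers", "UseGPUDedicatedVHD=true",
    ]


# Pass the list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(add_gpu_pool_command(resource_group, aks_name, aks_gpupool, aks_gpu_sku)))
```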


Verify GPU is available on Kubernetes Node

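Now use the kubectl describe node command to confirm that the GPUs are schedulable. Under the Capacity section, a Standard_NC12 node should list nvidia.com/gpu: 2. The default Standard_NC6s_v3 SKU set in "Define Variables" has a single GPU, so expect nvidia.com/gpu: 1 there.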
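The check can also be done programmatically. The sketch below reads the GPU count from the Capacity section of kubectl describe node output; gpu_capacity is a hypothetical helper, and the sample text is a shortened stand-in built from the line quoted above, not real command output.

```python
def gpu_capacity(describe_text):
    """Return the nvidia.com/gpu count from the Capacity section, or 0 if absent."""
    in_capacity = False
    for line in describe_text.splitlines():
        if line.startswith("Capacity:"):
            in_capacity = True
            continue
        if in_capacity:
            if line and not line[0].isspace():
                break  # reached the next top-level section
            key, _, value = line.strip().partition(":")
            if key == "nvidia.com/gpu":
                return int(value.strip())
    return 0


# Shortened stand-in for `kubectl describe node <gpu-node>` output.
sample = "Capacity:\n  nvidia.com/gpu: 2\nAllocatable:\n  nvidia.com/gpu: 2\n"
print(gpu_capacity(sample))
```

A result of 0 on a GPU node would suggest the NVIDIA device plugin is not running on it.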


Create CPU NodePool for running regular workloads
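A sketch of the CPU node pool creation, assuming the values from "Define Variables" and a node count of 1. No taint is applied, so regular workloads can schedule here.

```python
import shlex

# Assumption: same values as in "Define Variables".
resource_group = "seldon"
aks_name = "modeltests"
aks_cpupool = "cpunodes"
aks_cpu_sku = "Standard_F8s_v2"


def add_cpu_pool_command(group, cluster, pool, sku, node_count=1):
    """Build the az command that adds an untainted CPU node pool."""
    return [
        "az", "aks", "nodepool", "add",
        "--resource-group", group,
        "--cluster-name", cluster,
        "--name", pool,
        "--node-count", str(node_count),
        "--node-vm-size", sku,
    ]


# Pass the list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(add_cpu_pool_command(resource_group, aks_name, aks_cpupool, aks_cpu_sku)))
```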


Verify Taints on the Kubernetes nodes

Verify that the system pool and the GPU pool have the taints CriticalAddonsOnly and sku=gpu respectively.
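One way to inspect the taints is to parse the output of kubectl get nodes -o json. The sketch below does this on an illustrative document; taints_by_node is a hypothetical helper and the node names are made up.

```python
import json


def taints_by_node(nodes_json):
    """Map each node name to its taints, formatted as 'key=value:effect'."""
    result = {}
    for item in json.loads(nodes_json)["items"]:
        taints = item["spec"].get("taints", [])
        result[item["metadata"]["name"]] = [
            f"{t['key']}={t.get('value', '')}:{t['effect']}" for t in taints
        ]
    return result


# Illustrative stand-in for `kubectl get nodes -o json`; node names are hypothetical.
sample = json.dumps({"items": [
    {"metadata": {"name": "system-node"},
     "spec": {"taints": [{"key": "CriticalAddonsOnly", "value": "true", "effect": "NoSchedule"}]}},
    {"metadata": {"name": "gpu-node"},
     "spec": {"taints": [{"key": "sku", "value": "gpu", "effect": "NoSchedule"}]}},
    {"metadata": {"name": "cpu-node"}, "spec": {}},
]})

for name, taints in taints_by_node(sample).items():
    print(name, taints)
```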


Create Storage Account for training data

In this section of the notebook, we'll create an Azure Blob storage account that we'll use throughout the tutorial. This object store will be used to store input images and save checkpoints. Use the az CLI to create the account.
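A sketch of the account creation, assuming the values from "Define Variables". The Standard_LRS SKU and StorageV2 kind are assumptions chosen as common defaults, not values stated in the original.

```python
import shlex

# Assumption: same values as in "Define Variables".
resource_group = "seldon"
region = "eastus2"
storage_account_name = "modeltestsgpt"


def create_storage_account_command(group, account, location):
    """Build the az command that creates a general-purpose storage account."""
    return [
        "az", "storage", "account", "create",
        "--resource-group", group,
        "--name", account,
        "--location", location,
        "--sku", "Standard_LRS",  # assumption: locally redundant storage
        "--kind", "StorageV2",    # assumption: general-purpose v2 account
    ]


# Pass the list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(create_storage_account_command(resource_group, storage_account_name, region)))
```

Storage account names must be globally unique, so the default name will likely need changing.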

Grab the keys of the storage account that was just created. We will need them to bind the Kubernetes PersistentVolume. The --query '[0].value' part of the command selects the value of the first key (index 0) in the list of keys.

The notebook captures the command's stdout as a list of strings with a single element. Select that element to get the key.
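A sketch of the key lookup and the selection step, assuming the values from "Define Variables". The captured list is a placeholder standing in for real output, and creating the container with the retrieved key is an assumption about where the container from storage_container_name gets created.

```python
import shlex

# Assumption: same values as in "Define Variables".
resource_group = "seldon"
storage_account_name = "modeltestsgpt"
storage_container_name = "gpt2tf"


def list_keys_command(group, account):
    """Build the az command that prints the first storage account key."""
    return [
        "az", "storage", "account", "keys", "list",
        "--resource-group", group,
        "--account-name", account,
        "--query", "[0].value",  # JMESPath: value of the key at index 0
        "--output", "tsv",       # plain text, without JSON quoting
    ]


def create_container_command(account, key, container):
    """Build the az command that creates a blob container in the account."""
    return [
        "az", "storage", "container", "create",
        "--account-name", account,
        "--account-key", key,
        "--name", container,
    ]


print(shlex.join(list_keys_command(resource_group, storage_account_name)))

# Placeholder for the captured stdout lines; a real run yields the actual key.
captured = ["<storage-account-key>"]
storage_account_key = captured[0]  # the single element of the captured list

print(shlex.join(create_container_command(
    storage_account_name, storage_account_key, storage_container_name)))
```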


Install Kubernetes Blob CSI Driver

The Azure Blob Storage CSI driver for Kubernetes allows Kubernetes to access Azure Storage. We will deploy it using the Helm 3 package manager, as described in the docs: https://github.com/kubernetes-sigs/blob-csi-driver/tree/master/charts
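A sketch of the Helm installation, following the chart instructions linked above. The repository URL and the kube-system namespace are taken from those instructions as remembered and should be treated as assumptions; check the chart README for the current values and a pinned version.

```python
import shlex

# Assumption: chart repository URL as given in the driver's chart README.
CHART_REPO_URL = (
    "https://raw.githubusercontent.com/kubernetes-sigs/blob-csi-driver/master/charts"
)


def install_driver_commands(namespace="kube-system"):
    """Return the helm commands that register the repo and install the driver."""
    add_repo = ["helm", "repo", "add", "blob-csi-driver", CHART_REPO_URL]
    install = [
        "helm", "install", "blob-csi-driver",
        "blob-csi-driver/blob-csi-driver",
        "--namespace", namespace,
    ]
    return [add_repo, install]


for cmd in install_driver_commands():
    # Pass each list to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```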


Create Persistent Volume for Azure Blob

For more details on creating a PersistentVolume using the CSI driver, refer to https://github.com/kubernetes-sigs/blob-csi-driver/blob/master/deploy/example/e2e_usage.md
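The PersistentVolume needs a Kubernetes secret holding the storage account name and key. The cell that creates it is not shown here; the sketch below assumes the secret name azure-secret and the key names used in the driver's usage example linked above.

```python
import shlex

# Assumption: same value as in "Define Variables".
storage_account_name = "modeltestsgpt"
# Placeholder; use the key retrieved from the storage account.
storage_account_key = "<storage-account-key>"


def create_secret_command(account, key, secret_name="azure-secret"):
    """Build the kubectl command that stores the storage credentials as a secret."""
    return [
        "kubectl", "create", "secret", "generic", secret_name,
        "--from-literal", f"azurestorageaccountname={account}",
        "--from-literal", f"azurestorageaccountkey={key}",
        "--type=Opaque",
    ]


# Pass the list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(create_secret_command(storage_account_name, storage_account_key)))
```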

The PersistentVolume YAML definition is in azure-blobfuse-pv.yaml, with fields pointing to the secret created above and to the container name we created in the storage account:

Create the file azure-blobfuse-pv.yaml:
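The original manifest is not reproduced here. The sketch below renders a manifest modeled on the driver's usage example; the volume name pv-blob, the 10Gi capacity, the mount option, and the secret name azure-secret are all assumptions.

```python
# Assumption: same value as in "Define Variables".
storage_container_name = "gpt2tf"

PV_TEMPLATE = """\
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {pv_name}
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - -o allow_other
  csi:
    driver: blob.csi.azure.com
    readOnly: false
    volumeHandle: {pv_name}-handle
    volumeAttributes:
      containerName: {container}
    nodeStageSecretRef:
      name: {secret_name}
      namespace: default
"""


def render_pv(container, secret_name="azure-secret", pv_name="pv-blob"):
    """Fill in the container and secret names in the manifest template."""
    return PV_TEMPLATE.format(
        container=container, secret_name=secret_name, pv_name=pv_name
    )


manifest = render_pv(storage_container_name)
print(manifest)
```

To create the file, write the rendered text with pathlib.Path("azure-blobfuse-pv.yaml").write_text(manifest), then apply it with kubectl apply -f azure-blobfuse-pv.yaml.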

Example output:


At the end of this step you will have an AKS cluster and a storage account in the resource group. The AKS cluster will have CPU and GPU node pools in addition to the system node pool.
