GKE with GPU Tensorflow Deep MNIST

Please note: This tutorial uses Tensorflow-gpu=1.13.1, CUDA 10.0 and cuDNN 7.6

Requirements: Ubuntu 18.04 (or later) and Python 3.6

In this tutorial we will run a deep MNIST Tensorflow example with GPU.

The tutorial will be broken down into the following sections:

  1. Install all dependencies to run Tensorflow-GPU

    1.1 Installing CUDA 10.0

    1.2 Installing cuDNN 7.6

    1.3 Configure CUDA and cuDNN

    1.4 Install Tensorflow GPU

  2. Train the MNIST model locally

  3. Push the image to your project's Container Registry

  4. Deploy the model on GKE using Seldon Core

Local Testing Environment

For the development of this example a GCE Virtual Machine was used to allow access to a GPU. The configuration for this VM is as follows:

  • VM Image: TensorFlow from NVIDIA

  • 8 vCPUs

  • 32 GB memory

  • 1x NVIDIA Tesla V100 GPU

1) Installing all dependencies to run Tensorflow-GPU

  • Dependencies installed in this section:

    • NVIDIA GPU with compute capability 3.0 or higher

    • CUDA 10.0

    • cuDNN 7.6

    • tensorflow-gpu 1.13.1

Check that your GPU has NVIDIA compute capability 3.0 or higher.
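
One way to check, assuming the driver that ships with the NVIDIA VM image is present:

    # Identify the GPU attached to the VM (works even before CUDA is installed)
    lspci | grep -i nvidia

    # If the NVIDIA driver is present, this also reports the driver version;
    # the Tesla V100 used here has compute capability 7.0
    nvidia-smi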

1.1) Install CUDA 10.0

  • Download the CUDA 10.0 runfile

  • Unpack the separate files:

  • Install the Cuda 10.0 Toolkit file:

From the terminal, run the following commands:
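
A sketch of these three steps. The URL and filenames below are for the 10.0.130 release and the build numbers in the extracted installers may differ for your download:

    # Download the CUDA 10.0 runfile
    wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux

    # Unpack the driver, toolkit and samples installers into a separate folder
    mkdir -p ~/cuda-installers
    sh cuda_10.0.130_410.48_linux --extract=$HOME/cuda-installers

    # Install the CUDA 10.0 toolkit (the extracted filename carries a build number)
    cd ~/cuda-installers
    sudo sh cuda-linux.10.0.130-24817639.run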

Hold 'd' to scroll to the bottom of the license agreement.

Accept the licensing agreement and all of the default settings.

  • Verify the install by installing the sample tests:
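
For example (again, the extracted filename will carry the build number of your download):

    # Install the bundled CUDA samples
    sudo sh cuda-samples.10.0.130-24817639-linux.run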

Again, accept the agreement and all default settings

  • Configure the runtime library (sketched together with the next step below):

  • Add the cuda bin to the file system:

Add ‘:/usr/local/cuda/bin’ to the end of the PATH (inside quotes)
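
One way to do both of the steps above, assuming the toolkit was installed to the default /usr/local/cuda location:

    # Point the dynamic linker at the CUDA libraries
    sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
    sudo ldconfig

    # Open /etc/environment and append :/usr/local/cuda/bin to the PATH value
    sudo nano /etc/environment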

  • Reboot the system

  • Run the tests that we set up - this takes some time to complete, so let it run for a little while...
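
For example, assuming the samples were installed to the default location:

    # Build the CUDA samples (this can take a while)
    cd /usr/local/cuda-10.0/samples
    sudo make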

If you run into an error involving the GCC version:
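
One possible workaround (an assumption on my part, not the only fix) is to build with an older GCC toolchain:

    # Install GCC 6 and re-run the build, telling the samples Makefiles
    # to use it as the host compiler for nvcc
    sudo apt-get install -y gcc-6 g++-6
    sudo make HOST_COMPILER=g++-6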

Otherwise, skip this step.

  • Once the build is complete, run the deviceQuery and bandwidthTest samples:
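
For example, assuming the default samples location:

    # The compiled samples end up under bin/x86_64/linux/release;
    # both tests should finish with "Result = PASS"
    cd /usr/local/cuda-10.0/samples/bin/x86_64/linux/release
    ./deviceQuery
    ./bandwidthTest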

Remember to clean up by removing all of the downloaded installer files.
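
For instance (the filenames will match whatever you downloaded earlier):

    # Remove the downloaded runfile and the extracted installers
    rm ~/cuda_10.0.130_410.48_linux
    rm -rf ~/cuda-installers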

1.2) Install cuDNN 7.6

  • Download all 3 .deb files for CUDA 10.0 and Ubuntu 18.04

You will have to create an NVIDIA account for this and go to the archive section of the cuDNN downloads page.

Ensure you download all 3 files:

  • Runtime

  • Developer

  • Code Samples

Unpack the three files in this order:
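
A sketch of the install order. The version strings below are an example; use the ones in the filenames you actually downloaded:

    # Runtime, developer and code-samples packages, in that order
    sudo dpkg -i libcudnn7_7.6.5.32-1+cuda10.0_amd64.deb
    sudo dpkg -i libcudnn7-dev_7.6.5.32-1+cuda10.0_amd64.deb
    sudo dpkg -i libcudnn7-doc_7.6.5.32-1+cuda10.0_amd64.deb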

  • Verify the install is successful with the MNIST example

From the download folder, copy the cuDNN samples to somewhere with write access:
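
The code-samples package installs them under /usr/src, so for example:

    # Copy the cuDNN samples to your home directory
    cp -r /usr/src/cudnn_samples_v7/ ~/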

Go to the MNIST example code, then compile and run it:
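
For example:

    cd ~/cudnn_samples_v7/mnistCUDNN
    make clean && make
    # A successful run finishes with "Test passed!"
    ./mnistCUDNN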

Remember to clean up by removing all of the downloaded .deb packages.

1.3) Configure CUDA and cuDNN

Add LD_LIBRARY_PATH to your .bashrc file:

Append the following export line at the end of the file, then source it so the change takes effect in the current shell:
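
A minimal sketch, assuming the default CUDA library location:

    # Append the CUDA library path to LD_LIBRARY_PATH
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

    # Reload .bashrc in the current shell
    source ~/.bashrc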

1.4) Install tensorflow with GPU

tensorflow-gpu 1.13.1 is required, as that release is built against CUDA 10.0:
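
For example:

    # Install the GPU build of TensorFlow that matches CUDA 10.0
    pip3 install tensorflow-gpu==1.13.1

    # Quick check that TensorFlow can see the GPU
    python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available())"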

2) Train the MNIST model locally

  • Wrap a Tensorflow MNIST python model for use as a prediction microservice in seldon-core

    • Run locally on Docker to test

    • Deploy on seldon-core running on GKE (covered in section 4)

Dependencies
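
At minimum this means the seldon-core Python package, which provides the model wrapper and the local tester used below; a sketch:

    # tensorflow-gpu was already installed in section 1.4
    pip3 install seldon-core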

Train locally
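
A sketch, assuming the training code from this example is saved as create_model.py (the filename is an assumption; run whichever script or notebook cell contains the training code):

    # Train the deep MNIST model and save the weights for the wrapper class to load
    python3 create_model.py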

Wrap model using s2i
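
A sketch of the wrap-and-run step. The builder image name and tag here are assumptions: use the GPU Python 3 s2i builder that matches your Seldon Core version. Running the container with --runtime=nvidia assumes nvidia-docker2 is installed on the host:

    # Wrap the model directory as a Docker image with the Seldon s2i builder
    s2i build . seldonio/seldon-core-s2i-python3-tf-gpu:0.12 deep-mnist-gpu:0.1

    # Run it locally to test, exposing the REST endpoint on port 5000
    docker run --runtime=nvidia --name mnist_predictor -d -p 5000:5000 deep-mnist-gpu:0.1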

Send some random features that conform to the contract
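
For example, using the tester CLI that shipped with pre-1.0 seldon-core (contract.json comes from this example's directory):

    # Generate random features conforming to contract.json and send them
    # to the locally running container
    seldon-core-tester contract.json 0.0.0.0 5000 -p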

3) Push the image to Google Container Registry

Configure Docker access to the Container Registry for your own project:
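
For example:

    # Let Docker push to Container Registry using your gcloud credentials
    gcloud auth configure-docker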

Tag the image with your project's registry path (edit the command below):
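
For example, where <YOUR_PROJECT_ID> is a placeholder for your GCP project ID:

    docker tag deep-mnist-gpu:0.1 gcr.io/<YOUR_PROJECT_ID>/deep-mnist-gpu:0.1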

Push the image to the Container Registry (again, edit the command below):
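
And then:

    docker push gcr.io/<YOUR_PROJECT_ID>/deep-mnist-gpu:0.1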

4) Deploy in GKE

Spin up a GKE Cluster

For this example only one node is needed within the cluster. The cluster should have the following config:

  • 8 CPUs

  • 30 GB Total Memory

  • 1 Node with 1X NVIDIA Tesla V100 GPU

  • Ubuntu Node image

Leave the rest of the config as default.
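
A sketch of a gcloud command matching the config above; the cluster name and zone are placeholders, and the zone you pick must offer V100 GPUs:

    # 1 node, 8 vCPUs / 30 GB (n1-standard-8), 1x V100, Ubuntu node image
    gcloud container clusters create gpu-cluster \
        --zone us-central1-a \
        --num-nodes 1 \
        --machine-type n1-standard-8 \
        --accelerator type=nvidia-tesla-v100,count=1 \
        --image-type UBUNTU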

Connect to your cluster and check the context.
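
For example, using the placeholder name and zone from the previous step:

    # Fetch credentials for kubectl and confirm the active context
    gcloud container clusters get-credentials gpu-cluster --zone us-central1-a
    kubectl config current-context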

Installing NVIDIA GPU device drivers

(The below command is for the Ubuntu Node Image - if using a COS image, please see the Google Cloud Documentation for the correct command).
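
At the time of writing, the GKE documentation points at a driver-installer DaemonSet for Ubuntu nodes; something along these lines (check the documentation for the current manifest URL):

    # DaemonSet that installs the NVIDIA drivers on Ubuntu GKE nodes
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml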

Setup Seldon Core

Use the setup notebook to set up the cluster with Ambassador ingress and install Seldon Core. Instructions are also available online.

Build the Seldon Graph

First, let's look at the Seldon graph YAML file:

Change the image name in this file (line 24) to match the path to the image in your container registry.

Next, we are ready to deploy the Seldon graph:
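
For example (the filename is an assumption; use the graph file from this example's directory):

    # Create the SeldonDeployment from the graph definition
    kubectl apply -f deep_mnist_gpu.yaml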

Check that the deployment is running:
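
For example:

    # The deployment is ready once the predictor pods report Running
    kubectl get seldondeployments
    kubectl get pods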

Test the deployment with test data

Change the IP address to the External IP of your Ambassador deployment.
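
A sketch using curl against the Ambassador endpoint. The IP, deployment name and (if applicable) namespace are placeholders, and the payload must contain a full row of 784 pixel values to satisfy the MNIST contract:

    # Replace <EXTERNAL_IP> with the external IP of the Ambassador service and
    # <DEPLOYMENT_NAME> with the name used in the Seldon graph file; depending on
    # your install the path may also need the namespace, e.g. /seldon/<namespace>/<deployment>/...
    curl -s -X POST http://<EXTERNAL_IP>/seldon/<DEPLOYMENT_NAME>/api/v0.1/predictions \
        -H "Content-Type: application/json" \
        -d '{"data":{"ndarray":[[ <784 pixel values> ]]}}'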

Clean up

Make sure you delete the cluster once you have finished with it to avoid any ongoing charges.
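
For example:

    # Delete the cluster (name and zone must match the ones used at creation)
    gcloud container clusters delete gpu-cluster --zone us-central1-a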
