GKE with GPU Tensorflow Deep MNIST
Please note: This tutorial uses Tensorflow-gpu=1.13.1, CUDA 10.0 and cuDNN 7.6
Requirements: Ubuntu 18.04 and Python 3.6
In this tutorial we will run a deep MNIST Tensorflow example with GPU.
The tutorial will be broken down into the following sections:
Install all dependencies to run Tensorflow-GPU
1.1 Installing CUDA 10.0
1.2 Installing cuDNN 7.6
1.3 Configure CUDA and cuDNN
1.4 Install Tensorflow GPU
Train the MNIST model locally
Push the image to your project's Container Registry
Deploy the model on GKE using Seldon Core
Local Testing Environment
For the development of this example a GCE Virtual Machine was used to allow access to a GPU. The configuration for this VM is as follows:
VM Image: TensorFlow from NVIDIA
8 vCPUs
32 GB memory
1x NVIDIA Tesla V100 GPU
1) Installing all dependencies to run Tensorflow-GPU
Dependencies installed in this section:
NVIDIA GPU with compute capability 3.0 or higher
CUDA 10.0
cuDNN 7.6
tensorflow-gpu 1.13.1
Check that the NVIDIA drivers are installed and your GPU's compute capability is >= 3.0
1.1) Install CUDA 10.0
Download the CUDA 10.0 runfile
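A sketch of the download step, assuming the 10.0.130 installer bundled with driver 410.48 (confirm the current link on NVIDIA's CUDA archive page):

```shell
# Fetch the CUDA 10.0 runfile installer from the NVIDIA archive
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
```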
Unpack the separate files:
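Assuming the runfile downloaded above is named `cuda_10.0.130_410.48_linux`, it can be split into the separate driver, toolkit, and samples installers with the runfile's `--extract` flag:

```shell
# Make the runfile executable and extract the bundled installers into $HOME
chmod +x cuda_10.0.130_410.48_linux
./cuda_10.0.130_410.48_linux --extract=$HOME
```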
Install the CUDA 10.0 Toolkit file:
From the terminal, run the following command
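A sketch of the install command; the exact name of the extracted toolkit installer may differ slightly, so the wildcard below is an assumption:

```shell
# Run the extracted CUDA 10.0 toolkit installer
sudo sh cuda-linux.10.0.130-*.run
```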
Hold 'd' to scroll to the bottom of the license agreement.
Accept the licensing agreement and all of the default settings.
Verify the install, by installing the sample test:
Again, accept the agreement and all default settings
Configure the runtime library:
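One common way to do this, registering the CUDA library path with the dynamic linker:

```shell
# Add the CUDA library directory to the linker search path and refresh the cache
sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
sudo ldconfig
```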
Add the cuda bin to the file system:
Add ':/usr/local/cuda/bin' to the end of the PATH variable (inside the quotes)
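One way to make the edit, via the system-wide environment file (any editor works; `nano` is just an example):

```shell
# Open the system-wide environment file and append :/usr/local/cuda/bin
# to the end of the PATH value, keeping it inside the quotes, e.g.
# PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/cuda/bin"
sudo nano /etc/environment
```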
Reboot the system
Run the tests that we set up - this takes some time to complete, so let it run for a little while...
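A sketch of the build step, assuming the samples were installed to the default CUDA 10.0 location:

```shell
# Build the CUDA samples installed in step 1.1 (this can take a while)
cd /usr/local/cuda-10.0/samples
sudo make
```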
If you run into an error involving the GCC version:
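One common workaround, assuming the error is caused by a system default GCC newer than CUDA 10.0 supports (GCC 7 is the newest supported version), is to install GCC 7 and link it into the CUDA toolchain:

```shell
# Install GCC/G++ 7 and make them the compilers CUDA picks up
sudo apt install gcc-7 g++-7
sudo ln -s /usr/bin/gcc-7 /usr/local/cuda/bin/gcc
sudo ln -s /usr/bin/g++-7 /usr/local/cuda/bin/g++
```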
Then run the build again; otherwise, skip this step.
Once the build is complete, run the deviceQuery and bandwidthTest samples:
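Assuming the default samples output directory, both tests can be run as follows; each should finish with `Result = PASS`:

```shell
# Run the compiled verification samples
cd /usr/local/cuda-10.0/samples/bin/x86_64/linux/release
./deviceQuery
./bandwidthTest
```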
Remember to clean up by removing all of the downloaded runtime packages
1.2) Install cuDNN 7.6
Download all 3 .deb files for CUDA10.0 and Ubuntu 18.04
You will need to create an NVIDIA developer account for this and go to the archive section of the cuDNN downloads page.
Ensure you download all 3 files:
Runtime
Developer
Code Samples
Install the three .deb packages in this order:
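A sketch of the install order, runtime first, then developer, then code samples (the exact filenames depend on the cuDNN 7.6 patch release you downloaded, so the wildcards are an assumption):

```shell
# Install the cuDNN packages in dependency order
sudo dpkg -i libcudnn7_7.6.*.deb
sudo dpkg -i libcudnn7-dev_7.6.*.deb
sudo dpkg -i libcudnn7-doc_7.6.*.deb
```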
Verify the install is successful with the MNIST example
From the download folder, copy the sample files somewhere you have write access:
Go to the MNIST example code, compile and run it
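Assuming the samples were installed to the default `/usr/src/cudnn_samples_v7` location, the copy, build, and run steps look like this; a successful run ends with `Test passed!`:

```shell
# Copy the cuDNN samples to a writable location, then build and run the MNIST example
cp -r /usr/src/cudnn_samples_v7/ $HOME
cd $HOME/cudnn_samples_v7/mnistCUDNN
make clean && make
./mnistCUDNN
```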
Remember to clean up by removing all of the downloaded runtime packages
1.3) Configure CUDA and cuDNN
Add LD_LIBRARY_PATH in your .bashrc file:
Add the following line at the end of your .bashrc file:
And source it with:
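The line to append (the CUPTI path is included because TensorFlow's profiler expects it):

```shell
# Append to the end of ~/.bashrc so CUDA and CUPTI libraries are found at runtime
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
```

Then reload the shell configuration with `source ~/.bashrc`.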
1.4) Install tensorflow with GPU
Version 1.13.1 is required for compatibility with CUDA 10.0:
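The pinned install:

```shell
# Pin tensorflow-gpu to 1.13.1 to match CUDA 10.0 and cuDNN 7.6
pip3 install tensorflow-gpu==1.13.1
```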
2) Train the MNIST model locally
Wrap a Tensorflow MNIST python model for use as a prediction microservice in seldon-core
Run locally on Docker to test
Deploy on seldon-core running on minikube
Dependencies
Train locally
Wrap model using s2i
Send some random features that conform to the contract
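The wrap-and-test loop can be sketched as follows; the builder image tag, local image name, and contract filename here are assumptions, so check the seldon-core docs for the current s2i builder tag:

```shell
# Wrap the trained model directory as a Seldon microservice image with s2i
s2i build . seldonio/seldon-core-s2i-python3:<VERSION> deep-mnist:0.1

# Run the image locally, then send random features that conform to contract.json
docker run --name "mnist_predictor" -d --rm -p 5000:5000 deep-mnist:0.1
seldon-core-tester contract.json 0.0.0.0 5000 -p

# Stop the container when done
docker rm -f mnist_predictor
```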
3) Push the image to Google Container Registry
Configure access to container registry (follow the configuration to link to your own project).
Tag Image with your project's registry path (Edit the command below)
Push the Image to the Container Registry (Again edit command below)
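The three steps above, as a sketch (replace `<PROJECT_ID>` with your own GCP project ID):

```shell
# Let Docker authenticate against Google Container Registry
gcloud auth configure-docker

# Tag the local image with your project's registry path, then push it
docker tag deep-mnist:0.1 gcr.io/<PROJECT_ID>/deep-mnist:0.1
docker push gcr.io/<PROJECT_ID>/deep-mnist:0.1
```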
4) Deploy in GKE
Spin up a GKE Cluster
For this example only one node is needed within the cluster. The cluster should have the following config:
8 CPUs
30 GB Total Memory
1 Node with 1X NVIDIA Tesla V100 GPU
Ubuntu Node image
Leave the rest of the config as default.
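If you prefer the CLI to the Cloud Console, a cluster matching the config above can be created with something like the following; the cluster name and zone are illustrative:

```shell
# One n1-standard-8 node (8 vCPUs, 30 GB) with a single V100 and the Ubuntu node image
gcloud container clusters create mnist-gpu-cluster \
  --zone us-central1-a \
  --num-nodes 1 \
  --machine-type n1-standard-8 \
  --accelerator type=nvidia-tesla-v100,count=1 \
  --image-type UBUNTU
```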
Connect to your cluster and check the context.
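A sketch of the connection step (substitute your own cluster name, zone, and project):

```shell
# Fetch credentials for the cluster and point kubectl at it
gcloud container clusters get-credentials <CLUSTER_NAME> --zone <ZONE> --project <PROJECT_ID>

# Confirm kubectl is using the new cluster
kubectl config current-context
```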
Installing NVIDIA GPU device drivers
(The below command is for the Ubuntu Node Image - if using a COS image, please see the Google Cloud Documentation for the correct command).
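For the Ubuntu node image, Google provides a driver-installer DaemonSet (the URL below is taken from the GKE GPU documentation; verify it is still current):

```shell
# Install the NVIDIA GPU device drivers on Ubuntu nodes
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
```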
Setup Seldon Core
Use the setup notebook to set up the cluster with an Ambassador ingress and install Seldon Core. Instructions are also available online.
Build the Seldon Graph
First lets look at the Seldon Graph Yaml file:
Change the image name in this file (line 24) to match the path to the image in your container registry.
Next, we are ready to build the seldon graph.
Check the deployment is running
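Assuming the graph definition lives in a file such as `deep_mnist.json` (the filename is illustrative), it can be applied and checked with:

```shell
# Create the SeldonDeployment and watch its pods come up
kubectl apply -f deep_mnist.json
kubectl get seldondeployments
kubectl get pods
```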
Test the deployment with test data
Change the IP address to the External IP of your Ambassador deployment.
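An illustrative request; the deployment name `deep-mnist` and API path are assumptions based on Seldon's Ambassador routing convention of that era, and a real MNIST request needs 784 features rather than the shortened payload shown here:

```shell
# Send a prediction request through Ambassador (replace <AMBASSADOR_IP>)
curl -s -X POST http://<AMBASSADOR_IP>/seldon/deep-mnist/api/v0.1/predictions \
  -H "Content-Type: application/json" \
  -d '{"data": {"ndarray": [[0.0, 0.1, 0.2]]}}'
```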
Clean up
Make sure you delete the cluster once you have finished with it to avoid any ongoing charges.