TensorFlow is an open-source framework for high-performance numerical computation. It is used in both research and production systems on a variety of platforms, from mobile and edge devices to desktops and clusters of servers. Its main use is machine learning, and especially deep learning.

In this post, we will show you how you can use your MooseFS cluster with TensorFlow to train deep neural networks!

Training and evaluation on separate machines

Imagine you have a machine with 1-4 GPUs and you want to use all of them for training, but you still want to see the evaluation results. If you do not want to run the evaluation job on a CPU only, you can run the evaluation on another GPU-enabled host. We will use a simple configuration with three servers with TensorFlow installed: one for model training, one for evaluation, and one for the TensorBoard graph visualization tool.

Hardware configuration

In this demo we are using three hosts with the following hardware/software:

Training host

GPU (optional, recommended)

MooseFS Client (required)

MooseFS Chunkserver (optional)

Evaluation host

GPU (optional, recommended)

MooseFS Client (required)

MooseFS Chunkserver (optional)

We assume the training and evaluation hosts each have one or more GPUs. This is the recommended configuration, since GPUs speed up the training of deep neural networks many times over. The third machine is used only for visualizing the training/evaluation process: it runs TensorBoard, which reads logs from a specified directory.

Of course, you can also install TensorFlow on your chunkservers, or install a MooseFS Chunkserver on your GPU-enabled TensorFlow host.

Computation on separate nodes

In this case, we consider training TensorFlow models on separate machines. We assume operating systems are freshly installed on the hosts and that the hosts are in the same local network. The TensorFlow machines have MooseFS Clients connected to the MooseFS Storage Cluster.

TensorFlow with MooseFS

Installation

We will set up two identical machines for TensorFlow, each with 4 CPU cores, 16 GB of RAM, and one GPU – a Tesla K80. The machines will use standard 100 GB hard drives. The configuration is shown in the screenshot below. We will create one machine, install TensorFlow and MooseFS on it, and clone it afterward.

Machines with GPUs cost more! Check the pricing or use the Pricing Calculator to estimate costs. You cannot create GPU-enabled machines with free trial credit; you will be charged.

TensorFlow machine parameters

You need to add port 6006 to the firewall rules for tensorflow0 to be able to see training progress in TensorBoard. We will add the network tag tensorboard to the tensorflow0 machine. This can be done only when the machine is switched off.
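Once port 6006 is open, TensorBoard can be started on tensorflow0 and pointed at the log directory on the shared MooseFS mount. The mount point and log path below are assumptions for illustration – substitute your actual MooseFS mount point:

```shell
# Start TensorBoard on tensorflow0, reading the event files that the
# training and evaluation hosts write to the shared MooseFS mount.
# /mnt/mfs/logs is an assumed path, not taken from the original post.
tensorboard --logdir=/mnt/mfs/logs --host=0.0.0.0 --port=6006
```

Because both hosts write their event files to the same MooseFS directory, a single TensorBoard instance can show training and evaluation curves side by side.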

MooseFS in Google Cloud – Adding a network tag for TensorBoard

We will create a firewall rule for the tensorboard network tag:

MooseFS in Google Cloud – Create firewall rule for TensorBoard

Install Python

Let us install Python 3 and TensorFlow on the tensorflow0 machine. We will start by updating Ubuntu.

sudo apt update
sudo apt upgrade -y

Python 3.5.2 should already be preinstalled, so let us check the version of Python 3 by typing:
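A standard way to do this is:

```shell
# Print the Python 3 version.
# On a stock Ubuntu 16.04 image this should report Python 3.5.2.
python3 --version
```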

Install CUDA

Now we are ready to install TensorFlow. We want the GPU-enabled version of TensorFlow, so let us check the installed hardware. But first, we will need to install CUDA; CUDA 9.0 is our recommendation for Ubuntu 16.04 and TensorFlow 1.9.0.
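Once CUDA 9.0 (and cuDNN) are in place, checking the GPU and installing the GPU build of TensorFlow can look roughly like this. The version pin matches the TensorFlow 1.9.0 mentioned above; the exact steps are a sketch, not verified on this exact image:

```shell
# Check that the Tesla K80 is visible on the PCI bus.
lspci | grep -i nvidia || true

# Install pip for Python 3, then the GPU build of TensorFlow 1.9.0,
# which is the release matched with CUDA 9.0 above.
sudo apt install -y python3-pip
sudo pip3 install tensorflow-gpu==1.9.0
```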

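The download step for the pretrained checkpoint is not shown; assuming the standard Inception v3 archive used by the TensorFlow Inception tutorial (the URL below is an assumption – verify it before use), fetching and unpacking it would look like this:

```shell
# Download and unpack the pretrained Inception v3 checkpoint.
# URL assumed from the TensorFlow Inception tutorial; verify before use.
curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
tar xzf inception-v3-2016-03-01.tar.gz
```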
This will create a directory called inception-v3, which contains the following files:

ls inception-v3
# README.txt checkpoint model.ckpt-157585

Now we will start training on the flowers dataset.

# Build the model. Note that TensorFlow must already be installed and
# working, as this command builds only the Inception targets, not
# TensorFlow itself.
cd ~/tensorflow/models/research/inception
bazel build //inception:flowers_train
bazel build //inception:flowers_eval
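The built binaries can then be launched on the two hosts, with all paths pointing at the shared MooseFS mount so that the evaluation host sees checkpoints as soon as the training host writes them. The flag names come from the Inception tutorial's flowers scripts; the mount point and directory layout below are assumptions for illustration:

```shell
# On the training host: fine-tune from the pretrained checkpoint,
# writing checkpoints and event files to the shared MooseFS mount
# (/mnt/mfs is an assumed mount point).
bazel-bin/inception/flowers_train \
  --train_dir=/mnt/mfs/flowers-train \
  --data_dir=/mnt/mfs/flowers-data \
  --pretrained_model_checkpoint_path=/mnt/mfs/inception-v3/model.ckpt-157585 \
  --fine_tune=True

# On the evaluation host: periodically evaluate the newest checkpoint
# found in the shared train directory.
bazel-bin/inception/flowers_eval \
  --eval_dir=/mnt/mfs/flowers-eval \
  --data_dir=/mnt/mfs/flowers-data \
  --checkpoint_dir=/mnt/mfs/flowers-train
```

Because both commands read and write the same MooseFS directories, no manual copying of checkpoints between hosts is needed.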

The model is saved every 5,000 steps, and the evaluation code looks for a new checkpoint every 5 minutes. After 2 hours of training, your TensorBoard should look similar to this:

Evaluation in TensorBoard

Summary

In this post, you have learned how to use MooseFS with TensorFlow to run evaluation on another host and evaluate the model while it is still training. TensorFlow writes logs and saves models to MooseFS storage, so all of the files are available on both machines.

You can also run another training job simultaneously and get more results with different parameters faster!