Running Image Classification Models On Cloud TPUs

Beta

This is a Beta release of
Google Cloud TPU. This API
is not covered by any SLA or deprecation policy and may be subject to backward-incompatible
changes.

This tutorial shows you how to train the TensorFlow ResNet-50 model
on Google Cloud TPU. You can apply the same pattern to other TPU-optimized
image classification models that use TensorFlow and the ImageNet dataset.
For example, the TensorFlow SqueezeNet model is also optimized
for running on TPUs.

The object detection model, RetinaNet,
has also been optimized for TPUs.

Disclaimer

This tutorial uses a third-party dataset. Google provides no
representation, warranty, or other guarantees about the validity, or any other
aspects of, this dataset.

This tutorial uses
tf.estimator to train the
model. tf.estimator is a high-level TensorFlow API and is the recommended way
to build and run a machine learning model on Google Cloud TPU. The API
simplifies the model development process by hiding most of the low-level
implementation, making it easier to switch between TPU and other platforms such
as GPU or CPU.

Before you begin

Before starting this tutorial, check that your Cloud project is correctly set
up, and create a Compute Engine VM and a TPU resource.

This section is identical to the Quickstart guide. If
you already completed the Quickstart without deleting your VM and TPU resource,
you can skip directly to getting the data.

--machine-type=n1-standard-4 is a standard machine
type with 4 virtual CPUs and 15 GB of memory. See
Machine Types for other
available options.

--image-project=ml-images specifies the shared
collection of images that makes the tf-1-6 image family
available for your use.

--image-family=tf-1-6 selects an image with the
required pip package for TensorFlow.

--scopes=cloud-platform allows the VM to access
Cloud Platform APIs.
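Putting these flags together, the VM creation command looks like the following sketch. The VM name demo-vm and the zone us-central1-f are example values; substitute your own:

```shell
# Create a Compute Engine VM to drive the TPU.
# demo-vm and us-central1-f are example values; pick your own name
# and a zone where Cloud TPUs are available.
gcloud compute instances create demo-vm \
  --machine-type=n1-standard-4 \
  --image-project=ml-images \
  --image-family=tf-1-6 \
  --scopes=cloud-platform \
  --zone=us-central1-f
```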

Create a new Cloud TPU resource. For this example, name the resource
demo-tpu. Keep in mind that billing begins as soon as the
TPU is created and continues until it is deleted. (Check the
Cloud TPU pricing page to
estimate your costs.) If you are using a dataset that requires a substantial
download and processing phase, hold off on running this command until you
are ready to use the TPU:

--range specifies the address range of the created TPU resource and can be
any value in 10.240.*.*/29. For this example, use 10.240.1.0/29.
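As a sketch, the full TPU creation command looks like the following. The zone and TensorFlow version are example values; match them to the VM you created above:

```shell
# Create the Cloud TPU resource. Billing starts once this succeeds.
# us-central1-f and the version number are example values.
gcloud beta compute tpus create demo-tpu \
  --range=10.240.1.0/29 \
  --version=1.6 \
  --zone=us-central1-f
```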

Get the data

Below are the instructions for using a randomly generated fake dataset
to test the model. Alternatively, you can use the full ImageNet dataset.

Use the fake dataset for testing purposes

The fake dataset is at this location on Cloud Storage:

gs://cloud-tpu-test-datasets/fake_imagenet

Note that the fake dataset is only useful for understanding how to use a
Cloud TPU and for validating end-to-end performance. The accuracy numbers and
saved model will not be meaningful.

Grant storage access to the TPU

You need to give your TPU read/write access to Cloud Storage objects.
To do that, you must grant the required access to the service account used by
the TPU. Follow these steps to find the TPU service account and grant the necessary
access:

List your TPUs to find their names:

$ gcloud beta compute tpus list

Use the describe command to find the service account of your
TPU, where demo-tpu is the name of your TPU resource:

$ gcloud beta compute tpus describe demo-tpu

Copy the name of the TPU service account from the output of the
describe command. The name has the format of an email
address, like 12345-compute@my.serviceaccount.com.
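With the service account name in hand, one way to grant it read/write access is through a bucket ACL with gsutil. The bucket name and service account below are placeholders; replace them with your own values:

```shell
# Grant the TPU service account write access to the bucket so it can
# read training data and write checkpoints and logs.
# Replace the account and bucket with your own values.
gsutil acl ch -u 12345-compute@my.serviceaccount.com:WRITER gs://my-bucket-name
```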

Run the ResNet-50 model

-L 6006:localhost:6006 forwards the TensorBoard port from the VM to your
local machine.
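This flag is passed through to the underlying SSH client when you connect to the VM. For example, assuming your VM is named demo-vm:

```shell
# Connect to the VM and forward TensorBoard's port 6006 to the
# local machine. demo-vm is an example VM name.
gcloud compute ssh demo-vm -- -L 6006:localhost:6006
```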

Run the following command, where demo-tpu is the name of the TPU resource
you created earlier:

(vm)$ export TPU_NAME=demo-tpu

Run the following command, replacing [DATA_DIR] with
gs://cloud-tpu-test-datasets/fake_imagenet if you are using the
fake dataset, or with the path to the Cloud Storage bucket containing your
training data:

(vm)$ export DATA_DIR=[DATA_DIR]

Create a Cloud Storage bucket to store the trained model and training logs.
Cloud Storage bucket names must be unique:
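As a sketch, assuming a bucket named my-bucket-name and that you launch the model through a resnet_main.py entry point from the Cloud TPU sample code (the script name and flags may differ in your checkout):

```shell
# Create a bucket for model output (the name must be globally unique).
gsutil mb gs://my-bucket-name

# Point the model at the bucket and start training.
export MODEL_DIR=gs://my-bucket-name/resnet
python resnet_main.py \
  --tpu_name=$TPU_NAME \
  --data_dir=$DATA_DIR \
  --model_dir=$MODEL_DIR
```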

What to expect

The above procedure trains the ResNet-50 model for 100 epochs, evaluating
after a fixed number of steps. With the default flags, the model should train
to above 75% accuracy.

TPU-specific modifications to the ResNet-50 model

The ResNet code in this tutorial uses
TPUEstimator
which is based on the high-level Estimator
API. A few code changes are required to convert an Estimator-based model to a
TPUEstimator-based model for training.

Clean up

To avoid ongoing charges, delete the resources you created during this
tutorial when you are done with them.

Select the route that Google automatically created as part of the
Cloud TPU setup. The ID of the peering entry starts with
peering-route.

At the top of the Network Routes page, click Delete to delete the
selected route.

When you've finished examining the data, use the gsutil command to
delete any Cloud Storage buckets you created during this tutorial. (See the
Cloud Storage pricing guide for free storage limits and other
pricing information.) Replace my-bucket-name with the name of your
Cloud Storage bucket:

$ gsutil rm -r gs://my-bucket-name

Using the full ImageNet dataset

Download and convert the ImageNet data:

Sign up for an ImageNet account. Remember the
username and password you used to create the account.

Create a Cloud Storage bucket for the dataset. Cloud Storage bucket names must
be unique:

Run the imagenet_to_gcs.py script to download, format, and upload the
ImageNet data to the bucket. The script requires around 300 GB of free space
on your local machine. Replace [USERNAME] and [PASSWORD] with
the username and password you used to create your ImageNet account.
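A sketch of the invocation, assuming the script's documented flags at the time of writing (verify the flag names against the version of the script you downloaded):

```shell
# Download ImageNet, convert it to TFRecord files, and upload the result
# to the bucket. The project, bucket, and scratch directory are example
# values; replace them with your own.
python imagenet_to_gcs.py \
  --project=my-project \
  --gcs_output_path=gs://my-bucket-name/imagenet \
  --local_scratch_dir=/tmp/imagenet \
  --imagenet_username=[USERNAME] \
  --imagenet_access_key=[PASSWORD]
```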

What's next

Run the TensorFlow SqueezeNet model on Google Cloud TPU, using
the above instructions as your starting point. The model architectures for
SqueezeNet and ResNet-50 are similar. You can use the same data and the same
command-line flags to train the model.