Google Cloud Storage

Overview

Google Cloud Storage is often used along with CloudML to manage and serve training data. This article provides details on:

Copying and synchronizing files between your local workstation and Google Cloud.

Reading data from Google Cloud Storage buckets from within a training script.

Varying data source configuration between local script development and CloudML training.

Copying Data

Google Cloud Storage is organized around storage units named “buckets”, which are roughly analogous to filesystem directories. You can copy data between your local system and cloud storage using the gs_copy() function. For example:
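A minimal sketch of copying files in both directions (the bucket name and file paths here are hypothetical):

```r
library(cloudml)

# copy a file from a Google Cloud Storage bucket to the local
# working directory
gs_copy("gs://my-bucket/training-data.csv", "training-data.csv")

# copy a local file up to the bucket
gs_copy("training-data.csv", "gs://my-bucket/training-data.csv")

# the gs_rsync() function synchronizes the contents of a bucket
# with a local directory (only changed files are transferred)
gs_rsync("gs://my-bucket/data", "data")
```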

Note that to use these functions you first need to load the cloudml package with library(cloudml).

Reading Data

There are two distinct ways to read data from Google Cloud Storage; which you use depends on whether the TensorFlow API you are working with supports direct references to gs:// bucket URLs.

If you are using the TensorFlow Datasets API, then you can use gs:// bucket URLs directly. In this case you’ll want to use the gs:// URL when running on CloudML, and a synchronized local copy of the bucket when running locally. You can use the gs_data_dir() function to accomplish this. For example:
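A sketch of this pattern (the bucket name and file name are hypothetical):

```r
library(cloudml)

# resolves to the gs:// URL when running on CloudML, and to a
# synchronized local copy of the bucket when running locally
data_dir <- gs_data_dir("gs://my-bucket")

# build paths relative to the resolved data directory; these can be
# passed to TensorFlow APIs that accept gs:// URLs
mtcars_csv <- file.path(data_dir, "mtcars.csv")
```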

While some TensorFlow APIs can take gs:// URLs directly, in many cases a local filesystem path will be required. If you want to store data in Google Storage but still use it with APIs that require local paths you can use the gs_data_dir_local() function to provide the local path.
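For instance, a sketch of reading a CSV file through a local-path-only API (the bucket name and file name are hypothetical):

```r
library(cloudml)

# download the contents of the bucket to a local directory and
# return the path to that directory
data_dir <- gs_data_dir_local("gs://my-bucket")

# the returned local path works with APIs that can't read gs:// URLs
train_data <- read.csv(file.path(data_dir, "train.csv"))
```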

Note that if the path passed to gs_data_dir_local() is from the local filesystem it will be returned unmodified.
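For instance, assuming a local directory named "data" exists:

```r
library(cloudml)

# a plain local path is passed through untouched,
# so data_dir is simply "data"
data_dir <- gs_data_dir_local("data")
```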

Data Source Configuration

It’s often useful to do training script development with a local subsample extracted from the complete set of training data. In this configuration, you’ll want your training script to use the local subsample during development, then use the complete dataset stored in Google Cloud Storage when running on CloudML. You can accomplish this with a combination of training flags and the gs_data_dir_local() function described above.

Here’s a complete example. We start with a training script that declares a flag for the location of the training data:
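Such a script might look like the following sketch (the flag name, default path, and file names are hypothetical; flags() and flag_string() come from the tfruns package):

```r
library(tfruns)
library(cloudml)

# declare a flag for the data directory; the default points to a
# local subsample used during development
FLAGS <- flags(
  flag_string("data_dir", "data")
)

# resolve the data directory: a gs:// URL passed via the flag is
# downloaded and its local path returned; a plain local path is
# returned unmodified
data_dir <- gs_data_dir_local(FLAGS$data_dir)

# read the training data from the resolved directory
train_data <- read.csv(file.path(data_dir, "train.csv"))
```

When training on CloudML you would then override the flag with the gs:// URL of the full dataset, while local development runs fall back to the default subsample.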