How to Train Deep Learning Models on AWS Spot Instances Using Spotty

Spotty is a tool that simplifies training of Deep Learning models on AWS.

Why will you ❤️ this tool?

it makes training on AWS GPU instances as simple as training on your local computer

it automatically manages all necessary AWS resources including AMIs, volumes and snapshots

it lets anyone train your model on AWS with just a couple of commands

it detaches remote processes from SSH sessions

it saves you up to 70% of the costs by using Spot Instances

To show how it works, let’s take a non-trivial model and train it. I chose one of the implementations of Tacotron 2, Google’s speech synthesis system.

Clone the repository of Tacotron 2 to your computer:

git clone https://github.com/Rayhane-mamah/Tacotron-2.git

Docker Image

Spotty trains models inside a Docker container. So we need to either find a publicly available Docker image that satisfies the model’s requirements, or create a new Dockerfile with a proper environment.

This implementation of Tacotron uses Python 3 and TensorFlow, so we could use the official TensorFlow image: tensorflow/tensorflow:latest-gpu-py3. But this image doesn’t satisfy all the requirements from the “requirements.txt” file, so we need to extend it and install the missing libraries on top.
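Such a Dockerfile might look roughly like the following sketch. The base image tag, the libsndfile1 system dependency (commonly needed by audio packages such as librosa), and the file paths are assumptions; check them against the repository’s actual requirements:

```dockerfile
# Sketch: extend the official TensorFlow GPU image with the model's dependencies
FROM tensorflow/tensorflow:latest-gpu-py3

# System library assumed to be needed by the audio packages
RUN apt-get update \
    && apt-get install -y --no-install-recommends libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

# Install the model's Python dependencies on top of the base image
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```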

Section 1: Project

Name of the project: it’s used in the names of AWS resources, for example, in the name of the S3 bucket that synchronizes the project code with the instance.

Remote directory: the directory where the project will be stored on the instance.

Synchronization filters: filters exclude directories that shouldn’t be synchronized with the instance. For example, we ignore PyCharm configuration, Git files, Python cache files, and training data.

List of volumes: each volume has a name, a directory where it will be mounted, and a size. The first time you start an instance, the volume is created. When you stop the instance, a snapshot is taken and automatically restored the next time.

Docker: here we set the path to our Dockerfile. Alternatively, you can build the image locally, push it to the Docker Hub Registry, and use the image name instead of a file. We also set a working directory, which will be used by the scripts from the “scripts” section. Finally, we can point the Docker data root to a directory on an attached volume, so that downloaded images are saved with the volume’s snapshot and restoring the image takes less time on the next start.

Ports: ports to expose. In this example, we open two ports: 6006 for TensorBoard and 8888 for Jupyter Notebook.
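Putting the items above together, a spotty.yaml for this project might look roughly like the following. This is a sketch based on Spotty’s v1 configuration format; the exact key names, region, instance type, paths, and volume sizes are assumptions to adapt to your setup:

```yaml
project:
  name: tacotron-2
  remoteDir: /workspace/project
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - training_data/*

instance:
  region: us-east-1
  instanceType: p2.xlarge
  volumes:
    # created on the first start, snapshotted on stop
    - name: workspace
      directory: /workspace
      size: 50
    # Docker data root on a volume, so pulled images are
    # restored from the snapshot on the next start
    - name: docker
      directory: /docker
      size: 10
  docker:
    file: docker/Dockerfile
    workingDir: /workspace/project
    dataRoot: /docker
  ports: [6006, 8888]
```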

Spotty Installation

Requirements

At a minimum, you’ll need Python 3 with pip, and the AWS CLI installed and configured with your credentials.

Installation

1. Install Spotty using pip:

$ pip install -U spotty

2. Create an AMI with NVIDIA Docker. Run the following command from the root directory of your project (where the spotty.yaml file is located):

$ spotty create-ami

In a few minutes, you will have an AMI that can be reused for all your projects within the same AWS region.

Model Training

1. Start a Spot Instance with the Docker container:

$ spotty start

Once the instance is up and running, you will see its IP address. Use it later to open TensorBoard and Jupyter Notebook.

2. Download and preprocess the data for the Tacotron model. We already have a custom script for that in the configuration file; just run:

$ spotty run preprocess

3. Once the preprocessing is done, train the model. Run the “train” script:

$ spotty run train
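For reference, the scripts invoked by “spotty run” live in the “scripts” section of spotty.yaml. A hypothetical sketch, where the exact commands and log directory depend on the Tacotron-2 repository:

```yaml
scripts:
  preprocess: |
    python preprocess.py
  train: |
    python train.py
  tensorboard: |
    tensorboard --logdir /workspace/project/logs
  jupyter: |
    jupyter notebook --allow-root --ip 0.0.0.0
```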

On a “p2.xlarge” instance it will probably take around 8–9 days to reach 120k steps, but you could use instances with more powerful GPUs to speed up training.

You can detach this SSH session by pressing Ctrl-b, then d. The training process won’t be interrupted. To reattach the session, just run the spotty run train command again.

TensorBoard

Start the TensorBoard using the “tensorboard” script:

$ spotty run tensorboard

TensorBoard will be running on port 6006. You can detach the SSH session by pressing Ctrl-b, then d; TensorBoard will keep running.

Jupyter Notebook

You can use Jupyter Notebook to download trained models to your computer. Use the “jupyter” script to start it:

$ spotty run jupyter

Jupyter Notebook will be running on port 8888. Open it using the IP address of the instance and the URL that you see in the output of the command.

SSH Connection

To connect to the running Docker container via SSH, use the following command:

$ spotty ssh

It uses a tmux session, so you can always detach it by pressing Ctrl-b, then d, and reattach it later by running the spotty ssh command again.

Don’t forget to stop the instance once you are done! Use the following command:

$ spotty stop

When you stop the instance, Spotty automatically creates snapshots of the volumes and restores them the next time you start an instance.

Conclusion

Using Spotty is a convenient way to train Deep Learning models on AWS Spot Instances. It will save you not only up to 70% of the costs, but also a lot of time on setting up an environment for your models and notebooks. Once you have a Spotty configuration for your model, anyone can train it with a couple of commands.

If you enjoyed this post, please star the project on GitHub, click the 👏 button and share this post with your friends.