Distributed PaddlePaddle Training Core Concepts

Distributed Training Job

Each Kubernetes job is described by a job config file, which specifies information such as the number of pods in the job and environment variables.

In a distributed training job, we would:

prepare partitioned training data and configuration file on a distributed file system (in this tutorial we use Amazon Elastic File System), and

create and submit the Kubernetes job config to the Kubernetes cluster to start the training job.
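The second step can be sketched as follows. This is a minimal, hypothetical job config — the job name, image, environment variables, and pod count are illustrative, not prescribed by this tutorial:

```shell
# Write a minimal Kubernetes Job config (all names and values are illustrative).
cat > job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-cluster-job
spec:
  parallelism: 2        # number of pods running concurrently
  completions: 2        # total pods that must finish
  template:
    metadata:
      name: paddle-cluster-job
    spec:
      containers:
      - name: trainer
        image: paddlepaddle/paddle   # illustrative image name
        env:
        - name: TRAINER_COUNT        # illustrative environment variable
          value: "2"
      restartPolicy: Never
EOF
# Submit it to the cluster:
# kubectl create -f job.yaml
```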

Parameter Servers and Trainers

There are two roles in a PaddlePaddle cluster: parameter server (pserver) and trainer. Each parameter server process maintains a shard of the global model. Each trainer has its local copy of the model and uses its local data to update the model. During training, trainers send model updates to parameter servers, which aggregate these updates so that trainers can synchronize their local copies with the global model.

In order to communicate with the pservers, each trainer needs to know the IP address of every pserver. In Kubernetes it is better to use a service discovery mechanism (e.g., DNS hostnames) rather than static IP addresses, since any pserver's pod may be killed and a new pod could be scheduled onto another node with a different IP address. However, we currently use static IP addresses; this will be improved.

The parameter server and trainer are packaged into the same Docker image. They run once the pod is scheduled by the Kubernetes job.

Trainer ID

Each trainer process requires a trainer ID, a zero-based index value, passed in as a command-line parameter. The trainer process thus reads the data partition indexed by this ID.
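A minimal sketch of how a trainer entry point might map the trainer ID to a data partition; the `/data` mount point and the `train-part-N` file layout are illustrative assumptions, not from this tutorial:

```shell
# The trainer ID arrives as the first command-line argument (zero-based).
TRAINER_ID=${1:-0}               # default to 0 when run without arguments
DATA_DIR=${DATA_DIR:-/data}      # illustrative mount point of the shared data
PART_FILE="$DATA_DIR/train-part-$TRAINER_ID"
echo "trainer $TRAINER_ID reads $PART_FILE"
```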

Training

The entry-point of a container is a shell script. It can see some environment variables pre-defined by Kubernetes. This includes one that gives the job's identity, which can be used in a remote call to the Kubernetes apiserver that lists all pods in the job.

We rank the pods by sorting their IP addresses; the rank of each pod serves as its "pod ID". Because we run one trainer and one parameter server in each pod, we can use this pod ID as the trainer ID. The detailed workflow of the entry-point script is as follows:

Query the apiserver to get pod information, and assign the trainer_id by sorting the IPs.

Copy the training data from the EFS persistent volume into the container.

Parse the paddle pserver and paddle trainer startup parameters from environment variables, and then start up the processes.
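The first step above can be sketched in shell. In a real entry-point script the pod IPs would come from a query to the Kubernetes apiserver; here they are hard-coded for illustration:

```shell
# IPs of all pods in the job (normally obtained from the apiserver).
POD_IPS="10.1.0.7
10.1.0.3
10.1.0.5"
MY_IP="10.1.0.5"                 # this pod's own IP

# Sort the IPs numerically octet by octet, find this pod's rank,
# and make it zero-based to obtain the trainer ID.
TRAINER_ID=$(printf '%s\n' "$POD_IPS" \
  | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n \
  | grep -n "^$MY_IP$" | cut -d : -f 1)
TRAINER_ID=$((TRAINER_ID - 1))
echo "trainer_id=$TRAINER_ID"    # prints trainer_id=1 for this example
```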

PaddlePaddle on AWS with Kubernetes

Choose AWS Service Region

This tutorial requires several AWS services to be available in the same region. Before we create anything in AWS, please check the following link:
https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/
Choose a region where the following services are available: EC2, EFS, VPC, CloudFormation, KMS, S3.
In this tutorial, we use "Oregon (us-west-2)" as an example.

Create AWS Account and IAM Account

Under each AWS account, we can create multiple IAM users. This allows us to grant some privileges to each IAM user and to create/operate AWS clusters as an IAM user.

To sign up for an AWS account, please follow this guide.
To create IAM users and user groups under an AWS account, please follow this guide.

Please be aware that this tutorial requires the IAM user to have sufficient privileges for the services listed above (EC2, EFS, VPC, CloudFormation, KMS, S3).

EC2 key pair

After creating a key pair, you will use the key pair name to configure the cluster.

Key pairs are only available to EC2 instances in the same region. We are using us-west-2 in this tutorial, so make sure to create key pairs in that region (Oregon).

Your browser will download a key-name.pem file which is the key to access the EC2 instances. We will use it later.
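For example, ssh refuses key files that are readable by others, so the downloaded file usually needs its permissions tightened first. A sketch using a placeholder file — substitute your actual key-name.pem, and note that the `core` login user is an assumption (typical for CoreOS-based instances):

```shell
# Placeholder for the key file your browser downloaded.
touch key-name.pem
# Restrict it to owner-read-only, as ssh requires.
chmod 400 key-name.pem
ls -l key-name.pem
# Later, to log into an EC2 instance (address is illustrative):
# ssh -i key-name.pem core@<instance-public-ip>
```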

KMS key

Amazon KMS keys are used to encrypt and decrypt cluster TLS assets. If you already have a KMS key that you would like to use, you can skip creating a new key and provide the ARN string for your existing key.

Version: Its value has to be exactly "2012-10-17".
AWS_ACCOUNT_ID: You can get it from the following command:

aws sts get-caller-identity --output text --query Account

MY_CLUSTER_NAME: Pick a MY_CLUSTER_NAME that you like; you will use it later as well.
Please note, the stack name must satisfy the regular expression pattern [a-zA-Z][-a-zA-Z0-9]*, which means it must start with a letter and may contain only letters, digits, and hyphens (no underscores "_"), or kube-aws will throw an error in later steps.
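A quick shell check of a candidate name against that pattern; the name paddle-cluster is just an example:

```shell
MY_CLUSTER_NAME="paddle-cluster"
# Validate: must start with a letter, then letters, digits, or hyphens only.
if echo "$MY_CLUSTER_NAME" | grep -Eq '^[a-zA-Z][-a-zA-Z0-9]*$'; then
  VALID=yes
else
  VALID=no
fi
echo "$MY_CLUSTER_NAME valid: $VALID"   # prints: paddle-cluster valid: yes
```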

External DNS name

When the cluster is created, the controller will expose the TLS-secured API on a DNS name.

The DNS name should have a CNAME record pointing to the cluster DNS name, or an A record pointing to the cluster IP address.

We will need to use the DNS name later in the tutorial. If you don't already own one, you can choose any DNS name (e.g., paddle) and modify /etc/hosts on your local machine to associate the cluster IP with that DNS name. You can also add a name service (Route 53) record in AWS to associate the IP with paddle for the cluster. We will find the cluster IP in later steps.

S3 bucket

You need to create an S3 bucket before starting up the Kubernetes cluster.

There are some bugs in the AWS CLI for creating S3 buckets, so let's use the S3 Console instead.

Click on Create Bucket, fill in a unique BUCKET_NAME, and make sure the region is us-west-2 (Oregon).

Initialize Assets

Create a directory on your local machine to hold the generated assets:

$ mkdir my-cluster
$ cd my-cluster

Initialize the cluster CloudFormation stack with the KMS ARN, key pair name, and DNS name from the previous steps:
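A hypothetical kube-aws init invocation follows. The flag names are from the CoreOS kube-aws tool and all uppercase values are placeholders you must substitute; verify the flags against your installed kube-aws version:

```shell
# Placeholders: MY_CLUSTER_NAME, MY_EXTERNAL_DNS_NAME, KEY_PAIR_NAME,
# AWS_ACCOUNT_ID, and KMS_KEY_ID come from the previous steps.
kube-aws init \
  --cluster-name=MY_CLUSTER_NAME \
  --external-dns-name=MY_EXTERNAL_DNS_NAME \
  --region=us-west-2 \
  --availability-zone=us-west-2a \
  --key-name=KEY_PAIR_NAME \
  --kms-key-arn="arn:aws:kms:us-west-2:AWS_ACCOUNT_ID:key/KMS_KEY_ID"
```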

Setup Elastic File System for Cluster

Look up the security group ID for paddle-cluster-sg-worker (sg-055ee37d in the image below).

Add a security group paddle-efs with an ALL TCP inbound rule whose custom source is the group ID of paddle-cluster-sg-worker, and with the VPC set to paddle-cluster-vpc. Make sure the availability zone is the same as the one you used in Initialize Assets.

Create the Elastic File System in the EFS console with the paddle-cluster-vpc VPC. Make sure the subnet is paddle-cluster-Subnet0 and the security group is paddle-efs.