“our model predicts that a GPU is 32% slower than a TPU for this specific scenario”;

We can expect to train BERT on 64 GPUs (the equivalent of 4 TPU pods) in 5 1/3 days or 8 1/2 days. On an 8 GPU machine with V100s or RTX 2080 Tis, regardless of software (PyTorch, TensorFlow) and parallelization algorithm, one can expect to train BERT in 42 days or 68 days. For a standard 4 GPU desktop with RTX 2080 Tis (much cheaper than the other options), one can expect to replicate BERT in 99 days;
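
As a rough sanity check on these figures, the 8 GPU numbers can be reproduced from the 64 GPU numbers if one assumes training time scales inversely with GPU count. That linear-scaling assumption is ours for illustration and is not stated in the text; the helper name `scale_days` is likewise hypothetical. A minimal sketch:

```python
from fractions import Fraction


def scale_days(days_on_ref, ref_gpus, target_gpus):
    """Rescale a training-time estimate to a different GPU count.

    Assumes perfectly linear scaling (time is inversely proportional
    to the number of GPUs). This is an illustrative assumption only.
    """
    return days_on_ref * Fraction(ref_gpus, target_gpus)


# The two 64 GPU estimates quoted above: 5 1/3 days and 8 1/2 days.
estimates_64 = [Fraction(16, 3), Fraction(17, 2)]

# Rescale each estimate from 64 GPUs down to a single 8 GPU machine.
estimates_8 = [scale_days(days, 64, 8) for days in estimates_64]

for d64, d8 in zip(estimates_64, estimates_8):
    print(f"{float(d64):.2f} days on 64 GPUs -> {float(d8):.2f} days on 8 GPUs")
```

Under this assumption the 8 GPU estimates come out to 128/3 (just under 43) days and exactly 68 days, which lines up with the quoted 42 and 68 days. The 99-day figure for the 4 GPU desktop does not follow from the same simple rescaling, so it presumably rests on additional modeling that this sketch does not attempt to reproduce.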