Pokemon Go: How the cloud saved the smash hit game from collapse

Pokemon Go's creators reveal how they worked with Google to keep the service afloat in the face of explosive demand.

The smash hit game Pokemon Go provided perhaps the ultimate test of the claim that cloud computing can scale to meet demand.

The augmented reality collect-'em-up has been downloaded more than 500 million times, making it one of the most popular mobile apps of 2016.

The sheer number of people hankering to catch Pokemon not only generated headlines but also caught Pokemon Go's creator, Niantic, by surprise.

The game, which relies on more than a dozen Google Cloud Platform services, created far more network traffic than Niantic thought possible, ultimately generating five times the demand the company expected under its worst-case scenario.

The first sign that Niantic may have underestimated player numbers came within 15 minutes of launching Pokemon Go, which initially went live in Australia and New Zealand, when player traffic ballooned to levels far beyond the company's expectations.

"We ended up 50 percent over our worst case after day one, we figured this was going to be bad within six hours," Phil Keslin, CTO at Niantic, told the Google Cloud Platform Next conference in London.

The clamour for the game was all the more challenging given the small size of the Niantic engineering team handling the Pokemon Go launch.

"The team that made this happen was four engineers," said Keslin.

Given the unexpected surge during the Australia and New Zealand launch, Niantic appealed to Google for help with the US release the next day. Niantic worked with about 40 Google Cloud Platform staff, drawn from the Customer Reliability Engineering, Site Reliability Engineering, development, product, support and executive teams, to ramp up the allocated infrastructure to cope with traffic levels far higher than originally envisioned.

However, the launch period didn't pass without incident. Following the US and Australian launches, players complained of downtime and dropped connections, as Google Cloud Platform services such as Google BigQuery and BigTable came under unforeseen pressure.

"BigTable, BigQuery, everything was blowing up when this was happening," said Keslin.

The problems even prompted Werner Vogels, CTO at Google Cloud Platform rival Amazon Web Services, to tweet asking Niantic to let him know if there was anything AWS could do to help.

Niantic and Google worked together to solve each issue, reviewing every part of the architecture to identify the source of each problem. Complicating matters were the millions of new players joining the game each day.

In an operation described as 'swapping out a plane's engine in-flight', Google and Niantic upgraded Pokemon Go to run on a newer version of Google Container Engine (GKE) at a time when millions of new players were signing up to play the game. The move allowed more than one thousand additional nodes to be added to Pokemon Go's container cluster, in preparation for the high demand expected for the forthcoming launch in Japan.

Other upgrades carried out on the live production system included replacing the Network Load Balancer with the more sophisticated HTTP/S Load Balancer, in order to offer faster connections to users and a higher overall traffic throughput.

The upshot was that the launch in Japan passed off without incident, despite the number of new users signing up to play being triple that of the earlier US launch.

"We had people on the Google side reprovisioning resources to get us where we needed to be," said Keslin.