Meet Kripa Krishnan, Google's queen of chaos

Earthquakes and cosmic hits are part of the arsenal that Kripa Krishnan and her DiRT team use to wreak havoc on the company to make sure it keeps running, no matter what.

Business Insider | August 15, 2016, 10:17 IST

If aliens invade the planet, or an earthquake destroys California, Google wants you to know that it will still be there for you.

That is because a team of 10 engineers goes around purposely breaking things at the company to make sure that Google can keep itself running no matter what happens.

This is Google's Disaster Recovery Testing (DiRT) team, led by director Kripa Krishnan, who has perfected the art of crashing Google over the past 10 years.

A serious business

Despite the humour-inspired scenarios she inflicts on her co-workers, the tests themselves are serious, she says.

The team takes down live systems, sometimes whole data centres. If things go badly for too long and cause Google to have a serious outage, there will be money lost and hell to pay. "Before each test, around 20-30 people gather in the war rooms, sometimes located all over the world," says Krishnan, adding that her small band roams Google, wreaking havoc, with the help of hundreds of other Google experts who are called upon to work on the tests.

But, if things go wrong, which they often do, the stress in the war room heats up, she says. For instance, during an orchestrated test on the network, the team noticed that a popular app was slowing down. They wondered if they should abort the test, which would have been dangerous at that moment, she recalls.

Within 15 minutes, they determined their test was not the problem. "But for those 15 minutes, we were yipping at each other. [There was] shouting and tears in the war room," she says.

Beg, borrow, and credit card

One time, Krishnan and her team told the data centre engineers that a flood had forced the data centre off the grid and onto a backup generator running on diesel fuel. The idea was to activate Google's procedure to release emergency funds for the purchase of a massive quantity of fuel. But, no matter what the DiRT team threw at them (a bigger flood, a fire in another room), the engineers came up with "creative solutions to find the money".

They asked the local community for help. Someone even offered the use of a credit card to pay for things. The team never called the person who would send them the emergency funds, but it never let the Google site go down either.

The HR department, too, came out on top after being presented with a scenario in which a meteor had crashed into Earth, stranding employees all over the world.

Krishnan says HR was bombarded with requests, such as approval to expense a US$15,000 flight home or to buy clothes because of lost luggage. The HR team "shocked" the DiRT team by organising itself and handling the onslaught well.

Next up: automated 'chaos engineering'

Google isn't the only internet giant in the Valley that needs to make sure it never goes down. A small team of testers from other companies has started to work together to share best practices. "They call their young discipline chaos engineering," Krishnan says. "Right now, scale is our problem. We are doing hundreds of tests, but I cannot scale my team to hundreds of people. So we are exploring automating some of this," she adds.

Nearly a decade of testing has taught Krishnan one thing: it's not enough to have disaster plans. People must test, change and perfect them. "We want people to practice so they get the right concepts. And then we trust them to wing it. Give them space for solving the problem," she says.