Cloud TV Redundancy: Minimizing Risk and Ensuring Uptime

By Ilan Petrydes. Ilan Petrydes is a Pre-Sales Director at Viaccess–Orca. Ilan has 20 years of experience in pre-sales management, solution architecture and product consultancy for telcos, enterprises and video streamers. Prior to VO, he worked with Cisco, Motorola-Mobility (Google) and niche software solution providers, where he successfully designed, launched and delivered innovative software solutions.

Few things keep broadcasters awake at night more than the nightmare of dead air and blank screens.

When NBC went blank during this year’s Super Bowl, it made headlines around the world, proving that dead air is a costly mistake, especially when you’re charging millions of dollars per commercial slot.

Broadcast equipment and workflows, of course, have multiple levels of redundancy built into them, but the move towards commoditised IT hardware has in some cases made systems more vulnerable. As we’ll see below, though, one way of mitigating that is to utilise the cloud. The likes of Amazon Web Services and Microsoft Azure have invested huge amounts of money in building systems with multiple layers of redundancy that stretch across continents.

There are already many compelling arguments for the deployment of Cloud TV: the move to an OPEX model, scalability, reduced time to market, reduced overhead, and more. Perhaps it is time that we added reliability to the list.

The Super Bowl LII Blackout

Over the course of many decades, the broadcast industry has evolved its systems to be extremely reliable. Redundancy is built in on so many levels that we have become used to everything simply working. When the clock ticks over and the red light goes on, broadcasters expect their equipment to work, especially in live environments.

Most of the time it does. The problem is that, of the many thousands of hours of television broadcast successfully, it’s the few seconds when it doesn’t work that people tend to notice. And if the failure comes during a very high-profile event, such as the Super Bowl, people notice even more.

NBC’s blackout during Super Bowl LII in February this year made headlines. Rather than airing a commercial break in the second quarter after Stephen Gostkowski’s field goal had cut the Eagles’ lead to 15-6, the screen went dark in many markets for nearly 30 seconds. An NBC statement said: “We had a brief equipment failure that we quickly resolved. No game action or commercial time were missed.”

It looks like a localized ad that was meant to play in certain markets failed to play out. And while that slot will not have been booked at the Super Bowl’s national ad rate of $5 million, it is still a very expensive and high-profile mistake to make.

NBC wasn’t the only broadcaster to suffer problems that evening, either. Later in the game, just as the Patriots were trying to bring the score back from a 41-33 deficit, some Hulu customers lost all audio and video. A variety of onscreen messages were flashed up, ranging from the prosaic “no content available” to a flamewar-friendly claim that “rights restrictions” meant that it could no longer be played.

Hulu felt that it had to credit affected users with a free month of its live TV service as an apology.

PlayStation Vue’s feed also had problems, all of which led The Verge to conclude: “Nearly all of the services encountered minor snags like jumping back a few seconds every so often or occasional buffering. Internet TV still isn’t quite as rock solid for sports as an OTA antenna or cable, it seems.”

Engineering Redundancy

Quite what impact the Super Bowl LII blackout will have on the growing demand for OTT sports is unknown. It has certainly shown that problems in marquee broadcasts attract equally prominent coverage. It has also concentrated people’s minds on the risks associated with live broadcast, especially as the industry embraces an increasing number of IT workflows and moves towards an IP paradigm.

The problem is that specialist broadcast equipment has always tended to be over-engineered, precisely to guard against that sort of dead air. This robust kit has then been deployed in workflows created specifically to ensure there is no single point of failure. The move towards commercial off-the-shelf hardware in IT-centric production workflows has introduced greater uncertainty and reduced reliability. While the internet is a massively distributed network with enough built-in redundancy to tolerate many points of failure, the networks used to produce and distribute video are smaller and more vulnerable. Products suitable for System A are not necessarily the best fit for System B.

But the flip side is that the internet has exposed a weakness in what we may term traditional broadcast hardware redundancy: the equipment to be used for Disaster Recovery has been housed in the same building as the main systems. In terms of hardware failure, broadcasters have been covered. However, in a building-wide event, such as a natural disaster or even the triggering of internal sprinkler systems, they have been a lot more vulnerable.

This is why there has been a growing trend for national broadcasters to ensure that, for example, regional newsrooms and galleries can take over from the main studio in case of outages. It is also why the cloud has become an area of increasing interest to ensure broadcast reliability.

Minimizing Risk with the Cloud

Minimizing the risk of dead air by using the cloud may sound counter-intuitive to anyone who has heard tales of cloud-based, virtualized broadcast environments falling over under pressure. But, apart from the fact that many of those tales are apocryphal, the speed at which cloud services are evolving has to be taken into account, and one of the areas in which they have started to excel is reliability.

This is all about the number of ‘9s’ that providers offer in their services. For broadcasters, an overall reliability of 99% (‘two 9s’) is not really much of an option. For 24/7 operation, that would result in a frankly alarming 7.3 hours of downtime and dead air every month. The following table illustrates the amount of downtime each service level allows.
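The downtime behind each level of 9s is simple arithmetic. As a minimal sketch, assuming an average month of 730 hours (8,760 hours in a year divided by 12, which is the figure implied by the 7.3 hours quoted above), the following Python works the numbers out. The function and constant names are illustrative only.

```python
# Assumption: an "average" month of 730 hours (8,760 hours per year / 12).
HOURS_PER_MONTH = 730.0


def downtime_hours_per_month(nines: int) -> float:
    """Permitted downtime per month for an availability of `nines` 9s.

    Two 9s means 99% availability, i.e. an unavailability of 10**-2.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return HOURS_PER_MONTH * 10.0 ** -nines


def availability_label(nines: int) -> str:
    """Render the availability as text, e.g. 2 -> '99%', 4 -> '99.99%'."""
    if nines == 1:
        return "90%"
    decimals = "9" * (nines - 2)
    return "99" + ("." + decimals if decimals else "") + "%"


def format_duration(hours: float) -> str:
    """Pick a readable unit (hours, minutes or seconds) for a duration."""
    if hours >= 1:
        return f"{hours:.1f} hours"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.1f} minutes"
    return f"{minutes * 60:.1f} seconds"


if __name__ == "__main__":
    for nines in range(2, 7):
        label = availability_label(nines)
        downtime = format_duration(downtime_hours_per_month(nines))
        print(f"{label:>9}  {downtime} of downtime per month")
```

Each extra 9 cuts the permitted downtime by a factor of ten, which is why the gap between two 9s and five or six 9s matters so much for a 24/7 channel.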

When it comes to cloud computing, as you add more 9s to the degree of service reliability, historically you have also added more cost. What has changed in recent times is the number of providers now able to offer 100% availability. Indeed, the tables at Cloud Harmony show an impressive number of 100% availability results over a 30-day period for most of the major cloud providers across the Compute, Storage, CDN and DNS categories.

What’s more, as a broadcaster or operator you don’t have to concern yourself with designing redundancy into a cloud-based system, because the cloud providers have already done that for you. Use the public cloud in particular, and you are purchasing access to a geographically distributed system that often stretches across continents and has been engineered on a scale much vaster than any broadcaster can envisage.

As Ian Cockett, CTO of playout specialist Pebble Beach, points out in his excellent article on cloud playout, cloud services bring another advantage as well: the ability to scale.

His point is that you don’t always need, for instance, six 9s of reliability throughout a typical broadcast month. For normal broadcast operations, you could consider scaling back your requirements in periods of low demand, say during repeats of a reality TV show screened in the wee hours of the morning. Then, when you have something really important underway (and somewhat mischievously he uses the example of the Super Bowl here), you scale redundancy up to address the potential risk to revenue in case of failure.
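One way to picture this kind of scaling is with the standard formula for redundant systems in parallel: if each of n playout chains is available a fraction a of the time and they fail independently, at least one is up with probability 1 - (1 - a)^n. The sketch below uses that formula to estimate how many chains a given target would need. It is a simplified model and not a description of Cockett's approach: the independence assumption, perfect failover, the 99.5% single-chain figure and all function names are assumptions made for illustration.

```python
def combined_availability(single: float, instances: int) -> float:
    """Availability of `instances` independent chains running in parallel.

    Assumes failures are independent and failover is instant and perfect.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return 1.0 - (1.0 - single) ** instances


def instances_needed(single: float, target: float, max_instances: int = 16) -> int:
    """Smallest number of parallel chains whose combined availability meets `target`."""
    if not (0.0 < single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availabilities must be strictly between 0 and 1")
    for count in range(1, max_instances + 1):
        if combined_availability(single, count) >= target:
            return count
    raise ValueError("target not reachable within max_instances")


if __name__ == "__main__":
    single_chain = 0.995  # hypothetical availability of one playout chain
    schedule = {
        "overnight repeats": 0.9999,    # four 9s
        "marquee live event": 0.999999,  # six 9s
    }
    for programme, target in schedule.items():
        count = instances_needed(single_chain, target)
        print(f"{programme}: {count} parallel chains for a target of {target}")
```

Under these assumptions the overnight schedule needs fewer chains than the marquee event, which is the essence of paying for extra redundancy only when the revenue at risk justifies it.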

Cloud TV Redundancy and the Risk-Averse Broadcaster

One of the interesting little sidebars in this story of cloud TV development is the role Netflix has played in it and in shaping AWS into such a market leader in particular. In 2010, Netflix said in a landmark blog post that it was choosing to use AWS rather than sink further effort into building its own data centers.

“Letting Amazon focus on data center infrastructure allows our engineers to focus on building and improving our business,” the company said. “Amazon calls their web services ‘undifferentiated heavy lifting,’ and that’s what it is. The problems they are trying to solve are incredibly difficult ones, but they aren’t specific to our business. Every successful internet company has to figure out great storage solutions, hardware failover, networking infrastructure, etc.

“We want our engineers to focus as much of their time as possible on product innovation for the Netflix customer experience; that is what differentiates us from our competitors.”

In mid-2015, after five years of work, Netflix closed its last major data center.

While the likes of Amazon Web Services can achieve 100% reliability across their data centers, there are more factors at work than that alone. For a start, workflows need to be designed to take advantage of cloud data center redundancy. The temptation with any degree of virtualization is simply to port over the existing workflow and replicate it in software, but this can carry all manner of avoidable problems over with it as well.

There are always going to be some factors outside of an operator’s control in any broadcast. Famously, in 2013 during Super Bowl XLVII, an electrical device that had been installed expressly to prevent a power outage in fact caused a half-hour blackout and halted the game at the New Orleans Superdome. No broadcaster can do anything about that. But the redundancy built into the cloud offerings as a matter of course increasingly means that a significant part of the broadcast workflow can be offloaded onto systems designed from the ground up to be more redundant than the baseline for broadcast.