Netflix – Exemplar blueprint for Cloud Native Computing

The uber poster child of migrating legacy applications and IT systems via the ‘Cloud Native’ approach is Netflix. Not only do they share their best practices via blogs, they also share the software they’ve created to make it possible via open source.

Migrating to Web-Scale IT

They describe how pioneering organizations like Netflix are entirely embracing a Cloud paradigm for their business, moving away from the traditional approach of owning and operating your own data centre populated by EMC, Oracle and VMware.

Instead they are moving to ‘web scale IT’ via on demand rental of containers, commodity hardware and NoSQL databases, but critically it’s not just about swapping out the infrastructure components.

Cloud Migration Best Practices

In this blog they focus on the migration of the core Netflix billing systems from their own data centre to AWS, and from Oracle to a Cassandra / MySQL combination, emphasizing in particular the scale and complexity of this database migration part of the Cloud Migration journey.

This inital quote from the Netflix blog sets the scene accordingly:

On January 4, 2016, right before Netflix expanded itself into 130 new countries, Netflix Billing infrastructure became 100% AWS cloud-native.

They also reference a previous blog also describing this overall AWS journey, again quickly making the most incisive point – this time describing the primary inflection point in CIO decision making that this shift represents, a move to ‘Web Scale IT‘:

That is when we realized that we had to move away from vertically scaled single points of failure, like relational databases in our datacenter, towards highly reliable, horizontally scalable, distributed systems in the cloud.

Cloud Migration: Migrating Mission-critical Systems

They then go on to explain their experiences of a complex migration of highly sensitive, operational customer systems from their own data centre to AWS.

As you might imagine the core customer billing systems are the backbone of a digital delivery business like Netflix, handling everything from billing transactions through reporting feeds for SOX compliance, and face a ‘change the tyre while the car is still moving’ challenge of keeping front-facing systems available and consistent to ensure unbroken service for a globally expanding audience, while conducting a background process of migrating terabytes of data from on-site enterprise databases into the AWS service.

We had billions of rows of data, constantly changing and composed of all the historical data since Netflix’s inception in 1997. It was growing every single minute in our large shared database on Oracle. To move all this data over to AWS, we needed to first transport and synchronize the data in real time, into a double digit Terabyte RDBMS in cloud.

Being a SOX system added another layer of complexity, since all the migration and tooling needed to adhere to our SOX processes.

Netflix was launching in many new countries and marching towards being global soon.

Billing migration needed to happen without adversely impacting other teams that were busy with their own migration and global launch milestones.”

The scope of data migration and the real-time requirements highlight the challenging nature of Cloud Migrations, and how it goes far beyond a simple lift and shift of an application from one operating environment to another.

Database Modernization

The backbone of the challenge was how much code and data was interacting with Oracle, and so their goal was to ‘disintegrate’ that dependency into a services based architecture.

“Moving a database needs its own strategic planning:

Database movement needs to be planned out while keeping the end goal in sight, or else it can go very wrong. There are many decisions to be made, from storage prediction to absorbing at least a year’s worth of growth in data that translates into number of instances needed, licensing costs for both production and test environments, using RDS services vs. managing larger EC2 instances, ensuring that database architecture can address scalability, availability and reliability of data. Creating disaster recovery plan, planning minimal migration downtime possible and the list goes on. As part of this migration, we decided to migrate from licenced Oracle to open source MYSQL database running on Netflix managed EC2 instances.”

Overall this transformation scope and exercise included:

APIs and Integrations – The legacy billing systems ran via batch job updates, integrating messaging updates from services such as gift cards, and billing APIs are also fundamental to customer workflows such as signups, cancellations or address changes.

Globalization – Some of the APIs needed to be multi-region and highly available, so data was split into multiple Cassandra data stores. A data migration tool was written that transformed member billing attributes spread across many tables in oracle into a much smaller Cassandra structure.

ACID – Payment processing needed ACID transaction, and so was migrated to MySQL. Netflix worked with the AWS team to develop a multi-region, scalable architecture for their MySQL master with DRBD copy and multiple read replicas available in different regions, with toolingn and alerts for MySQL instances to ensure monitoring and recovery as needed.

Data / Code Purging – To optimize how much data needed migrated, the team conducted a review with business teams to identify what data was still actually live, and from that review purged many unnecessary and obsolete data sets. As part of this housekeeping obsolete code was also identified and removed.

A headline challenge was the real-time aspect, ‘changing the tyre of the moving car’, migrating data to MySQL that is constantly changing. This was achieved through Oracle GoldenGate, which could replicate their tables across heterogeneous databases, along with ongoing incremental changes. It took a heavy testing period of two months to complete the migration via this approach.

Downtime Switchover

Downtime was needed for this scale of data migration, and to mitigate impact for users Netflix employed an approach of ‘decoupling user facing flows to shield customer experience from downtimes or other migration impacts’.

All of their tooling was built around ability to migrate a country at a time and funnel traffic as needed. They worked with ecommerce and membership services to change integration in user workflows to an asynchronous model, building retry capabilities to rerun failed processing and repeat as needed.

An absolute requirement was SOX Compliance, and for this Netflix made use of components available in their OSS open source suite.

Our Cloud deployment tool Spinnaker was enhanced to capture details of deployment and pipe events to Chronos and our Big Data Platform for auditability. We needed to enhance Cassandra client for authentication and auditable actions. We wrote new alerts using Atlas that would help us in monitoring our applications and data in the Cloud.

Building HA, Globally Distributed Cloud Applications with AWS

Netflix provides a detailed, repeatable best practice case study for implementing AWS Cloud services, at an extremely large scale, and so is an ideal baseline candidate for any enterprise organization considering the same types of scale challenges, especially with an emphasis on HA – High Availability.

Two Netflix presentations: Globally Distributed Cloud Applications, and From Clouds to Roots provide a broad and deep review of their overall global architecture approach, in terms of exploiting AWS with the largest and most demanding of of capacity and growth requirements, such as hosting tens of thousands of virtual server instances to operate the Netflix service, auto-scaling by 3k/day.

This goes into a granular level of detail of how they monitor performance, and then additionally in they focus specifically on High Availability Architecture, providing a broad and deep blueprint for this scenario requirements.

Netflix Spinnaker – Global Continuous Delivery

In short these address the two core, common requirements of enterprise organizations, their global footprint and associated application hosting and content delivery requirements, and also their own software development practices – How better can they optimize the IT and innovation processes that deploys the software systems that needs this infrastructure.

Build Code Like Netflix – Continuous Deployment

The ideal of our ‘repo guide’ for the Netflix OSS suite is for it to function as a ‘recipe’ for others to follow, ie You too can Build Code Like Netflix.

Cloud Native Toolchain

In this blog Global Continuous Delivery With Spinnaker they explain how it addresses this scope of the code development lifecycle, across global teams, and forms the backbone of their DevOps ‘toolchain’, integrating with other tools such as Git, Nebula, Jenkins and Bakery.

As they describe:

Spinnaker is an open source multi-cloud Continuous Delivery platform for releasing software changes with high velocity and confidence. Spinnaker is designed with pluggability in mind; the platform aims to make it easy to extend and enhance cloud deployment models.

Moving from Asgard

Their history leading up to the conception and deployment of Spinnaker is helpful reading too; previously they utilized a tool called ‘Asgard’, and in Moving from Asgard:, describe the limitations they reached using that type of tool, and how instead they sought a new tool that could achieve: