
Building Resilience in Netflix Production Data Migrations: Sangeeta Handa at QCon SF

At QCon San Francisco, held in the Bay Area, USA, Sangeeta Handa discussed how Netflix had learned to build resilience into production migrations across a number of use cases. Key lessons learned included: aim for perceived or actual zero downtime, even for data migrations; invest in early feedback loops to build confidence; find ways to decouple customer interactions from your services; and build independent checks and balances using automated data comparison and reconciliation.

Handa, billing infrastructure engineering manager at Netflix, opened the talk by noting that there are several known risks when migrating data alongside a change in functionality or operational requirements, including data loss, data duplication, and data corruption. These risks can lead to unexpected system behaviour in the migrated state, performance degradation post-migration, or unexpected downtime. All of these outcomes have a negative customer impact, and any migration plan should explicitly attempt to mitigate them.

With a large-scale geographically distributed customer base, many organisations cannot afford to take systems fully offline during any maintenance procedures, and therefore some variation of a "zero-downtime" migration is the only viable method of change. Variants of this approach that the Netflix team uses within a migration that involves data include: "perceived" zero downtime, "actual" zero downtime, and no migration of state. Handa presented three use cases that explored each of these approaches.

The first use case discussed was a migration within the billing application infrastructure. This application handles recurring subscriptions, allows customers to view their billing history, and provides many other billing-related functions. The billing infrastructure is busy 24x7 because Netflix's 130+ million subscribers are distributed all over the globe -- downtime during any migration is therefore not an option. The plan for this first use case was to migrate the underlying data storage operating system and database from an Oracle system running in a Netflix data center to a MySQL service running in the AWS cloud. The migration involved billions of rows and over 8TB of constantly changing data.

Initial experiments with the snapshotting and reloading of data, and also with Apache Sqoop, proved fruitless. Ultimately, a "perceived zero-downtime" migration approach was taken, which involved copying data at the record level, table by table, alongside the operational reads and writes undertaken on the table as part of the daily workloads. This approach allowed the migration to be paused at an arbitrary time, and for failures to be caught at the (business) record level. Confidence in the resilience of the migration was established by comparing daily automated snapshots from the source and target databases, by checking row counts, and by running test workloads and reports on the target database before fully switching over to the new target database as the single source of truth.
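The record-level, table-by-table approach can be sketched as follows. This is a minimal illustration, not Netflix's implementation: the dict-based "databases", function names, and batch size are all hypothetical stand-ins for the Oracle source and MySQL target.

```python
# Hypothetical sketch: record-level, table-by-table copy with a row-count check.
# The plain dicts stand in for the Oracle source and the MySQL target.

def migrate_table(source_rows, target, table, batch_size=2):
    """Copy records in small batches, so the migration can pause and resume."""
    target.setdefault(table, {})
    batch = []
    for key, row in source_rows.items():
        batch.append((key, row))
        if len(batch) >= batch_size:
            for k, r in batch:
                target[table][k] = dict(r)  # record-level copy
            batch = []
    for k, r in batch:  # flush any trailing partial batch
        target[table][k] = dict(r)

def verify_row_counts(source, target):
    """Confidence check: compare per-table row counts on both sides."""
    return {t: (len(rows), len(target.get(t, {}))) for t, rows in source.items()}

source = {"billing": {1: {"amount": 10}, 2: {"amount": 20}, 3: {"amount": 5}}}
target = {}
for table, rows in source.items():
    migrate_table(rows, target, table)
print(verify_row_counts(source, target))  # {'billing': (3, 3)}
```

A real migration would also compare snapshots of the row contents, not just the counts, as described above.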

During the "final flip", the new target MySQL database had to be fully synchronised with the source database before application data writes could continue, which would result in actual downtime. However, the team implemented perceived zero-downtime by allowing read-only access to the new MySQL database while the synchronisation process completed and by buffering writes within a (AWS SQS) queue. As soon as the database state was synchronised after the flip, the contents of the queue were replayed into the application, which in turn issued the appropriate writes against the new target database.
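The buffering trick during the final flip can be illustrated with a small sketch. This is an assumption-laden toy, not the actual Netflix code: an in-process deque stands in for the AWS SQS queue, and the class and method names are invented for illustration.

```python
from collections import deque

# Hypothetical sketch of the "final flip": reads stay available, writes are
# buffered in a queue (AWS SQS in Netflix's case) while the target database
# synchronises, then the buffered writes are replayed once the flip completes.

class FlipCoordinator:
    def __init__(self, db):
        self.db = db          # the new target database (toy dict here)
        self.syncing = False  # True while the final synchronisation runs
        self.buffer = deque() # stands in for the SQS queue

    def write(self, key, value):
        if self.syncing:
            self.buffer.append((key, value))  # buffer rather than lose the write
        else:
            self.db[key] = value

    def read(self, key):
        return self.db.get(key)  # read-only access is served throughout

    def finish_flip(self):
        self.syncing = False
        while self.buffer:
            key, value = self.buffer.popleft()
            self.db[key] = value  # replay buffered writes in order

db = {"acct-1": "active"}
c = FlipCoordinator(db)
c.syncing = True
c.write("acct-2", "trial")            # buffered during the flip
assert c.read("acct-1") == "active"   # reads still succeed
c.finish_flip()
print(db)  # {'acct-1': 'active', 'acct-2': 'trial'}
```

The customer never sees downtime: reads succeed throughout, and buffered writes become durable shortly after the flip completes.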

The second use case presented was the rewriting of the balance service, which included a migration of a smaller amount of data, also with a simpler structure, from a Cassandra database to a MySQL database. This was undertaken because MySQL was a better fit for the querying characteristics of the corresponding service.

This migration was achieved by first creating a "data integrator" service as a data access layer between the application service and the underlying databases, which could effectively shadow writes from the primary Cassandra instance across to the new MySQL data store. The data integrator was also responsible for reporting metrics based on any unexpected results or data mismatches between the two writes.
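A shadow-writing data access layer of this kind might look roughly like the sketch below. The class names and the deliberately flaky shadow store are hypothetical, introduced only to show how a mismatch between the two writes would surface as a metric.

```python
# Hypothetical sketch of the "data integrator" layer: every write goes to the
# primary store (Cassandra) and is shadowed to the new store (MySQL); any
# mismatch between the two results is recorded as a metric.

class FlakyStore(dict):
    """Toy stand-in for a new datastore that silently drops one write."""
    def __setitem__(self, key, value):
        if key != "drop-me":
            super().__setitem__(key, value)

class DataIntegrator:
    def __init__(self, primary, shadow):
        self.primary = primary
        self.shadow = shadow
        self.mismatches = 0  # would feed a metrics/monitoring system

    def write(self, key, value):
        self.primary[key] = value  # Cassandra remains the source of truth
        self.shadow[key] = value   # shadow write to MySQL
        if self.shadow.get(key) != self.primary.get(key):
            self.mismatches += 1   # unexpected result: report a metric

integrator = DataIntegrator(primary={}, shadow=FlakyStore())
integrator.write("cust-1", {"balance": 40})
integrator.write("drop-me", {"balance": 0})  # shadow write silently fails
print(integrator.mismatches)  # 1
```

Because the application talks only to the integrator, the datastores behind it can be swapped without touching application code.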

The second part of the migration process involved asynchronously "hydrating" data from regular Cassandra snapshots into MySQL. The data integrator would attempt to read data from MySQL, and fall back to Cassandra (which was still the source of truth) if the data had not yet been migrated. Writes continued to be shadowed to both datastores. Metrics for unexpected issues were again captured, and regular comparisons and reconciliations were made between the data stores. Any data with discrepancies was re-migrated from Cassandra to MySQL, with a complete rewrite of the customer's data in MySQL even if there was only a single issue.
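The read-with-fallback path and the reconciliation rule can be sketched briefly. Again this is a toy under stated assumptions: plain dicts stand in for the two datastores, and the function names are illustrative.

```python
# Hypothetical sketch of the hydrate-and-fallback read path: reads try the
# new store (MySQL) first and fall back to the source of truth (Cassandra);
# reconciliation rewrites all of a customer's data if any discrepancy is found.

def read(customer, mysql, cassandra):
    """Serve from MySQL if hydrated, otherwise fall back to Cassandra."""
    return mysql[customer] if customer in mysql else cassandra[customer]

def reconcile(customer, mysql, cassandra):
    """On any mismatch, completely rewrite the customer's data from the source."""
    src = cassandra[customer]
    if mysql.get(customer) != src:
        mysql[customer] = dict(src)  # full rewrite, even for a single issue

cassandra = {"cust-1": {"balance": 40}, "cust-2": {"balance": 15}}
mysql = {"cust-1": {"balance": 35}}  # stale hydrated copy; cust-2 not yet migrated

print(read("cust-2", mysql, cassandra))  # {'balance': 15} via the fallback path
reconcile("cust-1", mysql, cassandra)
print(mysql["cust-1"])                   # {'balance': 40}
```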

This hydrate, shadow and compare process was continued until the required confidence of correctness had been achieved. The "final flip" in this migration simply involved making MySQL the single source of truth within the data integrator, and removing the underlying Cassandra data store.

The final migration use case discussed was the rewriting of the Netflix invoicing system. This required the creation of an entirely new path of functionality and a corresponding service that would also require a new data store. The existing invoice data set was extremely large and had a high rate of change. The plan was to move the data from the existing MySQL database to a new Cassandra instance, while not impacting a customer's ability to interact with the invoicing functionality.

After analysis and several design discussions, the team realised that for this migration the corresponding data did not have to be "migrated" per se. The decision was taken to build the new service using the new Cassandra database, and run this in parallel with the "legacy" service and data store. By placing a "billing gateway" in front of the two services, a programmatic decision could be taken on which invoice service to call, based on whether a customer's invoicing data had been migrated or not.
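The gateway's routing decision can be sketched in a few lines. The service names, the migrated-customer set, and the return values below are all illustrative assumptions, not the actual Netflix API.

```python
# Hypothetical sketch of the "billing gateway": requests are routed to the new
# or legacy invoice service depending on whether the customer's invoicing data
# has already been migrated.

MIGRATED = {"cust-42"}  # customers whose invoices now live in the new Cassandra store

def legacy_invoice_service(customer):
    return f"legacy-invoice:{customer}"  # backed by the existing MySQL database

def new_invoice_service(customer):
    return f"new-invoice:{customer}"     # backed by the new Cassandra instance

def billing_gateway(customer):
    """Route per customer, so migration can proceed one cohort at a time."""
    if customer in MIGRATED:
        return new_invoice_service(customer)
    return legacy_invoice_service(customer)

print(billing_gateway("cust-42"))  # new-invoice:cust-42
print(billing_gateway("cust-7"))   # legacy-invoice:cust-7
```

Routing at this per-customer granularity is what makes the incremental, cohort-by-cohort "canary" rollout described below possible.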

Handa and her team "canary released" the new service by incrementally migrating small subsets of data from the old system to the new (for example, copying the invoice data of all users from a small country). Confidence in the resilience of the migration was built by testing in "shadow mode", where writes were shadowed across the two services and the results compared and reconciled at the data store level. Extensive monitoring and alerting were used to catch unexpected behaviour and state. The effective "migration" of the data took over a year to achieve as part of the operation of the normal invoicing functionality, but there was zero downtime and no impact to the customer.

Summarising the talk, Handa shared the key lessons that her team had learned: aim for zero downtime, even for data migrations; invest in early feedback loops to build confidence; find ways to decouple customer interactions from your services; build independent checks and balances through automated data reconciliation; and canary release new functionality: "roll out small, learn, repeat".