Durability theater or real-world safety? How to engineer against data loss

Durability and availability of user data are of paramount importance at Dropbox. Replication and backups are essential strategies for durability in any storage system, but they’re insufficient to protect against failure modes like operator errors or software bugs. Recovery mechanisms are critical, but they only provide safety when corruption is detected in time. What are the real threats to reliability at scale, and what mechanisms can be used to guard against them? Where does durability theater end and real reliability begin?

This talk will cover common risk factors for downtime and data loss in distributed storage systems, along with strategies for protecting against them, based on experience gained building storage systems that manage hundreds of petabytes of data. It will cover design principles for architecting storage systems from the ground up, along with tooling and operational support for managing storage infrastructure at scale.

This session is sponsored by Dropbox.

James Cowling

Dropbox, Inc.

James Cowling is the technical lead and manager of the Storage Infrastructure team at Dropbox. He spends his time thinking about durability, exabyte-scale storage systems, and simple solutions to difficult problems. James received his PhD from MIT, specializing in distributed transaction processing, and conducted research in distributed systems, fault tolerance, and consensus protocols. He left academia for industry to build the kinds of systems he once researched, focusing on the practical realities of large-scale systems design.