Volume 2 of a series on Cloud System Administration is about designing and operating large distributed systems. Who should read it?

The goal of this book is to help you build and run the best cloud-scale service possible, though the authors concentrate more on general distributed systems than on cloud specifics. The authors say the end result of their ideal environment is that business objectives are met, which they admit is quite boring. However, as most developers and system designers will know, it is a goal that is met far too rarely.

The book is divided into two main parts:

Design: Building It and Operations: Running It

The building part begins with a chapter on the issues of designing in a distributed world, considering three main design options: load balancing with multiple backend replicas; a server with multiple backends; and a tree of servers. The CAP principle (consistency, availability and partition tolerance) is discussed, and the chapter gives a good high-level view of distributed system options. Having given this conceptual overview, the authors move on to a chapter on designing for operations, which covers operational requirements such as queue draining, replicated databases, hot swaps, access controls and rate limits. Next, the authors look at selecting a service platform, weighing factors such as the level of service abstraction, physical versus virtual machines, and how resource sharing levels affect compliance, privacy, cost and control.

A chapter on application architectures starts with a single web server, and moves up through multi-machine designs to cloud-scale services. The authors then look at message buses and service-oriented architectures. There are then two good chapters on design patterns – for scaling and resiliency.

Part II considers operations once the design has been implemented. Running operations in a distributed world is covered first, followed by an interesting chapter on the rise of the devops culture, where developers and operations engineers work together as one team that shares responsibility for a website or service. The build and deployment phases of service delivery are next on the agenda, and there are good chapters on upgrading live services and on automation, the latter contrasting the different levels of automatic system management and their benefits and problems.

From this point onwards the book moves more to the nuts and bolts of ongoing system management; there are chapters on ‘oncall’ (the practice of having a group of people take turns being responsible for emergencies); disaster preparedness; monitoring system fundamentals; and monitoring architecture and practice. Capacity planning, KPIs, and ‘operational excellence’ round off this part of the book.

The book also includes Part III, made up of appendices, including a useful set of assessments laying out questions to ask and things to look out for in the various areas of operational responsibility detailed earlier in the book.


Conclusion

Overall, this book does a good job of describing the concepts and processes of designing and creating distributed systems. Despite the ‘cloud’ in the title, it’s not particularly cloud-specific; the material applies to distributed systems in general. From a programmer’s perspective, the discussion of devops is interesting, but the book is better suited to less technical readers. On the whole, this is a book for people who need to put together distributed systems and who don’t know that much about them.