Designing and Implementing Asynchronous Distributed Systems: Challenges, Strategies, and a Million Things That Go Wrong

Following an explosion of interest in asynchronous programming techniques, the emergence of languages and frameworks to match, and the growth of “push” as a design element contributing to a UX users now expect on both web and mobile, we find renewed focus on the design and implementation of asynchronous systems and data processing layers. Such systems lie at the heart of many modern messaging, analytics, and web services.

However, engineers and dev/ops face many challenges in implementing and supporting such infrastructure, from conceiving an architecture, selecting platforms, ensuring application availability + performance, and pushing the heavy lifting out-of-band. While these challenges are difficult, an emphasis on systemic thought is critical, requiring of one the capacity to reason about the distributed but deterministic behavior of client- and server-side interaction at every level of the stack. Distributed asynchronous systems are beautiful in motion, but only cool-headed systemic reasoning can save you in a tailspin.

This proposal is language-agnostic, focusing instead upon concepts and strategies applicable to many programming languages with specific examples in static languages like Java/Scala, conventional dynamic languages such as Python/Ruby, and emerging platforms such as Node.js.

Primary components include:

1) Architectural considerations for designing a real-time distributed edge tier supporting one to one million clients per server.

2) Strategies for building a distributed worker / processing tier to support real-time, potentially streaming communication with clients connected to edge nodes on the network.

3) A litany of unwelcome surprises, challenges, and failure modes common in such distributed systems, along with strategies to identify and mitigate them.

2010’s engineering community saw a flood of debates regarding the merits, complexity, and scalability of evented, threaded, and corountine-oriented architectures. These concurrent IO strategies are best framed in terms of benefits and tradeoffs, including factors such as ease of implementation, language preference, debuggability / testability, and multicore scalability. This proposal includes a survey of these options along with their strengths and implicit tradeoffs, concluding that simple evented/coroutine-oriented systems may offer easier implementation, but that a hybrid threaded/evented design will generally offer better performance and can take advantage of manycore systems.

Regardless of the implementation chosen for an edge server, the parlance of “never blocking” in an IO loop to guarantee an open channel of client communication tends to necessitate a separate processing tier as a baseline. When fast-moving data streams are involved, CPU-intensive processing, or simply for the sake of separating concerns, pushing the processing of this information back to a distributed worker tier enables teams to scale processing, messaging, and analytics tiers independently from edge nodes. This proposal discusses two approaches – a map/reduce strategy for processing very large jobs requiring extensive compute power and memory, and a distributed worker strategy for small but latency-sensitive tasks with an open-source reference implementation called “Octobot” <http://octobot.taco.cat>.

Distributed asynchronous systems differ from serial, central systems in that it can be tremendously difficult to reason about and account for a variety of error states that can occur. Put another way, there are few tasks more sweat-inducing than identifying the cause of a cascading failure in a production system pushing 10+ megabits of logs per second as pagers buzz off the table. Here, anecdotes of several failures in fast-moving production systems provide a jumping-off point for a discussion best practices. These include spec’ing out the behavior of clients explicitly in documentation and in code as a state machine; applying thorough instrumentation to all server systems and queues (both explicit and in-app); applying monitoring wherever possible; sampling/profiling systems at runtime to identify variance in advance; relentless analysis and graphing of logs and historical performance data to identify trends/bugs and provide for capacity planning; and considering the implications of 2x, 10x, and 100x growth in clients / events processed.

Wrapping Up –

Asynchronous distributed systems lie at the heart of many modern messaging, analytics, and web architectures. While driving these real-time services can be difficult, they can offer unmatched availability and scalability (both vertical and horizontal) when built well. However, sustained systemic reasoning is critical to their design, implementation, and management. Though many emerging languages and frameworks ease the implementation of specific components, planning, foresight, and monitoring are essential to proper operation. Following many bruises and a lot of hard work, we’ve found that the strategies described above have enabled us to create a robust, scalable, resilient distributed infrastructure.

People planning to attend this session also want to see:

Scott Andreas

Boundary Inc.

Scott Andreas is an Engineer at Boundary, Inc. hell-bent on quality, efficiency, and performance in highly-concurrent network programming and asynchronous distributed systems.

With a background in Java, Scala, and Ruby and new production deployments in Erlang, his current work involves designing, implementing, and deploying event processing systems with a team of engineers to create a real-time distributed network analytics platform. Previously at Urban Airship, Andreas worked with a team to design and implement a scalable mobile messaging platform backed by a clustered service designed to serve millions of concurrent clients on a handful of commodity servers.

Red-lining services under development, measuring results, and quantifying improvements in terms of infrastructure cost and business value is a favorite past-time.