A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. ~Antoine de Saint-Exupery -- Note, the opinions stated here are mine alone and are not those of any past, present, or future employer. --

Sunday, October 10, 2010

Foursquare had a well-discussed outage last week. This wasn't good news for Foursquare, but it can be for the rest of us. By looking at what happened, we can all learn steps to take to avoid a similar occurrence. I do want to state emphatically that I am not writing this to criticize Foursquare. Scaling a platform is a tricky problem, often fraught with issues that are difficult to predict but completely obvious once they've occurred. Rather, the point of this article is to discuss the lessons I took from the incident to improve future development at Rearden.

Lesson 1 - Instrument Everything

If you do something or use something, log metrics about it. You have entities, components, services, and data stores. Your system receives requests and events. Every request and event should be instrumented with timings, the resources it uses, and metrics on any interesting volumes of work it does (render a page, compute a graph, etc.). The simplest rule that I follow: if it involves an interaction outside your process or the consumption of a precious resource (e.g. memory, files, sockets), there should be logs generated to track it. Of course, as I discussed last week, Flume can make this more scalable and provide you with a framework to...
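To make the rule concrete, here is a minimal sketch of the kind of instrumentation I mean, in Python for brevity. The decorator name, metric fields, and operation names are all illustrative, not any particular library's API; the point is that every call gets a timing and a status logged, whether it succeeds or fails.

```python
import json
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("metrics")

def instrumented(name):
    """Log a timing metric for every call, success or failure."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Structured record: operation, elapsed ms, outcome.
                log.info(json.dumps({
                    "op": name,
                    "ms": round((time.monotonic() - start) * 1000, 2),
                    "status": status,
                }))
        return wrapper
    return decorator

@instrumented("render_page")
def render_page(user_id):
    # Stand-in for real work (templating, DB calls, etc.).
    return f"<html>page for {user_id}</html>"
```

Because the metric is emitted in the `finally` block, even requests that blow up leave a record behind, which is exactly when you need one.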

Lesson 2 - Analyze Your Logs

Generating rich telemetry with all of your instrumentation is of limited use if you don't actually mine it. What exactly are you looking for, though? Logs can be voluminous and a bit overwhelming. Again, a simple answer: outliers. Most of us hope for systems that behave in a uniform manner, and in fact most do as the number of users and transactions grows into the millions and beyond. Unfortunately, outliers will exist, but they are often leading indicators of problems and are therefore worth identifying and understanding.
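Even a crude outlier check goes a long way. Below is a sketch, with made-up latency numbers, that flags samples more than a couple of standard deviations from the mean; a real pipeline would use something more robust (medians, percentiles), but the idea is the same.

```python
from statistics import mean, stdev

def outliers(samples, sigma=2.0):
    """Return samples more than `sigma` standard deviations from the mean.
    Simplistic on purpose: a large outlier inflates the stdev, so robust
    statistics (median/MAD) are better for production use."""
    mu, sd = mean(samples), stdev(samples)
    if sd == 0:
        return []
    return [x for x in samples if abs(x - mu) > sigma * sd]

# Hypothetical request latencies in ms: one request is clearly pathological.
latencies = [12, 14, 11, 13, 12, 15, 13, 12, 240, 11]
```

Running `outliers(latencies)` surfaces the 240 ms request, which is the one worth digging into.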

Never underestimate the power of simple graphs either. Consider the graph below, which charts response time (application or database) per shard. The blue and green shards are reasonably close in response time, while the red shard is clearly responding slowly at times, which is a cause for concern. The gold shard, though, is struggling badly. There are many ways this could be determined through analysis, but simply plotting it makes it immediately visible. Graphing obvious metrics can be incredibly insightful, often giving clues as to what other data might be worth analyzing.
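Getting from raw log records to that kind of per-shard chart is a one-step aggregation. A sketch, with invented shard names and timings, that produces a series ready to hand to any plotting library:

```python
from collections import defaultdict
from statistics import mean

def response_time_by_shard(records):
    """Group (shard, response_ms) pairs and average per shard.
    The result maps shard name -> mean response time, ready to plot."""
    by_shard = defaultdict(list)
    for shard, ms in records:
        by_shard[shard].append(ms)
    return {shard: round(mean(v), 1) for shard, v in by_shard.items()}

# Hypothetical samples mirroring the chart described above.
records = [("blue", 12), ("green", 14), ("red", 55), ("gold", 480),
           ("blue", 13), ("green", 11), ("red", 120), ("gold", 610)]
```

Even printed as a table, the gold shard jumps out; plotted over time, it jumps out faster.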

Lesson 3 - Partitioned Availability is Tricky

Partitioning clearly helps performance but also offers the opportunity to partition your availability. If a partition goes down, it takes those users down but other users can stay up. The theory is sound but actually implementing it is significantly harder. You have to build your components to correctly handle shards coming and going. You have to understand your dependencies completely because one wayward cross shard dependency can render your entire plan useless. Most importantly, you have to test your failed shard availability regularly, potentially before each deployment. The ease with which an unexpected dependency can slip in is surprising.
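Testing that property can itself be automated. Here is a toy sketch (the router and its modulo placement are entirely hypothetical) of the kind of check worth running before each deployment: fail one shard and verify that only that shard's users are affected.

```python
class ShardRouter:
    """Toy router: users map to shards deterministically; requests to a
    downed shard fail, and everyone else must keep being served."""
    def __init__(self, shards):
        self.shards = list(shards)
        self.down = set()

    def shard_for(self, user_id):
        return self.shards[user_id % len(self.shards)]

    def handle(self, user_id):
        shard = self.shard_for(user_id)
        if shard in self.down:
            raise RuntimeError(f"shard {shard} unavailable")
        return f"served {user_id} from {shard}"

def survivors(router, user_ids, failed_shard):
    """Simulate one shard failing; return the users still served."""
    router.down = {failed_shard}
    ok = []
    for u in user_ids:
        try:
            router.handle(u)
            ok.append(u)
        except RuntimeError:
            pass
    router.down = set()
    return ok
```

If a hidden cross-shard dependency sneaks in, a check like this catches it as a smaller-than-expected survivor list rather than as a production outage.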

Lesson 4 - Perfect Storms Will Happen

Instrument your code, analyze the metrics, design your dependencies carefully, and test your system thoroughly, but none of it will ensure constant success. Turning on your servers, pointing them at the Internet, and encouraging people to use your product is an unforgiving endeavor. It will throw unexpected traffic, non-uniform distributions, and freak artifacts at you regularly, just to remind you who you serve. We have to continue to learn from our own and others' disasters, improve and refine our designs and implementations, and realize that, like all "disasters" where no lives are lost, they make for great stories after enough time has passed to forget the stress and embarrassment of the moment.

Sunday, October 03, 2010

I've been looking at Flume in depth and it is a very powerful and useful platform that solves several operational problems at once. And the best part is that it is surprisingly simple to use and understand.

Application logs are a source of stress and contention in companies. The site operations team usually views them as a painful resource to manage: they consume a lot of space, they are rarely accessible to the people who can actually use them (the developers), and they are of nominal use to operations. Most companies wind up building tools and processes to gather them off application servers, push them to some repository, and try to keep the lifecycle under control so they don't consume limitless disk space.

Developers are equally frustrated because the logs are often not where they can view them, they have limited ability to intelligently manage the lifecycle, and they have limited tools for processing their logs (if you want something more than grep and Perl).

Flume addresses the frustration of both groups by providing a simple yet powerful framework for pushing logs from application servers to repositories through a highly configurable agent. The flexibility of Flume allows it to scale from environments with as few as 5 machines to environments with thousands of machines.

Before I go into how we will be using Flume at Rearden, I want to cover another useful aspect of Flume. Flume defines logs as a series of events. A Flume event will look familiar to Java developers, as it consists of a priority and a string. The Flume event priority maps very easily to the Java log level, so events are an obvious candidate for collecting Java logs. Flume events, though, also include the concept of a host, which Java logs don't directly represent, although it could be added through the use of a custom formatter. Flume events can also contain fields: arbitrary key/value pairs that hold structured context associated with the event. This is where Flume events provide capabilities that go beyond the standard Java logging framework (although something similar is part of slf4j's extensions).
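The shape of an event as described above can be sketched in a few lines. This is an illustrative model only, written in Python for compactness; the attribute names are mine, not Flume's actual Java API.

```python
from dataclasses import dataclass, field
from enum import Enum
import socket
import time

class Priority(Enum):
    """Rough analogue of Java log levels, per the mapping noted above."""
    DEBUG = 10
    INFO = 20
    WARN = 30
    ERROR = 40

@dataclass
class FlumeEvent:
    """Illustrative shape of a Flume event: priority, body string, host,
    timestamp, and arbitrary key/value fields for structured context."""
    priority: Priority
    body: str
    host: str = field(default_factory=socket.gethostname)
    timestamp: float = field(default_factory=time.time)
    fields: dict = field(default_factory=dict)
```

The `fields` dict is the interesting part: it is where structured context rides along with the event instead of being flattened into the body string.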

One frustration I've always had with Java logging (and most language logging) is that I often find myself formatting metadata into a string for logging and then later using regular expressions to get it back out of the string into metadata. Beyond being an inefficient use of resources (including my limited skills with regex), it's error prone and frankly feels pointless. Wouldn't it be so much easier if I could send the metadata as structured content and have it saved in a system that makes efficient processing of such data simple? Like, for example, Bigtable. This is precisely what Flume offers.
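The format-then-regex round trip versus the structured alternative fits in a few lines. A sketch (JSON stands in here for whatever wire format the pipeline uses):

```python
import json

def encode_event(message, **metadata):
    """Serialize the event with its metadata kept as real fields,
    instead of formatting everything into one opaque string."""
    return json.dumps({"msg": message, **metadata})

def decode_event(line):
    """Recover the structured metadata without any regular expressions."""
    return json.loads(line)
```

No regex, no guessing at delimiters, and the consumer gets back exactly the types and keys the producer wrote.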

How does Flume accomplish this? Flume has two basic concepts. The Flume master acts as a reliable configuration service that nodes use to retrieve their configuration. The master will dynamically update a node if the configuration for the node is changed on the master. A Flume node is simply an event pipe: it reads from a source and writes to a sink. The behavior of the source and sink determines the role and characteristics of the node. Flume is delivered with many source and sink options, but if none serve your needs, you can write your own. A Flume node may also be configured with a sink decorator that can annotate and transform events as they pass through. With these basic primitives, a variety of topologies can be constructed to collect data on an application server and route it to a log repository.
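The source/decorator/sink pipe is simple enough to model in a few lines. The classes below are an illustrative analogue of the node model, not Flume's real API; they exist only to show how the three primitives compose.

```python
class ListSource:
    """Toy source: yields events from an in-memory list."""
    def __init__(self, events):
        self.events = events
    def read(self):
        yield from self.events

class ListSink:
    """Toy sink: collects events into a list."""
    def __init__(self):
        self.received = []
    def write(self, event):
        self.received.append(event)

def annotate(event):
    """Toy sink decorator: tag each event as it passes through the node."""
    return {**event, "node": "agent-1"}

def run_node(source, sink, decorator=None):
    """A node is just a pipe: read from the source, optionally decorate,
    write to the sink."""
    for event in source.read():
        sink.write(decorator(event) if decorator else event)
```

Swap the toy source for a tailed log file and the toy sink for a network forwarder and you have, conceptually, an agent; point the sink at HDFS and you have a collector.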

Flume provides two patterns for nodes that most organizations will use. The first is the agent. Agents are endpoints that applications send events to. An agent most often runs on the same server as the application. This creates a simple availability model, as the application and agent will have approximately the same availability (assuming the Flume agent itself doesn't crash more often than the application). Agents send their events to collectors, the second pattern for a Flume node. Collectors often deliver events directly to the log repository, although larger installations may have tiers of collectors for scale and routing reasons.

At Rearden, we are following a simple Flume architecture as illustrated below. Each application server will run an agent. The agents will send events to a pool of collectors, which push the events into Hadoop HDFS. The agents and collectors are deployed in our production environments, while HDFS is deployed in our QA environment. Developers are able to access HDFS directly in the QA environment. We have developed a log4j adapter that sends all Java logs directly to the agent. Additionally, a simplified Java Flume API is exposed that wraps the Flume Thrift API and manages configuration and common metadata (timestamps, host name, etc.) consistently. Applications may leverage both interfaces to generate Flume events.

One challenge that companies face when passing logs from production to development environments is that sensitive information may inappropriately wind up in a log, leaking it out of production. This is where a sink decorator on the collector can be employed. Our collector uses a decorator that looks for suspect patterns and replaces them with an HMAC of the original string. The events are also logged in a secure production repository for review by information security, allowing tickets to be filed against application code that is leaking sensitive data.
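The core of such a redacting decorator is small. A sketch, with a hypothetical key and a deliberately simple pattern (a bare 16-digit run standing in for a card number); the real decorator's patterns and key management are obviously more involved.

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration; a real deployment would pull this
# from a secret store and rotate it.
SECRET = b"rotate-me"

# Deliberately naive suspect pattern: a bare 16-digit run.
CARD_RE = re.compile(r"\b\d{16}\b")

def redact(body):
    """Replace suspect substrings with a truncated HMAC of the original.
    The value becomes unreadable outside production, but stays
    deterministic, so security can still correlate repeated leaks."""
    def replace(match):
        digest = hmac.new(SECRET, match.group().encode(), hashlib.sha256)
        return "HMAC:" + digest.hexdigest()[:16]
    return CARD_RE.sub(replace, body)
```

Using an HMAC rather than plain masking is the key design choice: the same leaked value always maps to the same token, so information security can count occurrences and trace them to the offending code path without ever seeing the secret itself.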

Space on the HDFS cluster is managed by developers. The lifecycle of logs can be managed any way teams feel is appropriate. The tension over log space is greatly reduced because engineers are now in control of what to keep and how to manage the space appropriately.

Storing logs in HDFS opens up a richer toolset for analyzing and reporting on them. Obviously engineers can write their own map/reduce jobs, but Hive and Pig can also be used directly. What might take hours of development in Perl can be expressed in a single Pig query. The hope is that this will allow developers to uncover patterns in application behavior that were previously difficult to find, leading to better application quality.

I have dramatically simplified all of the power of Flume in this article. If Flume sounds intriguing to you (and it should) I encourage you to spend some quality time with the Flume User Guide.