"A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away." ~Antoine de Saint-Exupery

Note: the opinions stated here are mine alone and are not those of any past, present, or future employer.

Saturday, October 22, 2011

During one of my first few weeks at eBay I got involved in a conversation about mark down logic. Now, I had only been at the company for a short while and I was working for an e-commerce company, so I assumed mark down logic must be some business rules about price discounts. Certainly seemed like a reasonable line of thought. As it turns out, I was completely wrong and as a result got introduced to a concept that is critical to the ability of any site to achieve high levels of availability.

The term came into existence inside eBay because the DBAs wanted a mechanism to tell the applications that the database was down, regardless of the true state of the database. They wanted to mark the database state as down. The original motivation was to deal with challenges with the database listener. With hundreds of application servers (at that time) all waiting for the database to come back up, the moment the listener was turned on, a connection storm would hit and often cause the database to go down again. Rather than requiring an involved set of start-up procedures that would effectively be a total site reboot, the desire was to be able to control the rate of application connections.

There are actually two concepts that have to be considered here:

The ability to mark an external resource state as up or down and have the application honor that state.

The ability of the application to behave in a defined way when the resource is down and to return to the proper behavior when the resource returns, without being restarted.

The first one is relatively straightforward as long as your application manages connections through an abstraction, such as a connection pool. Attempts to get a connection receive a "marked down" exception, which the second concept then needs to handle. It is clearly more challenging if you haven't created any abstraction for dealing with external resources. For example, if REST services are called in a variety of ways with no common path to establish the HTTP connection, then supporting mark down becomes more problematic. There are many other reasons why you would want to provide a common HTTP connection path anyway (e.g., managing timeouts, retries, consistency of configuration, etc.).
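As a minimal sketch of that abstraction (the class and method names here are illustrative, not eBay's actual implementation), a connection pool can honor the marked-down state by failing fast instead of attempting a doomed connection:

```python
# Sketch of a mark-down-aware connection pool. All names are invented
# for illustration; real pools add sizing, health checks, etc.

class MarkedDownError(Exception):
    """Raised when a caller asks for a connection to a marked-down resource."""


class ConnectionPool:
    def __init__(self, factory):
        self._factory = factory      # callable that creates a real connection
        self._marked_down = False

    def mark_down(self):
        self._marked_down = True

    def mark_up(self):
        self._marked_down = False

    def get_connection(self):
        if self._marked_down:
            # Fail fast: don't even attempt to reach the resource,
            # which is what prevents the connection storm on restart.
            raise MarkedDownError("resource is marked down")
        return self._factory()
```

Because the state lives in the pool rather than in the resource itself, operators can flip it independently of whether the database is actually up.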

There are nuances to supporting mark down, however. The first is deciding how you will change the state. For a small deployment, something as simple as an HTTP POST to an administrative listener could be used. I've also seen a configuration file with a watch for modifications. Tools like Puppet can then be used to push out state changes. This works well for small deployments. Larger deployments would benefit from configuration service tools like ZooKeeper.
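The configuration-file-with-a-watch approach might look something like this sketch, assuming a JSON state file that an operator (or a tool like Puppet) rewrites; the file name and format are my invention:

```python
import json
import os

# Sketch: poll a state file's modification time and reload resource
# states when it changes. The application consults is_down() before
# using a resource.

class ResourceStateFile:
    def __init__(self, path):
        self._path = path
        self._mtime = None
        self._states = {}

    def refresh(self):
        mtime = os.path.getmtime(self._path)
        if mtime != self._mtime:              # reload only on modification
            self._mtime = mtime
            with open(self._path) as f:
                self._states = json.load(f)   # e.g. {"orders_db": "down"}

    def is_down(self, resource):
        return self._states.get(resource) == "down"
```

A configuration service like ZooKeeper replaces the polling with a push notification, but the contract the application sees is the same.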

The second concept is much more involved. The challenge faced here is that applications need to behave in a predictable way when a resource becomes unavailable. I chose predictable here because the actual behavior is going to vary considerably with the application logic and the down resource. While simply returning HTTP status 500 may be predictable, that's not what I mean and is usually not sufficient to be considered robust.

One of the most challenging but important considerations is what the application can do without the resource that is down. The simple-minded approach is to state "nothing" and return the equivalent of service temporarily unavailable. This may in fact be the only option depending upon the scenario. A more robust approach, however, is to design applications to make as much forward progress as they can with the resource missing. Design the application with resilience to missing resources. Think about what it could do if you took the resource away. What functionality could still be provided?
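As a hedged sketch of that forward-progress idea (the service and function names are invented for illustration), a page can degrade by omitting an optional section rather than failing the whole request:

```python
# Sketch of graceful degradation: the recommendations service is marked
# down, so the page renders without that section instead of returning
# "service temporarily unavailable" for everything.

def fetch_recommendations(user_id):
    # Stands in for a call through a marked-down connection abstraction.
    raise ConnectionError("recommendation service marked down")

def render_product_page(product, user_id):
    page = {"product": product}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except ConnectionError:
        # Forward progress: the core content still ships.
        page["recommendations"] = []
    return page
```

The key design decision is classifying each dependency as essential or optional up front, so the fallback is a deliberate behavior rather than an accident.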

Equally challenging is managing state that might get confused if the resource becomes unavailable. When exceptions start coming back from database connections or REST services, the internal state of the application could become corrupt. The result is that even though the resource has returned, the application is unable to use it correctly and ultimately has to be restarted itself.

This brings me to another important point. The only way to make sure that your application can behave predictably and recover correctly from resources going down is to test it. Testing mark down needs to be a standard part of the application regressions. Netflix has taken this to the ultimate state by creating a Chaos Monkey. They turn it loose in production with the sole purpose of randomly killing things and making sure their applications can survive. I'm a fan!
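A mark-down regression can be as simple as this sketch (all names illustrative): assert the defined degraded behavior while the resource is down, then assert recovery without a restart when it returns:

```python
# Sketch of a mark-down regression test covering both halves of the
# contract: predictable behavior while down, recovery without restart.

class Resource:
    def __init__(self):
        self.down = False

    def query(self):
        if self.down:
            raise ConnectionError("marked down")
        return "ok"

def handle_request(resource):
    try:
        return resource.query()
    except ConnectionError:
        return "degraded"          # the defined, predictable behavior

def test_mark_down_and_recovery():
    resource = Resource()
    assert handle_request(resource) == "ok"
    resource.down = True                      # simulate the mark down
    assert handle_request(resource) == "degraded"
    resource.down = False                     # resource returns
    assert handle_request(resource) == "ok"   # no restart required
```

The second half is the one teams usually forget to test, and it is exactly the state-corruption failure mode described above.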

Sunday, October 16, 2011

Database storage is expensive. This is especially true if you build a traditional SAN-based M+N cluster. The cost of the storage array, Fibre Channel switches, and Fibre Channel interfaces easily drives the cost per terabyte into the thousands. And while storage costs in general are plummeting, SAN storage costs are falling at a slower rate, widening the gap between SAN and direct attached storage. Given the cost of SAN storage, it would be unfortunate to waste it, which is what we discovered we were doing.

Our platform makes a lot of 3rd party service calls. Several of these are very complex conversational interfaces that generate a lot of text. In order for customer service to troubleshoot customer issues, we retain these API interactions. Storing this 3rd party API request/response text was implemented from the beginning within our platform. At that time, the logical place to save this data was in database CLOBs. When we recently analyzed our SAN storage, we discovered that 40% of it was consumed by these API logs. Clearly there was an opportunity to reduce costs with a lower cost solution.

We looked at alternatives for managing LOBs and we settled on Cassandra.

There were a few concepts we wanted to standardize for the storage and management of large data objects. These were:

Correlate the importance of the data to the business with the cost of storage.

Ensure that life cycle was applied and LOBs weren't kept longer than meaningful.

Ensure that data was as space efficient as possible.

The first was a new concept for us. The cost advantage for using Cassandra comes by using commodity hardware with commodity drives. This hardware can and will fail though. So to ensure data cannot be lost, there must be multiple copies. Redundancy comes at a high cost however. For example, if the cost of storage is $150/TB and you keep 6 copies of the data (3 each in 2 data centers) then your protected cost is $900/TB. Reducing redundancy increases the risk of a loss, but some data can afford to be lost or at least afford to be temporarily unavailable. We wanted to be able to trade off data importance against cost. We defined 3 levels of consistency and corresponding replication values for each.
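The arithmetic above generalizes to any tier: protected cost is raw cost times total copies. The sketch below uses the $150/TB figure from the example; the tier names and the replication values for the two lower levels are my invention, since the post defines three levels but not their settings:

```python
# Sketch of the protected-cost arithmetic. Only the "critical" tier
# (3 copies in each of 2 data centers) comes from the text; the other
# tiers are hypothetical examples of trading importance against cost.

RAW_COST_PER_TB = 150  # dollars per terabyte, from the example above

TIERS = {
    "critical":    {"data_centers": 2, "copies_per_dc": 3},  # 6 copies total
    "important":   {"data_centers": 2, "copies_per_dc": 2},  # 4 copies total
    "recoverable": {"data_centers": 1, "copies_per_dc": 2},  # 2 copies total
}

def protected_cost_per_tb(tier):
    t = TIERS[tier]
    return RAW_COST_PER_TB * t["data_centers"] * t["copies_per_dc"]
```

At these assumed settings, a "recoverable" LOB costs one third of a "critical" one per terabyte, which is exactly the importance-versus-cost trade-off being described.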

We also require that every LOB is provided with an expiration date. That can be set to never, but by providing a simple way to control a meaningful life of data, we increase the likelihood of it being purged when no longer useful. For example, we can retain travel service debugging information for 6 months after the trip is complete. This is trivial for the caller to set and Cassandra will clean up the data automatically after the expiration.
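In the real system Cassandra's native TTL support does the cleanup automatically; this pure-Python sketch just models the contract the caller sees, with an injected clock so expiry is easy to reason about (all names are illustrative):

```python
import time

# Sketch of the LOB expiration contract: every write carries a TTL
# (None means "never"), and expired entries read back as purged.

class LobStore:
    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}   # key -> (value, expires_at or None)

    def put(self, key, value, ttl_seconds=None):
        expires = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and self._clock() >= expires:
            del self._data[key]   # lazily purge on read
            return None
        return value
```

Making TTL a required argument (even when the answer is "never") is what forces callers to think about a meaningful life for the data.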

Another observation was that much of the text wasn't compressed even though it was highly compressible. When we designed the LOB service, the interface included both content type and content transfer encoding. Based on the type and the transfer encoding, we transparently compress and decompress the data. Our Cassandra storage costs are about one third of the SAN's at the highest replication level, while compression reduces the storage needed by 90%, giving us an order of magnitude improvement in storage costs for text LOBs.
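The transparent-compression decision might be sketched like this; the set of compressible content types is an assumption, and the real service also consulted the content transfer encoding to avoid double-compressing:

```python
import gzip

# Sketch: compress on write when the content type says the payload is
# text-like, and record which encoding was applied so reads can reverse it.

COMPRESSIBLE = {"text/plain", "text/xml", "application/json"}

def store_encode(content_type, payload):
    if content_type in COMPRESSIBLE:
        return "gzip", gzip.compress(payload)
    return "identity", payload        # already-compressed media passes through

def store_decode(encoding, payload):
    if encoding == "gzip":
        return gzip.decompress(payload)
    return payload
```

Because the encoding is stored alongside the LOB, callers never see compressed bytes; the 90% reduction falls out of how repetitive API request/response text tends to be.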

Our early usage of the Cassandra based LOB service has produced positive results. Cassandra has proven to be reliable and performant. We have even experienced a hardware failure, during a peak transaction period without impact to the platform. We plan on expanding both the usage of our LOB service as well as Cassandra based on the early results.

Saturday, October 08, 2011

Follow the technology business news and barely a day goes by that somebody doesn't announce a new or refined social commerce or recommendation product. The concept is quite simple. Use my social graph to filter goods, services, and content making it a bit easier to get through the overwhelming volumes of each. The premise is that if my friends like it, then so will I. That premise however seems dubious to me. Even if you limit it to my closest friends, the reality is that the overlap in our taste across a broad variety of content is very small.

For example, a small group of us may really enjoy taquerias. In fact, I have a circle of friends that share notes on taquerias. And if they recommend one, I trust their recommendation implicitly. So far, so good. But now let's move on to music. This same small group has diverse musical taste. Some of it works, but some of it wouldn't make any of my playlists. So even in a small social graph, the ability to leverage it for recommendations can break down quickly.

How do you solve this problem? Clearly there needs to be another element to these filters if they are going to actually provide better answers instead of just different answers. I believe the answer lies in applying semantics. Semantics adds meaning to content and makes it possible to do more accurate and relevant matching.

Semantics focuses on associating concepts with content and defining the relationships between concepts. These relationships become critical because people are often ambiguous or use different terminology when looking for content. I might look for a taqueria but somebody else may simply look for tacos. Semantics allows the concept that a taqueria serves tacos to be established, thereby making it possible for both of us to find appropriate restaurants. The transformation from simple keyword searches (which I like to call "clumps of letters" searches) to concept searches changes the relevance dramatically.
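The taqueria/tacos example can be reduced to a toy sketch of concept expansion (the relation graph and names below are invented purely for illustration):

```python
# Toy sketch: a small "serves" relation lets a query for "tacos" match
# content tagged "taqueria", which a clumps-of-letters search would miss.

SERVES = {
    "taqueria": {"tacos", "burritos"},
    "sushi bar": {"sushi"},
}

def expand(query):
    # A concept matches itself plus any concept related to it.
    matches = {query}
    for concept, related in SERVES.items():
        if query in related:
            matches.add(concept)
    return matches

def search(query, documents):
    concepts = expand(query)
    return [doc for doc, tags in documents.items() if tags & concepts]
```

Real semantic systems use far richer ontologies, but the mechanism is the same: matching happens in concept space, not letter space.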

The difference between semantic and social is most clearly demonstrated by comparing Pandora with Genius or Spotify. Pandora classifies music on over 400 distinct attributes. The music genome is one of the richer semantic systems available for a specific domain. When you select music you like or don't like, Pandora uses the concepts associated with each song to build a profile of what music you like. Genius, on the other hand, compares the songs you like with the songs liked by other people who also like your songs. This works reasonably well if the people involved have a pretty narrow taste in music. But if their taste is broad and goes in a different direction than yours, it is less effective.

It isn't social vs. semantic though. The intersection of the two is incredibly powerful. Going back to my earlier statement about taquerias, if a friend who likes taquerias recommends one, I will absolutely go there. The power is in using semantics to narrow the content I see, then exposing members of my social graph who like the same content. Based on how well I know and trust them, it can boost my confidence in the recommendation. In fact, when the semantic relevance of two pieces of content is equivalent, the social graph can serve as a tie breaker.

The opportunity isn't purely semantic or purely social, but rather to apply semantics to provide contextual relevance and then leverage the social graph to increase confidence in the results.

Saturday, October 01, 2011

If you ask people for their recommended reading on agile, you'll probably get a list of books on agile process, blogs by agile experts, and social media recommendations to follow that will be full of agile coaches. And, I wouldn't discourage you from reading any of them. In fact, I should probably read a few more myself. But, agile is rooted in good leadership. So, for me, the starting point for any agile reading is understanding how to create better leaders.

The most influential book for me is Drive by Daniel Pink. Daniel lays out the three intrinsic motivators for all humans. They are shockingly simple and yet can be incredibly hard to achieve at the same time.

Autonomy - We all want to feel in control of our own destiny. We want to believe we can influence the things we do, that we can make a difference.

Mastery - We all need to feel that we are masters at our craft. That we are respected for our abilities.

Purpose - We want to do whatever it is for a higher meaning. It should be for something bigger than ourselves.

The first two concepts align incredibly well with the agile manifesto. Much of the manifesto is about taking control from the processes and giving it to the individual. Pair programming and code reviews are closely aligned with increasing mastery. The book covers this material and the supporting evidence in great detail. If you don't have time for the book, Dan does have a TED Talk. And if you can't even spare 18 minutes to hear the material, there is also a 9 minute RSA presentation on the material.

D. Michael Abrashoff was commanding officer of the USS Benfold. From this experience he wrote It's Your Ship which is fascinating because he demonstrates how he could create an agile organization within the strict command of the Navy. Think your company has strict process and policies? More strict than the military?

Abrashoff's story is remarkable in both the way he ran his ship and the results they achieved. He put the responsibility of the ship in the hands of his crew. One interesting parallel to agile was his training which involved a primary, secondary, and tertiary person for every single task. Training was done by pairing individuals until there were 3 people available to do any task on the ship. Imagine if every software component had 3 people that were confident in changing or fixing it.

While neither of these books are specifically about agile development, they are about creating an agile culture. In Switch by Chip and Dan Heath, you learn that change requires two things. Both the emotional being and the thinking being must be aligned. They use the analogy of the elephant and the rider. The rider knows where he wants to go but if the elephant isn't in agreement, there will be a wrestling match the rider will ultimately lose. This is equally true for organizations. To move to agile, you have to not only reach the intellect but the emotion. To do that, you have to appeal to the intrinsic motivators and Abrashoff shows you can do that within the structure of any organization.