
Architect Summit 2008 - Summary
Organized by Randy Shoup, Nati Shalom, Alexis Richardson & John Davies
Background
Towards the end of October, while bank employees were lining up on the window ledges, a distinguished group of architects, CTOs and techies from today's (and a few of last year's) major industries met in a boardroom at eBay's head office in San Jose, California. The goal was simply to discuss today's technologies and architecture.
The idea for the meeting came from a similar meeting held in London in December 2005. It's fair to say that the 2005 meeting was extremely influential: it changed and fortified many of the products represented by the attending CTOs and architects. Bruce Eckel hosts a similar meeting in the stunning surroundings of Colorado, and many of today's "thought leaders" are inspired by these events. This time, out of fairness to Americans previously suffering with a very low dollar, we decided to hold the event in California, and eBay came to the rescue.
As with the previous meeting, the goal was really to get an influential group of architects together and just chat through ideas. In 2005, invites were sent to CTOs, architects and tech leaders who the organizers thought could influence their own products or companies. Mixed in were a few very respected architects from larger industries; this being London, these were of course mainly investment banks. It was because of this that Dresdner (DrKW) Bank, now fully owned by Commerzbank, offered to host the 2005 meeting, and we were very fortunate to have an excellent presentation from the CIO, JP Rangaswami, kindly posted here.
The 2008 meeting had exactly the same goal as our 2005 meeting. To quote Alexis Richardson: "The aim is to have an interesting few days talking about fun stuff and learning from it by hanging out with cool people. It's basically a technology sleep-over party". Being in the San Francisco Bay area, we took a slightly less banking-centric view and invited stronger representation from the local industries. To keep it even we invited several investment banking architects, but the timing was bad, so many decided to stay put and pack their desks into small boxes. By the first day, though, we were very well represented by consumers and vendors from internet companies, telcos, ISVs, and a sprinkling from the financial sector.
eBay stepped in to host the event, brilliantly organized by Randy Shoup. The date was set, flights were booked and we started to organize the content.
The format of the meeting was unlike a regular conference: there were no presentations, just long "brain-storming sessions" that were lightly moderated by a session moderator. Each session lasted roughly two hours and sparked a discussion where each of the participants gave their view on some of the points. In this way the session was self-balanced to cover the points of interest of the people in the room. It also created a unique atmosphere where everyone took part in each session, obviously some more than others. There were lots of side talks and very interesting discussions during the breaks and evenings. Interestingly enough, most people reported that they found it to be one of the most productive sessions they took part in and they learned much more than in any of the conferences they'd participated in.
This is the first session; the others will be included in subsequent parts.
Partitioning...
Partitioning is the first-order task in a distributed system: how to divide the overall problem into smaller chunks in order to distribute those chunks over multiple machines. When we say partitioning, we mean simply dividing the problem in some way -- by data, by processing, etc. Motivations for partitioning can include better performance, scalability, availability, manageability, and latency. An eBay architectural mantra is "if you can split it, you can scale it."
Functional partitioning divides the problem by entity or by processing system ("service", in modern parlance). Horizontal partitioning includes read-write partitioning, sharding, and hashing, which all divide the problem by data. Load-balanced systems are the simplest example of horizontal partitioning.
1. Abstractions and Partition Awareness
A horizontally-partitioned system typically provides an abstraction that makes the partitions appear as a single logical unit. eBay and Flickr, for example, both use a proxy layer to route requests by a key to the appropriate partition, and applications are unaware of the details of the partition scheme. There was near-universal agreement, however, that this abstraction cannot insulate the application from the reality that partitioning and distribution are involved. The spectrum of failures within a network is entirely different from failures within a single machine. The application needs to be made aware of latency, distributed failures, etc., so that it has enough information to make the correct context-specific decision about what to do. The fact that the system is distributed leaks through the abstraction.
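To make the idea of key-based routing concrete, here is a minimal sketch in Python. It is not eBay's or Flickr's implementation; the class names, the partition names and the choice of an MD5-based hash are all assumptions made for illustration. The point it tries to show is the one made above: the router hides where a key lives, but a partition failure is still surfaced to the caller, who has to decide what to do about it.

```python
import hashlib


class PartitionUnavailable(Exception):
    """Surfaced to the application so it can make its own decision."""


class PartitionRouter:
    """Hypothetical proxy layer: maps a key to one of N partitions."""

    def __init__(self, partitions):
        self.partitions = list(partitions)  # illustrative partition names
        self.down = set()                   # partitions currently unreachable

    def partition_for(self, key):
        # A stable hash, so the same key always maps to the same partition.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.partitions)
        return self.partitions[index]

    def route(self, key):
        target = self.partition_for(key)
        if target in self.down:
            # The abstraction leaks here on purpose: the caller must know.
            raise PartitionUnavailable(target)
        return target


router = PartitionRouter(["p0", "p1", "p2"])
print(router.route("user:42"))
```

An application calling `route` would typically catch `PartitionUnavailable` and choose a context-specific fallback, such as serving stale data or failing the request.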
2. Repartitioning
Any horizontally-partitioned system will almost certainly need to be re-partitioned at some point. Re-partitioning on the fly requires some sort of proxy layer, and has a small but unavoidable impact on performance. The benefits are increased isolation, improved availability, and reduced risk. Make the unit of movement as small as possible: smaller partition units yield faster migration and more isolation. One technique is to mark a partition read-only, migrate it, and then mark it read-write again. Another technique is to grow by subdividing partitions.
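The read-only, migrate, read-write technique can be sketched roughly as follows. This is an in-memory toy under stated assumptions: the `Cluster` and `Unit` classes, the host names and the dictionary-based storage are hypothetical, and a real system would also have to handle in-flight writes, failures during the copy, and so on.

```python
class Unit:
    """A small, movable slice of data (the 'unit of movement')."""

    def __init__(self, unit_id, host):
        self.unit_id = unit_id
        self.host = host
        self.read_only = False


class Cluster:
    """Hypothetical proxy layer that knows which host holds each unit."""

    def __init__(self, hosts):
        self.storage = {host: {} for host in hosts}  # host -> unit_id -> rows
        self.units = {}

    def add_unit(self, unit_id, host):
        self.units[unit_id] = Unit(unit_id, host)
        self.storage[host][unit_id] = {}

    def write(self, unit_id, key, value):
        unit = self.units[unit_id]
        if unit.read_only:
            raise RuntimeError(f"unit {unit_id} is read-only while migrating")
        self.storage[unit.host][unit_id][key] = value

    def read(self, unit_id, key):
        unit = self.units[unit_id]
        return self.storage[unit.host][unit_id][key]

    def migrate(self, unit_id, new_host):
        unit = self.units[unit_id]
        if new_host == unit.host:
            return
        unit.read_only = True  # 1. stop writes; reads keep working
        try:
            rows = self.storage[unit.host][unit_id]
            self.storage[new_host][unit_id] = dict(rows)  # 2. copy the data
            old_host, unit.host = unit.host, new_host     # 3. repoint the proxy
            del self.storage[old_host][unit_id]           # 4. drop the old copy
        finally:
            unit.read_only = False  # 5. accept writes again


cluster = Cluster(["host-a", "host-b"])
cluster.add_unit(1, "host-a")
cluster.write(1, "k", "v")
cluster.migrate(1, "host-b")
print(cluster.read(1, "k"))
```

The smaller each unit is, the shorter the window during which writes to it are rejected, which is one way to read the advice about keeping the unit of movement small.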
3. Partitioning and Consistency
See Immediate vs. Eventual Consistency section in the next chapter.
4. Partitioning and Failures
Distributed systems will experience failures, and we should design systems which tolerate rather than avoid failures. It ends up being less important to engineer components for low failure rates (MTBF) than it is to engineer components with rapid recovery rates (MTTR).
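A quick back-of-the-envelope calculation illustrates why recovery time can matter more than failure rate. It uses the standard steady-state approximation, availability = MTBF / (MTBF + MTTR); the two components and their numbers below are invented purely for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component A: fails rarely, but takes 10 hours to recover.
robust_but_slow = availability(mtbf_hours=1000.0, mttr_hours=10.0)

# Hypothetical component B: fails ten times as often, recovers in 6 minutes.
flaky_but_fast = availability(mtbf_hours=100.0, mttr_hours=0.1)

print(f"A: {robust_but_slow:.4%}")
print(f"B: {flaky_but_fast:.4%}")
```

Under these assumed numbers, component B comes out ahead despite failing ten times as often, because its recovery is a hundred times faster.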
Interestingly, the majority of real-world production errors come from operator error and code bugs. Hardware failure is a distant third.
Consistency in a distributed system
It is surprisingly difficult to find a simple and complete definition of consistency in a distributed system. The most useful definition appears to be that consistency is a state when all updates are complete and all nodes in the distributed system have the same state. Consistency becomes an issue when scaling out stateful applications. Intra-partition consistency is relatively straightforward to achieve, while cross-partition consistency guarantees involve trade-offs with availability. Distributed transactions (2PC, 3PC) are the bane of scalability and availability.
Inconsistency is inherent to asynchronous systems. Inconsistency can also be caused by network buffer issues, network partitioning, and network failures.
1. Immediate vs. Eventual Consistency
By the CAP Theorem, no distributed system can simultaneously guarantee all three of consistency, availability, and partition-tolerance. These forces inherently trade off. For example, asynchronous systems compromise consistency, while synchronous systems compromise availability.
However, global immediate consistency is rarely required in real-world systems. Many if not most consistency problems are solved already at the business level, and can be made substantially simpler through some sort of overarching business rule. Eventual consistency is the norm rather than the exception in large-scale systems. Telco billing is made eventually consistent through a monthly billing cycle. Cross-institution financial transactions are made eventually consistent through end-of-day settlement, three-day clearing, reconciliation, and manual resolution of conflicts. In systems with multiple parallel writers, 99% of the time the appropriate conflict resolution strategy is simply last-one-wins.
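As a rough illustration of last-one-wins, the sketch below picks between two conflicting versions of the same record by timestamp. The `Versioned` type and the tie-break on writer name are assumptions of this example, and it presumes the writers' clocks are comparable, which a real deployment would have to ensure or work around.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # assumed to come from reasonably synchronized clocks
    writer: str       # used only to break exact timestamp ties


def last_one_wins(a, b):
    """Return the version with the later timestamp.

    Ties are broken by writer name so that every node resolves the
    same conflict the same way, whatever order it sees the versions in.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.writer))


older = Versioned("draft", timestamp=1.0, writer="node-a")
newer = Versioned("final", timestamp=2.0, writer="node-b")
print(last_one_wins(older, newer).value)
```

The important property is that the function is deterministic and order-independent, so replicas that see the same pair of writes converge on the same value.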
Without distributed transactions, all consistency is probabilistic. A critical attribute is how consistency diverges when the system is under stress.
2. Data Denormalization
Data duplication is common in large-scale partitioned systems. In Flickr, for example, a user's comment on a photo might be written both with the photo and with the user. The writes are made eventually consistent through asynchronous events.
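The pattern can be sketched with an in-process queue standing in for the asynchronous event channel. This is not Flickr's design; the `CommentStore` class, its fields and the `drain` step are hypothetical names chosen only to show one copy being written immediately and the other catching up later.

```python
from collections import deque


class CommentStore:
    """Toy model: a comment is stored with the photo and with the user."""

    def __init__(self):
        self.by_photo = {}      # photo id -> list of (user, text)
        self.by_user = {}       # user -> list of (photo id, text)
        self.pending = deque()  # stands in for an asynchronous event queue

    def add_comment(self, user, photo, text):
        # First copy is written synchronously, next to the photo.
        self.by_photo.setdefault(photo, []).append((user, text))
        # Second copy is deferred: an event is queued for the user's partition.
        self.pending.append((user, photo, text))

    def drain(self):
        # In a real system this would run in a background consumer.
        while self.pending:
            user, photo, text = self.pending.popleft()
            self.by_user.setdefault(user, []).append((photo, text))


store = CommentStore()
store.add_comment("cal", "photo-1", "nice shot")
print(store.by_user)  # still empty: the second copy has not caught up yet
store.drain()
print(store.by_user)
```

Between `add_comment` and `drain` the two copies disagree, which is exactly the window of inconsistency that eventual consistency accepts.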
3. Geographic Distribution
As a practical matter, it is best to try to avoid the need for immediate consistency across a geographically dispersed network. Prefer local consistency, with asynchronous propagation as needed to other geographies.
A couple of interesting points discovered during this session:
Apple's MacBooks outnumbered PCs almost 3 to 1.
Flickr deploys new code about 10 times per day. Cal: "Our last deployment was 8 minutes ago."
Day 2 Summary
Data Management discussion
Day two of the meeting started with a discussion on data management patterns. We started the discussion with an attempt to answer what brings people to look into distributed data management in the first place. We laid out three concrete options:
-Performance
-Latency
-Scalability
We started the discussion with the thought that all points are equally important, but we quickly agreed that in reality the cost of addressing all of them at the same time may not always be a good fit for everyone. For example, Memcached is a good example of a solution that was designed to address performance and scalability but completely ignores the reliability and consistency aspects. On the other hand, all the vendors in the room (GigaSpaces, Oracle, Terracotta) spent a lot of effort ensuring the reliability and the consistency of the memory-based solution. The adoption of both types of solutions indicates that people are willing to trade some of the functionality for cost or for other reasons, and all-or-nothing reasoning doesn't seem to apply here.
John Davies raised another driver for moving to distributed data management: availability/reliability. The fact that your data is not stored in a centralized location means that the chances of total failure are smaller. This sparked a small debate on whether reliability is indeed one of the driving forces or just a feature. During this debate it became clear that most people have a slightly different definition in mind when they use the term reliability. John Purdy suggested a definition for reliability that was quickly accepted by everyone in the room:
Availability/Reliability can be broken down into the following properties:
* Durability
* Consistency
* Availability and
* MTBF
After we agreed on the baseline of what brings people to look into distributed data management in the first place and a basic definition for availability, the discussion moved to cover other aspects of distributed data management.
What is the impact of distributed data management on latency?
* Affinity – in many cases the number of hops required to access the data has a strong impact on latency. Ensuring that the business logic is close to where the data is can reduce that overhead significantly. That statement holds true for any form of data management, distributed or centralized. It becomes more interesting when dealing with distributed management scenarios, since the data can be spread over the network and therefore you may have different latency for different sets of data. A common option to avoid this overhead is to execute the business logic where the data is. A service-oriented architecture can provide a good example of co-locating logic and data, where a service is responsible for both its data and the operations performed on it. An algorithmic trading application is another example – each node is responsible for very fast processing on a subset of the overall data.
* Latency under load – contention on the same lock or a large number of concurrent users increases the time it takes for the data server to serve these requests, and impacts the overall latency. Distributed data management can smooth out the impact of these two factors by spreading the load and the contention across multiple data partitions.
* Latency vs. consistency tradeoff – guaranteeing consistent, ordered operations requires serialization, which increases latency. There is an explicit tradeoff to be made here – one can improve latency, but at the cost of relaxing consistency.
* Latency and multiple replicas – for read operations, reading data in parallel from multiple replicas can improve latency, because you take advantage of the faster responders. (If you need to read from only k replicas in order to be sure you are reading the correct value, the overall latency is the latency of the k'th fastest, not the overall slowest.)
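The point about the k'th fastest replica can be checked with a few lines of arithmetic. The sketch assumes all replicas are queried in parallel and that the read completes as soon as k responses have arrived; the function name and the latency figures are made up for the example.

```python
def quorum_read_latency(replica_latencies_ms, k):
    """Latency of a parallel read that needs k responses to complete."""
    if not 1 <= k <= len(replica_latencies_ms):
        raise ValueError("k must be between 1 and the number of replicas")
    # The read finishes when the k-th fastest replica has answered.
    return sorted(replica_latencies_ms)[k - 1]


# Hypothetical response times, in milliseconds, for five replicas.
latencies = [12.0, 48.0, 7.0, 95.0, 20.0]

print(quorum_read_latency(latencies, k=1))  # fastest single responder
print(quorum_read_latency(latencies, k=3))  # majority of five
print(quorum_read_latency(latencies, k=5))  # must wait for the slowest
```

With these figures a majority read waits for the third-fastest replica and is unaffected by the 95 ms straggler, which only matters if all five answers are required.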
Is data distribution a leaky abstraction?
There have been many attempts to make the transition from a centralized data model to a distributed data model seamless through abstraction. There is a good chance that things won't work as expected with this level of abstraction – for example, join queries work very differently if all the data is placed in one centralized location or if it is spread across distributed data partitions. The same thing applies for any aggregated function such as SUM, AND, or any blocking operations.
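One way to see why aggregates behave differently once data is partitioned is to compute an average across partitions. This is an illustrative sketch only: the partition contents are invented, and average is used here as an example of an aggregate that cannot simply be computed per partition and then combined the same way.

```python
def partial_sum_count(values):
    """Run on each partition: return the pieces needed for a global average."""
    return (sum(values), len(values))


def global_average(partials):
    """Run on the caller: combine per-partition (sum, count) pairs."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("no rows in any partition")
    return total / count


# Hypothetical data spread unevenly over three partitions.
partitions = [[10, 20], [30], [40, 50, 60]]
partials = [partial_sum_count(p) for p in partitions]

correct = global_average(partials)
# The tempting shortcut: average the per-partition averages.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(correct, naive)
```

The two results differ because the partitions hold different numbers of rows, so code that was correct against a single database can silently give wrong answers when the same logic is applied per partition.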
Hiding these details from the application can lead to bad design, which in many cases will only be discovered at later stages of the application development. On the other hand, forcing explicit change on the application can also be a painful process. So the question is, at which level can abstraction be useful, and at what point does it become "leaky"? I suggested that the best measure is the chance of failure. If in 80% of the cases chances are that users will choose the wrong option, this is a clear indication of a leaky abstraction. There are various ways to deal with that:
1. Avoid any abstraction and force explicit change in the application code.
2. Provide an abstraction, but throw a warning when there is a chance for bad use.
3. Provide an abstraction, but also explicit semantics (through annotations or special query semantics) for dealing with distributed data management, such as affinity semantics, semantics for parallel execution and map/reduce semantics.
In a follow-up discussion with John P. and Shay B. it seemed that adding such semantics (affinity, parallel, map/reduce) to JPA could serve as a good starting point for such an abstraction for Java developers.
What will be the database's role in future architecture?
Databases are largely valuable in providing logical-physical abstractions, and are probably not going to disappear from our world any time soon. However, at the same time it is clear that databases are not going to serve as a general-purpose solution for all data requirements. Moving to distributed data management is becoming more common, as the value of data grows on the one hand, and the need for scaling and faster performance for processing the data emerges on the other hand. Currently there are three main approaches to this challenge:
Distributed database (similar to Oracle RAC) – with this option the database is broken into partitions, and a single driver is provided for interacting with these partitions.
Distributed caching – in this case the database is kept centralized, and a large part of the read load is offloaded by putting the data in a read-mostly cache.
In-memory data grids – in this case a data grid is used as the system of record instead of the database. The database is used as a background service which maintains the data in durable storage and serves as a read-only access point.
All options require a change – a common fallacy is that if you're already using a database, choosing database partitioning is the most natural next step because it is "seamless". If you choose to change a database from a centralized model to a distributed model, you will most likely need to re-design your schema domain model. You will need to change all applications accordingly to conform to this new schema. Since in most cases the database is used as a central piece that serves lots of applications (online, reporting, legacy), this change is going to have a large impact.
On the other hand, caching seems to be a simpler tactical change that helps to optimize the existing system and allows you to apply the change only to the application that needs it most. In this case, the change is significantly smaller, because you only need to change the read queries that hit the database most.
A data grid goes a step further and provides the ability to completely decouple the database while at the same time keeping the database in sync with all changes made to the data grid. This solution is good for read/write scenarios and for cases where most of the frequently-used data can reside in memory.
Every decision between these options has trade-offs and costs associated with it. In reality you'll rarely go through a process of examining all the options and the cost associated with each one of them. For example, choosing MySQL over Oracle will not always be cheaper if you already have Oracle in place, trained DBAs, etc. The license cost is only part of TCO and not always the most significant one. The same applies to Memcached or any other open source solution. Unfortunately, people rarely apply these cost measurements in their decision-making process (especially when dealing with free-license products). Applying these measures makes it much more likely that you will discover, sooner rather than later, that what seems to be a natural solution might not actually be the best suited to your needs, compared to the alternatives.

There was one thing I didn't put down here but we all agreed on, that we're very bad at writing these things up. The one we had in London was never really written up and this one took us months. It's interesting how busy everyone suddenly gets after the event.
There were a few who took incredibly good notes during the event and it's thanks to those people that we were able to put together this summary.
-John Davies-

Quite so, and that single letter makes a big difference in the meaning ;-). This typo was mine originally, and I take full responsibility. What we meant to assert was:
"asynchronous systems compromise consistency, while synchronous systems compromise availability."
Thanks,
-- Randy

One more option is to have multiple horizontal partitions with each partition a separate RDBMS instance. Unlike Oracle RAC, though, the application must not have any transactions going across the partition border. Information retrieval for reporting purposes can run across partitions. This solution is optimized for write performance and availability (assuming every partition has a hot backup), while having a lower SLA for reporting queries.

Hi, John --
Thanks for your excellent summary of this event, and thanks for your kind words about its organization. eBay was happy to host, and helping to get together this distinguished group was a particular pleasure for me. For me, it was the highlight of 2008.
Take care,
-- Randy
Randy Shoup
Distinguished Architect
eBay

Hi John,
Sounds like it was quite an interesting event. Especially with so many smart folks brainstorming with likely strong opinions.
Very good summary.
I am wondering if the following 2 issues were discussed:
1) Dealing with hotspots when partitioning with colocation is in use - i.e. you colocate related data into partitions and move behavior (or expensive queries for that matter) to data nodes. Works very well when the access pattern is uniformly distributed. But, in real world situations, you are faced with the 80-20 rule - 80% of all requests are after 20% of the data. Quite possibly, with custom application-controlled partitioning, all requests converge on just a handful of nodes and capacity gets underutilized. All of a sudden the promise of linear scalability with scale-out architectures is challenged. There are a number of ways to mitigate the challenges, but I am curious if this was discussed and if anyone offered up interesting ideas.
2) GC challenges with large heaps - With extensive coverage on caching and in-memory data grids, was this a topic of discussion? anything you can share? folks shared their experiences with the new G1 collector?
regards,
Jags Ramnarayan
GemStone Systems

..I am wondering if the following 2 issues were discussed:
1) Dealing with hotspots when partitioning with collocation is in use
2) GC challenges with large heaps

Jags - I believe that the answer to your question is that those topics were discussed at some level.
Later on some of those discussions inspired posts by different individuals that covered more specific areas, such as the ones I pointed out from Brian and Randy. I posted something that I believe covers some of your questions about GC impact on latency more specifically: Latency is Everywhere and it Costs You Sales - How to Crush it - My Take. In this post I cover the impact of GC on latency and how to reduce that impact.
To your specific question: you can distribute logic to where the data is to reduce latency, while at the same time making sure that the logic will be triggered only when needed and not continuously, to avoid the potential overhead that you mentioned.
Nati S.

nati,
nice article. Especially like the notion of "global vs local optimization". I sort of see this as a fundamental mental shift when thinking about scalable systems where, rather than partitioning your interdependent services physically (running on different hosts), you partition the data model horizontally and move behavior and messaging between closely coupled services to be local operations.
But, coming back to my question, when you skin the cat a bit more, you realize there are several challenges with partitioning.
(1) there are times when applications converge on a small cross section of the data resulting in "hotspots". The problem can get severely exacerbated when the colocated behavior is compute intensive. If you dynamically rebalance (move data hence behavior), you could be dead wrong. Remember - past is no guarantee for what the future entails.
2) When the app access is uniform across all partitions (say many concurrent writers that are uniformly load balanced across all partitions) then all it takes is one partition to get into a full GC cycle and in no time you will have everyone converging to this partition and get stuck. Meanwhile all the healthy partitions have their CPU idle. Finally, when the darn fragments are all coalesced by the so called "low pause collector" (CMS - the experimental sorta low pause collector :-)) the next partition can potentially get into this problem. Of course, there are several techniques to mitigate this and each of our products might have intelligent options, coding practices, etc, but, I was wondering if others brought this up.
On the same line, I was trying to understand if folks shared any experiences with the new G1 collector and techniques used in practice to mitigate the GC pause issues.
quite appreciate you pointing me to this informative article.
regards,
Jags Ramnarayan
GemStone Systems
