Wednesday, June 22, 2011

In the end, it is up to us to make things better. We need a way for non-Apache entities to interact with Apache productively. If we can't do that, then it is quite possible that all of the momentum and excitement that Hadoop now has will be lost.

Conclusion

The key is that we now have an eco-system, not just a community. We can make it work. Or we can elect to let it not work. Not working is the default state. We have to take positive action to avoid the default.

Apache can stay a strong voice for business friendly open source by remaining Apache. Trying to make Apache broad enough to include all of the players in Hadoop and Hadoop derivatives will simply debase the voice of Apache into the average of opposing viewpoints, i.e. into nothing.

There are, however, many other players who are not part of Apache and who should probably not be part of Apache. There needs to be a way for these others to engage with the Apache viewpoint. It can't just be on the level of individuals from Apache trying to informally spread the Apache way even though that is critical to have. It is likely to require a venue in which corporate entities can deal with something comparable to themselves. A good analogy is how Mozilla participation in W3C has made the web a better place.

But we can make our eco-system work. It isn’t what it was and it never will be again.

In the third part of my talk, I talked a bit about where Hadoop has come from and where it is going. Importantly, this involves a choice about where Hadoop and the related companies products and individuals might be able to take things.

Where we are and how we got here

My second section described the rough state of the Hadoop eco-system is a slightly provocative way. In particular, I described a time when I was on a British train and in partial compensation for delays the operators announced that "free beer would be on sale in the galley car". Free beer for sale is a wonderful analogy for the recent state of Hadoop and related software.

That said, there are serious problems brewing. The current world of Hadoop is largely based on the assumption that the current community is all that there is. This is a problem, however, because the current (Apache-based) community presumes interaction by individuals with a relatively common agenda. More and more, however, the presence of a fundable business opportunity means that this happy world of individuals building software for the greater good has been invaded by non-human, non-individual corporations. Corporations can't share the same agenda as the individuals involved in Apache and Apache is constitutively unable to allow corporate entities as members.

This means that the current community can no longer be the current world. What we now have is not just a community with shared values but is now an eco-system with different kinds of entities, multiple agendas, direct competition and conflicting goals. The Apache community is one piece of this eco-system.

Our choice of roads

Much as Dante once described his own situation, Hadoop now finds itself in the middle of the road of its life in a dark wood. The members of the Apache community have a large voice in the future of Hadoop and related software.

As a darker option, the community can pretend that the eco-system that now exists of human and corporate participants is really a community. If so, it is likely that the recent problems in moving Hadoop forward will continue and even get worse. Commit wars and factionalization are likely to increase as corporate entities, denied a direct voice in Apache affairs, will tend to gain influence indirectly. Paralysis in development will stall forward progress of Hadoop itself leading to death by a thousand forks. Such a dark world would let alternative frameworks such as Azure to gain footholds and possibly to dominate.

In this brighter alternative future, I think that there are ways to create a larger forum in which corporate voices can be heard in their true form rather than via conflicts of interest. In this scenario, Apache would be stronger because it really can be a strong voice of the open source community. Rather than being the average of conflicting views, Apache would be free to express the shared values of open source developers. Corporations would be able to express their goals, some shared, some not in a more direct form and would not need so much to pull the strings of Apache committers. Importantly, I would hope that Hadoop could become something analogous to a reference implementation and that commercial products derived from Hadoop would have a good way to honor their lineage without finding it difficult to differentiate themselves from the original. Hopefully in this world innovation would be welcomed, but users would be able to get a more predictable experience because they would be able to pick products offering whatever innovation rate/stability trade-off that they desire. Importantly, there would be many winners in such a world since different players would measure success in different terms.

We have a key task ahead of us to define just what kind of eco-system we want. It can be mercenary and driven entirely be corporate goals. This could easily happen if Apache doesn't somehow facilitate the creation of a forum for eco-system discussion. In such an eco-system, it is to be expected that the companies that have shown a strong talent at dominating standards processes and competing in often unethical ways will dominate. My first thought when I imagine such a company is Microsoft, but that is largely based on having been on the receiving end of their business practices. I have no illusions that talent for that kind of work is exclusively found in Redmond.

In my talk, I proposed some colorful cosmological metaphors for possible worlds, but the key question is how we can build a way for different kinds of entities to talk. It is important to recognize different values and viewpoints. Apache members need to understand that not everything is based on individual action, nor do corporation hold the same values. Companies need to take a strong stance to recognize the incredible debt owed to the Apache community for creating the opportunities we all see.

If we can do this, then Hadoop (and off-spring) really does have a potential to dominate business computing.

In the first part of the talk, I made the case that Apache Hadoop has lots of head-room in terms of performance. This translates into lots of opportunity both for open source developers to make Hadoop itself better, but also for companies to build products that derive from Hadoop but improve it in various ways.

The $S$ score

In honor of Steve Jobs whose highest praise is reputedly to say "that doesn't suck", I proposed an $S$ score whose highest score is zero, but for all real systems is always negative. For a batch, data processing system like Hadoop, I proposed that a good definition of $S$ was the log base 10 of the ratio of the actual performance to the performance implied by hardware limits.

Not suprisingly, the overall score for Hadoop comes out to be somewhere between -5 to -2 depending on desired workload (i.e. Hadoop runs programs somewhere between 100 and 100,000 times slower than the hardware would allow). For some aspects, Hadoop's $S$ score can be as good as $-0.5$ but generally there are multiple choke-points and some of these are additive. This is hardly news and isn't even a mark of discredit to Hadoop since the developers of Hadoop have always prized getting things to work and to work at scale above getting things to work within an iota of the best the hardware can do at a particular scale. Another factor that drives $S$ down for Hadoop is the fact that the hardware we use has changed dramatically over the 6-7 year life of Hadoop.

In defining the value of $S$ for current Hadoop versions, I don't mean to include algorithm changes. Michael Stonebraker has become a bit famous for running down Hadoop for not doing database-like things with database-like algorithms, but I would like to stick to the question of how fast Hadoop could do what Hadoop is normally and currently used to do.

The key conclusion is that having such a low $S$ combined with high demand for Hadoop-like computation represents a lot of opportunity. This opportunity involves opportunities for the open source community to make things better. It also represents opportunities for commercial companies to make money. The latter kind of opportunity is what is going to shake up the currently cozy Hadoop community the most.

There has been a bit of demand for an expanded version of my Buzzwords keynote from a few weeks ago. This demand has been increased by a particular unfortunate mis-quote in a tweet that suggested that I thought that there was a need for a new organization "to supersede Apache". Of course, I suggested nothing of the sort so it is a good idea to walk through the ideas that I presented. The buzz words site has the video of my talk and pdf of my slides in case you want to follow along.

The talk was divided into several sections. The first one proposed the uncontroversial thesis that Hadoop performs at a level far below the potential offered by modern hardware. A second section pointed out difficulties with the current social structure surrounding the development of Hadoop and related software. I then examined what I see as possible futures while describing how I think we will be choosing between these alternative futures. I will post each section in a separate blog entry.

As I spoke, I encouraged the audience to tweet using the hash-tag #bbuzz and to keep communal notes on a shared Google document. The tweets are a bit hard to find as befits ephemeral media, but the shared notes are still accessible.

Monday, June 20, 2011

Well, Buzzwords is over and my primary conclusion is that I wish I had come last year as well as this year. Isabel and Simon are really making Buzzwords into a major open source conference and with the demise of the European ApacheCon, Buzzwords is probably the the first or second most important open source conference in Europe. If you only can choose one, I would strongly recommend geeking in Berlin. It isn't just the conference; there are bunches of related events such as informal dinners, bar camps and hackathons. Since Buzzwords makes such a strong effort to include North American participants you may even have a better chance of connecting globally by going to Europe than going to a conference in the US or Canada.

The conference itself consisted of two days of scheduled events anchored by keynotes each day. Doug Cutting gave the first keynote and covered a lot of the history and current state of Hadoop. As always, his talk was very well done and contained quite a bit of technical information which is refreshing in a keynote. I gave the second keynote and talked a bit about the state and future of Hadoop, related Apache projects and the burgeoning commercial marketplace. Some of what I said stirred up a bit of talk, which is good since my primary thesis that we aren't talking enough about how the world of Hadoop and related software is rapidly changing in ways that aren't well recognized. Stay tuned here for a blog edition of my talk.

There were quite a few excellent technical talks as well. Among the scheduled talks, Jonathan Gray gave a talk which his usual and customary dose of excellent technical information about how Facebook is using Hbase. A notable moment came when he was asked about the state of Cassandra at Facebook. Check out the upcoming video for details on his answer.

Dawid Weiss gave an excellent talk on finite state automata and the difference between deterministic and non-deterministic variants. The only defect I could see in his presentation was that we couldn't see the eagles on the coins. Based on the fact that the room was packed (I sat in the aisle on the floor) and the very eager audience questions, I would say that there is a surprisingly strong market place for information on foundational algorithms like finite state transformers.

The lightning talks at the end of day two also had some gems. Thomas Hall's northern accent blended charmingly with the frank assessment of some of his experiences with certain technical approaches. I can't possibly convey the tone and content so, yet again, you will need to refer to his slides and the video on the conference web-site.

Frank Scholten also had a lightning talk that contained a very nice walk-through of Mahout document clustering. What he showed is a work in progress, but already what he has provides a highly requested set of recipes to illustrate a lot of the software in Mahout.

Outside of the conference there was an (excellent) barcamp run by Nick Burch. I think I learned as much about how to run a barcamp by watching him as anybody did from any of the technical discussions and the technical discussions were pretty excellent.

I have to say that if you want to see me next year in early June, there is a high likelihood that you will have to be in Berlin to do it.

See http://berlinbuzzwords.de/wiki/linkstoslides to get slides from talks.

I had a great visit today at the DIMA laboratory at TU in Berlin. They are working on an interesting system called Stratosphere which provides an interesting generalization generalization of map-reduce. Of particular interest is the run-time flexibility for adapting how the flow partitions or transfers data.

They accomplish this by having a lower level abstraction layer that supports a larger repertoire of basic options beyond just map and reduce. These operations include match, cross product and co-group. Having a wider range of operations and retaining some additional flow information at that level allows them to do on-the-fly selection of the detailed algorithm for different operations based on the statistics of the data and the properties of the user-supplied functions.

Here's a pic of me answering questions about startups and log-likelihood ratio tests.

Wednesday, June 1, 2011

I have recently been working on some new features for Apache Zookeeper that represent probably the largest change in the Zookeeper client API in several years. Other recent changes such as observers and read-only clusters have changed the server-side semantics, but the client API is nearly unchanged since the original public release of Zookeeper several years ago. The JIRA covering the overall issue is Zookeeper-965 and you can look at the code as it is being refined into committable state in my fork of Zookeeper that I keep on github.

Almost from the beginning of the Zookeeper project, there have been repeated questions on the mailing about how to update several nodes at once. The answer has always been to consolidate the structures that you need to update atomically into the contents of a single znode and then update those contents atomically using standard methods. This update problem is often exacerbated by a desire to use ephemeral nodes so that state disappears correctly when a service crashes.

This is a pretty reasonable answer and it handle many important cases fairly well. For instance, in the common case of servers that are assigned certain roles and then in turn advertise which roles they have successfully taken on, this pattern can be used by giving each server a single file of assignments and a single file of advertised capabilities. Each of these files can be ephemerally created by the server so they will both vanish when the server disappears. Assignments can be added atomically and only the server ever modifies the advertised roles. Zookeeper works quite well in these cases.

Keeping track of a graph full of nodes with directional connections to other nodes is a good example of where Zookeeper's update model is not so nice. In this case, nodes have a list of connections to other nodes and a list of connections from other nodes. If any node has a list of outgoing connections that are not matched by the corresponding list of incoming connections on other nodes, then the graph is referentially inconsistent. Keeping the graph consistent under changes is not possible with normal Zookeeper updates unless you store the entire graph in a single file.

Lazy Operations

The new multi() method allows such a data structure to be updated with strong guarantees of consistency. This is done by currying the normal database mutation operators and then passing all of the resulting closures to Zookeeper at once for execution. Obviously this requires a bit of syntactic sugar in languages like Java and C which don't like to express closures directly.

The way that this works is that there is a new class call Op. There are sub-classes of Op called Op.Create, Op.SetData, Op.Check and Op.Delete that correspond to the operations normally invoked by calling the similarly named methods on a ZooKeeper object. In essence, these sub-classes of Op represent reified methods or closures that can be frozen at one point in time and then executed later. These sub-classes are instantiated using factory methods on Op class. The names and signatures of these factory methods mimic the corresponding methods on ZooKeeper.

Once you have all of the operations you would like to perform, you can execute them all in a single transaction using ZooKeeper.multi(). This will either execute all of the Op's or none of them.

An Example

I have placed an example program for doing a graph-based computation over on github. This program builds a graph consisting of 32 nodes in the form of a 5-dimensional hyper-cube and then uses a numerical technique called over-relaxation to solve for the voltages that would result if all the links in the hyper-cube were replaced by unit resistors. A picture of the graph is just to the right. In this graph, the voltage for node 0x1F is labeled as $V_5$ and the voltage for node 0x0 is labeled as $V_0$. There are lots of connections and solving this problem analytically is difficult unless you twig to the trick of using symmetry. Real-world networks generally don't exhibit such congenial properties and can be nearly impossible to solve analytically.

Normally, of course, we wouldn't actually use Zookeeper for this computation, but the point here isn't so much a realistic computation as a test-bed for the new multi() method and a demonstration of how the closure based style associated with multi-ops makes code easier to read and understand.

The Code

To read the code, start with the class ZGraphTest. This is a JUnit test that drives all the rest of the code. In this test, a graph is constructed (named zg in the code), nodes are connected in the pattern of a hyper-cube which means that nodes are labeled with integers from 0 to 31 and a node $n_1$ is connected to another node $n_2$ if $n_1$ and $n_2$ differ in exactly one bit.

In each iteration of the code, the ZGraph.reweight() method of the graph is called on a randomly selected node. This sets the value at that node to be the average of the values at the neighbors of that node. This computation converges geometrically to the correct value with errors decreasing by a factor of 5-10 with every 1000 iterations. As the actual computation proceeds the error versus the theoretically known correct values is shown every 1000 iterations and you can see the errors go to zero with 6 significant figures at about 7000 iterations.

Internally, a ZGraph keeps a connection to Zookeeper and the name of the root directory for all of the nodes in the graph. Methods like connect(), reweight() and update() all work by defining lazy update or version check operations on Node objects and then executing all of the update or version checks at one time in a transaction. For instance, in the reweight() method, this code gets the weight of all of the neighbors of node g1:

double mean = 0;

Set neighbors = g1.getIn();

List ops = Lists.newArrayList();

for (Integer neighbor : neighbors) {

Node n = getNode(neighbor);

ops.add(n.checkOp(root));

mean += n.getWeight();

}

As a side effect, we also collect a list of version check operations into the list ops. Then in this code we add one more operation to set the weight on g1 and add the update operation to the list ops:

// set up the update to the node itself

g1.setWeight(mean / neighbors.size());

ops.add(g1.updateOp(root));

Finally, multi() is called to actually do all of the version checks and updates that we have collected,

zk.multi(ops);

The essential point here is how the list of operations was collected a bit at a time and then executed all at once. The versions of the nodes that were inputs into the operation were collected in the form of check operations. When those check operations are executed along with the update of g1, we guarantee that g1's new weight will accurately reflect the average of its neighbors and if somebody changes the value of any of the neighbors in between the time that we read the value and the time we actually do the update, we will run the update again.

Similarly, and closer to the idea of maintaining referential integrity, the ZGraph.connect() collects update operations for the two nodes being connected and executes both updates together to avoid the possibility that anyone would see a situation where one node connects to a second but the second doesn't have the reverse connection. This code does the job,

One additional point to be noted is that having each Node return closures for updates that can be executed in any context the caller would like has the effect of removing all direct Zookeeper operations from the Node other than for reading or creating a node. It also removes all references to serialization of a Node from any outside code. This simplifies both the caller and the Node because the Node doesn't have to implement all plausible combinations of multiple updates and the caller doesn't have to worry about any aspect of serialization and can focus on the appropriate task of arranging the transactional semantics for updates.

Credit Where Credit is Due

Of course, this new Zookeeper capability wouldn't be possible if the Apache Zookeeper project team hadn't built Zookeeper in a very flexible way in the first place. Kudos to Ben Reed and Mahadev and the other early committers on the project.

Also, the actual Zookeeper-965 project would have stalled out if Marshall McMullen and Camille Fournier hadn't stepped in with a bit of extra momentum. I had completed the wire protocols and all of the client side work, but Marshall did all of server side work and Camille provided really excellent advice about how the changes should be done.

The final credit goes to MapR Technologies for supporting open source by allowing capabilities like multi to be contributed back.