At the end of January I participated in a panel discussion on Big Data, held during the CISCO live event in London. One of my fellow panelists, I believe it was Sean of CISCO, said something along these lines:

This stuck in my head and I gave it some thought. In the following I will elaborate a bit on it in the context of Hadoop being used in a shared setup, for example in hosted offerings or, say, within an enterprise that runs different systems such as Storm, Lucene/Solr, and Hadoop on one cluster.

In essence, we witness two competing forces: the perspective of a single user, who expects performance, vs. the view of the cluster owner or operator, who wants to optimise throughput and maximise utilisation. If you’re not familiar with these terms you might want to read up on Cary Millsap’s Thinking Clearly About Performance (part 1 | part 2).
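To make the tension a bit more tangible, here is a minimal sketch. It assumes the simplest textbook queueing model, a single-server M/M/1 queue, where the mean response time is R = S / (1 - ρ) for a service time S and a utilisation ρ. The model and the function names are my assumptions for illustration only; a real cluster with many nodes and mixed workloads is far messier.

```python
def response_time(service_time: float, utilisation: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    Assumption: a single server with random arrivals and service times.
    """
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time / (1.0 - utilisation)


def throughput(service_time: float, utilisation: float) -> float:
    """Completed requests per time unit at the given utilisation."""
    if service_time <= 0.0:
        raise ValueError("service_time must be positive")
    return utilisation / service_time


if __name__ == "__main__":
    service_time = 1.0  # hypothetical: one second of pure work per request
    for rho in (0.5, 0.8, 0.9, 0.95):
        print(
            f"utilisation={rho:.2f}  "
            f"throughput={throughput(service_time, rho):.2f}/s  "
            f"response_time={response_time(service_time, rho):.1f}s"
        )
```

The operator's numbers (utilisation, throughput) improve steadily as the load goes up, while the single user's number (response time) degrades faster and faster the closer the system gets to full utilisation.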

Now, in such a shared setup we may experience a spectrum of loads: from compute-intensive through I/O-intensive to communication-intensive, illustrated in the following, not overly scientific figure:

Here are some observations and thoughts as potential starting points for deeper research or experiments.

Multitenancy. We see more and more deployments that require strong support for multitenancy; check out the CapacityScheduler, learn from best practices or use a distribution that natively supports the specification of topologies. Additionally, you might still want to keep an eye on Serengeti, VMware’s Hadoop virtualisation project, which seems to have gone quiet in the past months, but I still have hope for it.

Software Defined Networks (SDN). See Wikipedia’s definition for it, it’s not too bad. CISCO, for example, is very active in this area, and only recently there was a special issue of IEEE Communications Magazine (February 2013) covering SDN research. I can perfectly see, and indeed this was also briefly discussed on our CISCO live panel back in January, how SDN can enable new ways to optimise throughput and performance. Imagine an SDN that is dynamically workload-aware in the sense that it knows the difference between a node that runs a task tracker vs. a data node vs. a Solr shard: it should be possible to transparently improve the operational parameters so that everyone involved, both the users and the cluster owner, benefits from it.
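As a thought experiment, the workload-aware idea could be sketched roughly as follows. Everything here is hypothetical: the mapping of roles to load types, the weights, and the function `bandwidth_shares` are made up for illustration and do not correspond to any real SDN controller API. The sketch only shows the principle that a controller which knows each node's role could divide a shared link accordingly, instead of treating all nodes alike.

```python
from enum import Enum
from typing import Dict


class Load(Enum):
    COMPUTE = "compute-intensive"
    IO = "I/O-intensive"
    COMMUNICATION = "communication-intensive"


# Hypothetical: which kind of load dominates on a node with a given role.
ROLE_PROFILE: Dict[str, Load] = {
    "tasktracker": Load.COMPUTE,
    "datanode": Load.IO,
    "solr-shard": Load.COMMUNICATION,
}

# Hypothetical: relative network weight per load type.
WEIGHTS: Dict[Load, int] = {
    Load.COMPUTE: 1,
    Load.IO: 2,
    Load.COMMUNICATION: 3,
}


def bandwidth_shares(
    roles_by_node: Dict[str, str], link_capacity_mbps: float
) -> Dict[str, float]:
    """Split a shared link among nodes, weighted by each node's load profile."""
    weights = {
        node: WEIGHTS[ROLE_PROFILE[role]] for node, role in roles_by_node.items()
    }
    total = sum(weights.values())
    return {
        node: link_capacity_mbps * weight / total
        for node, weight in weights.items()
    }


if __name__ == "__main__":
    cluster = {"node1": "tasktracker", "node2": "datanode", "node3": "solr-shard"}
    for node, share in bandwidth_shares(cluster, 6000.0).items():
        print(f"{node}: {share:.0f} Mbit/s")
```

A dynamic version would recompute the shares whenever a node changes its role or the observed load shifts, which is where the programmability of an SDN would come in.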

As usual, I’m very interested in what you think about the topic, and I’m looking forward to learning about resources in this space from you.