2011-10-10

Oracle and Hadoop part 2: Hardware -overkill?

I've been looking at what Oracle say you should run Hadoop on and thinking "why?"

I don't have the full specs, since all I've seen is slideware on The Register implying this is premium hardware, not just x86 boxes with as many HDDs as you can fit in a 1U with 1 or 2 multicore CPUs and an imperial truckload of DRAM. In particular, there's mention of InfiniBand in there.

InfiniBand? Why? Is it to spread the story that the classic location-aware schedulers aren't adequate on "commodity" 12-core 64GB 24TB servers with 10GbE interconnect? Or are there other plans afoot?

Well, one thing to consider is the lead time for new rack-scale products: some of this exascale stuff will predate the Oracle takeover, and the design goals at the time, "run databases fast", met Larry's needs more than the rest of the Sun product portfolio -though he still seems to dream of making a $ or two from every mobile phone on the planet.

The question for Oracle has been "how to get from hardware proto to shipping what they can say is the best Oracle server." Well, one tactic is to identify the hardware that runs Oracle better and stop supporting it. It's what they are doing against HP's servers, and will no doubt try against IBM when the opportunity arises. That avoids all debate about which hardware to run Oracle on. It's Oracle's. Next question? Support costs? Wait and see.

While all this hardware development was going on, the massive-low-cost GFS/HDFS filesystem with integrated compute was sneaking up on the sides. Yes, it's easy to say -as Stonebraker did- that MapReduce is a step backwards. But it scales, not just technically, but financially. Look at the spreadsheets here. Just as Larry and Hurd -who also seemed over-fond of the Data Warehouse story- are getting excited about Oracle on Oracle hardware, somewhere your data enters but never leaves(*), somebody has to break the bad news to them that people have discovered an alternative way to store and process data. One that doesn't need ultra-high-end single server designs, one that doesn't need Oracle licenses, and one that doesn't need you to pay for storage at EMC's current rates. That must have upset Larry, and kept him and the team busy on a response.

What they have done is defensive: Hadoop as a way of storing the low value data near the Oracle RDBMS, for you to use as the Extract-Transform-Load part of the story. It does fit in there, as you can offload some of the grunge work to lower-end machines, and the storage to SATA. It's no different from keeping log data in HDFS but the high value data in HBase on HDFS, or -better yet IMO- HP Vertica.

For that story to work best, you shouldn't overspec the hardware with things like InfiniBand. So why has that been done?

Hypothesis 1: margins are better, and it helps stop people going to vendors (HP, Rackable) that can sell the servers that work best in this new world.

Hypothesis 2: Oracle's plans in the NoSQL world depend on this interconnect.

Hypothesis 3: Hadoop MR can benefit from it.

Is Hypothesis 3 valid? Well, in Disk-Locality in Datacenter Computing Considered Irrelevant [Ananthanarayanan2011], Ganesh and colleagues argue that improvements in in-rack and backplane bandwidth will mean you won't care whether your code is running on the same server as your data, or even the same rack as your data. Instead you will worry about whether your data is in RAM or on HDD, as that is what slows you down the most. I agree that even today on 10GbE rack-local is as fast as server-local, but if we could take HDFS out of the loop for server-local FS access, that difference may reappear. And while replacing Ethernet's classic Spanning Tree forwarding algorithm with something like TRILL would be great, it's still going to cost a lot to get a few hundred Terabits/s over the backplane, so in large clusters rack-local may still have an edge. If not, well, there's still the site-local issue that everyone's scared of dealing with today.
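To put a rough number on "a few hundred Terabits/s", here is a back-of-envelope sketch. The class name, method name and cluster sizes are my own illustrative assumptions, not figures from the paper or from any vendor; all it does is multiply server count by NIC speed, which is what the fabric would have to carry if every server pulled remote data at line rate at once.

```java
public class BackplaneEstimate {

    /**
     * Aggregate bandwidth in Tb/s needed if every server sends or
     * receives at its full NIC rate at the same time.
     * Hypothetical helper: it ignores oversubscription, protocol
     * overhead and topology, so treat it as an upper bound only.
     */
    public static double aggregateTbps(int servers, double nicGbps) {
        return servers * nicGbps / 1000.0;
    }

    public static void main(String[] args) {
        // Assumed cluster sizes: one rack, a mid-sized cluster, a very large one.
        int[] clusterSizes = {40, 4000, 20000};
        for (int n : clusterSizes) {
            System.out.printf("%d servers at 10GbE: %.1f Tb/s%n",
                    n, aggregateTbps(n, 10.0));
        }
    }
}
```

A single rack of 40 servers is well under a Terabit, which is why rack-local is cheap; 20,000 servers at 10GbE comes to 200 Tb/s, which is the kind of backplane nobody gets for free.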

Of more immediate interest is Can High-Performance Interconnects Benefit Hadoop Distributed File System?, [Sur2010]. This paper looks at what happens today if you hook up a Hadoop cluster over InfiniBand. They showed it could speed things up, even more if you went to SSD. But go there and -today- you massively cut back on your storage capacity. It is an interesting read, though it irritates me that they fault HDFS for not using the allocateDirect feature of Java NIO, and didn't file a bug or fix for that. See a problem, don't just write a paper saying "our code is faster than theirs". Fix the problem in both codebases and show the speedup is still there -as you've just removed one variable from the mix.
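For anyone who hasn't met the NIO feature in question, here is a minimal sketch of the difference between a heap buffer and a direct one. This is not HDFS code and not the code from the paper; the class and method names are made up for illustration. The only real API calls are ByteBuffer.allocate, ByteBuffer.allocateDirect and isDirect. A direct buffer lives outside the Java heap, so channel I/O can typically hand it to the OS without first copying the bytes through an intermediate buffer.

```java
import java.nio.ByteBuffer;

public class DirectBufferSketch {

    /** Ordinary buffer backed by a byte[] on the Java heap. */
    public static ByteBuffer heapBuffer(int size) {
        return ByteBuffer.allocate(size);
    }

    /** Buffer allocated outside the Java heap, intended for native I/O. */
    public static ByteBuffer directBuffer(int size) {
        return ByteBuffer.allocateDirect(size);
    }

    public static void main(String[] args) {
        // 64KB is an arbitrary size picked for the example.
        ByteBuffer heap = heapBuffer(64 * 1024);
        ByteBuffer direct = directBuffer(64 * 1024);

        System.out.println("heap buffer:   isDirect=" + heap.isDirect()
                + " capacity=" + heap.capacity());
        System.out.println("direct buffer: isDirect=" + direct.isDirect()
                + " capacity=" + direct.capacity());
    }
}
```

Direct buffers cost more to allocate and free than heap ones, so the usual advice is to allocate a few and reuse them rather than creating one per packet. Whether that closes the gap reported in the paper is exactly the experiment I'd like to have seen.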

Anyway, even with that paper, 10GbE looks pretty good. It'll be built in to the new servers, and if the NIO can be fixed, its performance may get even closer to InfiniBand. You'd then have to move to alternate RPC mechanisms to get the latency improvements that InfiniBand promises.

Did the Oracle team have these papers in mind when they did the hardware? Unlikely, but they may have felt that IB offers a differentiator over 10GbE. Which it does, in cost terms, limits of scale and complexity of bringing up a rack.

They'd better show some performance benefits for that -either in Hadoop or the NoSQL DB offering they are promising.

(*) People call this the "roach motel" model, but I prefer to refer to Northwick Park Hospital in NW London. Patients enter, but they never come out alive.