Posted by Soulskill on Tuesday February 26, 2013 @04:51PM
from the if-you-want-something-done-right-do-it-yourself dept.

Nerval's Lobster writes "The Apache Hadoop open-source framework specializes in running data applications on large hardware clusters, making it a particular favorite among firms such as Facebook and IBM with a lot of backend infrastructure (and a whole ton of data) to manage. So it'd be hard to blame Intel for jumping into this particular arena. The chipmaker has produced its own distribution for Apache Hadoop, apparently built 'from the silicon up' to efficiently access and crunch massive datasets. The distribution takes advantage of Intel's work in hardware, backed by the Intel Advanced Encryption Standard (AES) Instructions (Intel AES-NI) in the Intel Xeon processor. Intel also claims that a specialized Hadoop distribution riding on its hardware can analyze data at superior speeds—namely, one terabyte of data can be processed in seven minutes, versus hours for some other systems. The company faces a lot of competition in an arena crowded with other Hadoop players, but that won't stop it from trying to throw its muscle around."

The performance claim in the summary seems to come from page 15 of this presentation [intel.com], where the speedup for a 1TB sort (presumably distributed) is 4 hours -> 7 minutes. I can't find the details for that test, but most of the speedup comes from better hardware (a faster CPU and network adapter, and SSDs instead of HDDs). Switching to Intel's Hadoop distribution from some other Hadoop distribution accounts for only a 40% speedup, which is a fairly modest gain.
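For a rough sense of the split, here is a back-of-the-envelope sketch in Python. It assumes the hardware and software gains multiply, which the presentation does not confirm, and it tries two readings of "40% speedup" since the slide's meaning is unclear. The function names are made up for illustration.

```python
def speedup(before_minutes, after_minutes):
    """Overall speedup factor: old runtime divided by new runtime."""
    return before_minutes / after_minutes


def hardware_factor(total_factor, software_factor):
    """Share of the speedup left for hardware, ASSUMING the two factors multiply."""
    return total_factor / software_factor


# Headline claim: 4 hours -> 7 minutes for the 1TB sort.
total = speedup(4 * 60, 7)

# Reading 1 (assumption): "40% speedup" means 1.4x the throughput.
software_throughput = 1.4
# Reading 2 (assumption): "40% speedup" means 40% less runtime.
software_runtime = 1 / (1 - 0.40)

print(f"total speedup:           {total:.1f}x")
print(f"hardware (reading 1):    {hardware_factor(total, software_throughput):.1f}x")
print(f"hardware (reading 2):    {hardware_factor(total, software_runtime):.1f}x")
```

Under either reading, the hardware accounts for something like a 20x to 25x factor out of roughly 34x overall, so the distribution itself is the small part of the headline number.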

The biggest performance benefit of Spark comes from avoiding disk and network access, so faster disks and networking will presumably narrow Spark's lead over Hadoop somewhat. But it's hard to say how well Spark would do with this particular hardware and test setup. I would guess it's still much faster than their Hadoop distribution. (Note: I'm a Spark power user but not an expert in its performance.)

Yeah, the details in that presentation describe something far less impressive than the top-line "4 hours -> 7 minutes" claim. You are absolutely correct that only a very modest amount of the ~35x speedup claimed is attributable to the Intel Hadoop distribution itself, with the bulk of the speedup coming from significant hardware upgrades across the cluster. Spark wouldn't benefit from the hardware changes in exactly the same way, but it would still see significant gains from upgrading the cluster hardw

It's impossible to say without the details of apples-to-apples comparisons, but superficially, none of the announcements of "improved Hadoop" from Intel, Greenplum, Hortonworks, etc. is all that impressive in comparison to Spark even if you assume that none of their improvements can or will be integrated into Spark. Take, for example, a couple of the claims that Intel is making for their new Hadoop distribution. First, the "four hour job reduced to seven minutes" claim is the same ballpark 30-40x claim ma