Archive for the ‘Crunch’ Category

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: the very first release of a fully integrated Big Data management distribution built on the most advanced Hadoop 2.x release currently available, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which uses Bigtop's packaging code. Cloudera takes pride in developing open source packaging code and contributing it back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

Apache ZooKeeper 3.4.5

Apache Flume 1.3.1

Apache HBase 0.94.5

Apache Pig 0.11.1

Apache Hive 0.10.0

Apache Sqoop 2 (AKA 1.99.2)

Apache Oozie 3.3.2

Apache Whirr 0.8.2

Apache Mahout 0.7

Apache Solr (SolrCloud) 4.2.1

Apache Crunch (incubating) 0.5.0

Apache HCatalog 0.5.0

Apache Giraph 1.0.0

LinkedIn DataFu 0.0.6

Cloudera Hue 2.3.0

And we were just talking about YARN and applications, weren’t we? 😉

Enjoy!

(Participate if you can, but at least send a note of appreciation to Cloudera.)

Crunch is used for processing data. Crunch builds on top of Apache Hadoop to provide a simpler interface for Java programmers to process data. In Crunch you create pipelines, not unlike Unix pipelines, such as the command below:

Interesting coverage of Crunch.

I don’t know that I agree with the characterization:

… using Hadoop …. require[s] learning a complex process called MapReduce or a higher level language such as Apache Hive or Apache Pig.

True, using Hadoop means learning MapReduce, Hive, or Pig, but I don’t think of them as being all that complex. Besides, once you have learned them, the benefits are considerable.

Rather than asking only the usual questions (how to make this faster, how to get more storage, and so on), all of which are important, ask the more difficult questions:

In or between which of these elements, would human analysis/judgment have the greatest impact?

Would human analysis/judgment be best made by experts or crowds?

What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)

Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)