Big data and analytics go hand-in-hand, yet the technology for processing it all isn’t at that level. Not just yet. Apache Drill, based upon Google Dremel, aims to change things. Is it a Hadoop killer? It doesn’t look like it…

The enterprise need to analyse large-scale
datasets has grown over the past few years, and it’s now the norm
to have to deal with phenomenally large volumes of data. But as this
demand has increased, so has the demand for something that is as near
to real-time as possible, yet still provides comprehensive
analysis.

Currently, the standard tools, such as the batch-processing Hadoop and the stream-processing
Storm, don’t allow this, although the groundwork has been put in
place (by the Apache Hadoop 2.0 roadmap) to make the process more
intuitive and provide deep analytic reporting at the click of the
fingers.

Over the past two years, the Apache Software Foundation has
assumed the caretaker role at the centre of Big Data evolution, and
a new project has entered the Apache Incubator seeking to push the
boundaries further for data-intensive operations.

Inspired by Google’s
internal interactive system Dremel, Apache
Drill will be a distributed system that scales
across 10,000 servers and processes petabytes of data from
trillions of rows in seconds. The aim is to shift away from
established inefficient methods of processing to one that is
flexible and can process nested data without too much heavy
lifting.

Apache Drill comes from MapR Technologies, a
Hadoop vendor that differs from the competition in that, rather
than focusing on making the Apache Hadoop codebase
as strong as it can be, it chose to develop its own advanced
flavour and make that available commercially. It’s not surprising
to see MapR wanting to push the project at an open source
level, to gain the approval
of its supposed competitors and welcome them to become a part of
it. After all, the key players want the
ecosystem to be as healthy as possible.

The proposal notes the problems that Apache
Hadoop faces in exploring data at sub-second speeds, having opted for high throughput first. Pushing Drill to
the largest open source haven brings it closer to being the de facto standard for
data digging, and the only way to do
that is to tap into the entire community. There’s no better
environment currently than Apache for that.

Like Dremel, Drill doesn’t intend to replace MapReduce, the
processing method at the heart of Hadoop, which Google still uses to
complement Dremel. In fact, Drill is meant to work in conjunction
with the already established technique. The committers behind Drill
also realise that something of Dremel’s scale, pulling together an
array of query languages and data formats, has yet to be
attempted at an open source level. Consequently, it could spend a fair bit
of time within the Apache Incubator before its initial
release.

Apache Drill is split into four key components, which the
team say will form the bulk of the next move, ensuring that all
four layers are implemented. They are:

Query languages – responsible for parsing queries and
constructing the execution plan. Initially this will support
Dremel’s SQL-like language and Google BigQuery, but further
along expect the likes of MongoDB’s query language and Cascading to
feature.

An execution engine, providing the necessary scalability
and fault tolerance needed to query the vast amount of
data.

Support for nested data formats such as JSON and BSON,
alongside flat formats like CSV, amongst others.

Support for scalable data sources, with an initial plan
to leverage Hadoop.
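
To give a feel for what querying nested data involves, here is a minimal Python sketch. It is not Drill code (the project is still a proposal), and the record layout, field names and helper functions are all hypothetical. It simply walks a repeated field inside JSON-style records and aggregates a value per user, which is roughly what an SQL-like query over nested data would express in a single statement.

```python
import json

# Hypothetical nested records; the "visits" field repeats inside each record.
RAW = [
    '{"user": "a", "visits": [{"page": "/home", "ms": 120}, {"page": "/docs", "ms": 300}]}',
    '{"user": "b", "visits": [{"page": "/home", "ms": 80}]}',
    '{"user": "c", "visits": []}',
]
RECORDS = [json.loads(line) for line in RAW]


def select_nested(records, repeated_field, leaf):
    """Yield (user, value) for each leaf value inside a repeated nested field."""
    for record in records:
        for item in record.get(repeated_field, []):
            yield record["user"], item[leaf]


def total_by_user(records, repeated_field, leaf):
    """Sum a nested numeric leaf per user (users with no items are omitted)."""
    totals = {}
    for user, value in select_nested(records, repeated_field, leaf):
        totals[user] = totals.get(user, 0) + value
    return totals


print(total_by_user(RECORDS, "visits", "ms"))
```

The point of the sketch is that the data never has to be flattened into rows first; a system of the kind described would have to do the same thing across many servers rather than in a single loop.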

Whilst MapR will undoubtedly lead the project (through the
expertise of Hadoop veteran Ted Dunning and execution specialist
Tomer Shiran), other companies such as Drawn to Scale and Concurrent will sit alongside as early committers. The design documents
will be housed within MapR repositories.

It’s a very bold proposition, but a necessary one for the world of Big Data should
it want to progress in line with the demand of the enterprises who
support the community. This proposal is still in its infancy so it
could be some time before we see a defined architecture, but early signs
are promising. The initial proposal slides provide more detail
on Drill itself.

The potential community interested in something like this is
huge, linking into other Big Data focused projects. It will also be
interesting to see Drill grow to include such projects and whether
or not it will adopt an inclusive policy, or choose only the cream
of the crop. Hadoop, Avro, Hive and HBase have already been touted
as close bedfellows – now we wait to see who else will
jump on board…