With the recent explosion of everything related to Hadoop, it is no surprise that new projects and implementations related to the Hadoop ecosystem keep appearing. Quite a few initiatives provide SQL interfaces into Hadoop. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel. Drill is not trying to replace existing Big Data batch-processing frameworks, such as Hadoop MapReduce, or stream-processing frameworks, such as S4 or Storm. Rather, it fills an existing void: real-time interactive processing of large data sets.

Technical Detail

Similar to Dremel, the Drill implementation is based on the processing of nested, tree-like data. In Dremel this data is based on protocol buffers, a nested, schema-based data model. Drill plans to extend this data model with additional schema-based implementations, for example Apache Avro, and with schema-less data models such as JSON and BSON. In addition to querying a single data structure, Drill also plans to support "baby joins": joins to small data structures that can be loaded into memory.
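To make the nested data model concrete, here is a minimal sketch of the kind of query such an engine enables over schema-less JSON. The table name clicks and the record shape are hypothetical, chosen only for illustration; the point is that nested fields can be addressed in place instead of being flattened first:

-- Hypothetical nested records in clicks.json, one JSON object per line:
-- {"user": {"id": 17, "country": "DE"}, "events": [{"type": "view"}, {"type": "buy"}]}
-- A Dremel/Drill-style engine can reach into the nesting directly:
SELECT clicks.user.country, COUNT(*) AS sessions
FROM clicks
GROUP BY clicks.user.country;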

7.
Apache
• Many projects
– some well known … Hadoop, httpd, Tomcat, Solr
– some very obscure/inactive
• Community over Code
– lightning rod projects are fine … but not at Apache
– code is dead, community is living
• Consensus driven
– respectful debate is fine, argument is not
– find points of agreement, move forward

11.
Example Problem
• Jane works as an analyst at an e-commerce company
• How does she figure out good targeting segments for the next marketing campaign?
• She has some ideas and lots of data:
– User profiles
– Transaction information
– Access logs

12.
Solving the Problem with Traditional Systems
• Use an RDBMS
– ETL the data from MongoDB and Hadoop into the RDBMS
• MongoDB data must be flattened, schematized, filtered and aggregated
• Hadoop data must be filtered and aggregated
– Query the data using any SQL-based tool (a sketch of such a query follows this list)
• Use MapReduce
– ETL the data from Oracle and MongoDB into Hadoop
– Work with the MapReduce team to generate the desired analyses
• Use Hive
– ETL the data from Oracle and MongoDB into Hadoop
• MongoDB data must be flattened and schematized
– But HiveQL is limited, queries take too long, and BI tool support is limited
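To illustrate the cost of the RDBMS route, here is a minimal sketch of the kind of query Jane could run only after all the ETL work is done. The flattened tables and columns (flat_users, transactions, events, and their fields) are hypothetical; every nested MongoDB field would first have to be mapped onto rigid columns like these:

-- Assumes the MongoDB profiles were flattened into flat_users and the
-- raw access logs filtered and aggregated into events during ETL.
SELECT u.segment,
       COUNT(DISTINCT t.user_id) AS buyers,
       SUM(t.amount) AS revenue
FROM flat_users u
JOIN transactions t ON t.user_id = u.user_id
JOIN events e ON e.user_id = u.user_id
WHERE e.event_type = 'campaign_click'
GROUP BY u.segment;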

15.
How Does It Work?
• Drillbits run on each node, designed to maximize data locality
• Processing is done outside the MapReduce paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization, scheduling, and execution are distributed
SELECT * FROM
oracle.transactions,
mongo.users,
hdfs.events
LIMIT 1
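The one-row query above just proves the plumbing; a sketch of a more realistic federated query follows. The join keys and column names are assumptions for illustration, but the idea matches the slide: one query spanning Oracle, MongoDB, and HDFS with no prior ETL step:

-- Hypothetical columns (user_id, age_group, event_type) on the
-- three sources named on the slide:
SELECT u.age_group, COUNT(*) AS ad_driven_purchases
FROM mongo.users u
JOIN oracle.transactions t ON t.user_id = u.user_id
JOIN hdfs.events e ON e.user_id = u.user_id
WHERE e.event_type = 'ad_click'
GROUP BY u.age_group;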

23.
Let’s hack!
Get code
• git clone https://github.com/apache/incubator-drill.git
• cd incubator-drill/sandbox/prototype
• git checkout 9f69ed0
• mvn clean install
Run Sqlline
• ./sqlline
• sqlline> !connect jdbc:optiq:model=common/target/test-classes/donuts-model.json admin admin
Try out queries
• select * from donuts;
• Try other queries. Note that because donuts.json (our input file) is a JSON file, the SQL parser cannot figure out ahead of time what fields exist in a record. As a result, we have to use the virtual field _map, which always exists for every record, and then access fields dynamically, as in the query below:
• select donuts._map['name'], donuts._map['ppu'], donuts._map['batters'] from donuts;
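Building on the same pattern, here is one more query to try. That the prototype's parser accepts a WHERE clause over _map fields is an assumption on my part (the 'type' field is taken from the standard donuts.json test file); if it fails, fall back to simple projections as above:

-- Assumes WHERE predicates on _map fields work in this prototype build:
select donuts._map['name'], donuts._map['ppu'] from donuts where donuts._map['type'] = 'donut';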