The number of SQL options for Hadoop has expanded substantially over the last 18 months. Most get a great deal of attention when announced, but a few slip under the radar. One of these low-flying options is Apache Tajo, which I learned about at a Hadoop User Group meeting in November 2013.

Billed as a data warehousing system for big data on Hadoop, Tajo began development in 2010 and moved to the Apache Software Foundation in March 2013, where it is currently incubating. Its primary development sponsor is Gruter, a big data infrastructure startup in South Korea. Despite the lack of public awareness, Tajo has a fairly robust feature set:

SQL compliance

Fully distributed query processing against HDFS and other data sources

ETL feature set

User-defined functions

Compatibility with HiveQL and Hive MetaStore

Fault tolerance through a restart mechanism for failed tasks

Cost-based query optimization and an extensible query rewrite engine

Things get interesting when comparing performance against Apache Hive and Cloudera Impala. SK Telecom, the largest telecommunications provider in South Korea, tested Tajo, Hive, and Impala using five sample queries, running Hive 0.10 and Impala 1.1.1 on CDH 4.3.0. The test data was 1.7TB in size, and query results were 8GB or smaller. (The following images were taken from the presentation in the previous link.)

Query 1: Heavy scan with 20 text matching filters

Query 2: 7 unions with joins

Query 3: Simple joins

Query 4: Group by and order by

Query 5: 30 pattern matching filters with OR conditions using group by, having and sorting

What do these results indicate? Clearly, different SQL-on-Hadoop implementations have different performance characteristics. Until these options mature to be truly multi-purpose, selecting a single option may not result in the best overall performance. Also, these benchmarks are for a specific set of use cases – not your use cases. The tested queries may have no relevance to your data and how you’re using it.

The other important takeaway is the absolute performance of these options. The sample data set and results are small in modern terms, yet none of the results are astounding relative to a modern data warehouse or RDBMS. There’s a difference between “fast” and “fast for Hadoop.” Cloudera appears to be making some headway, but a lot of ground must be covered before any Hadoop distribution is competitive with the systems vendors claim to be replacing.

Thoughts on “Apache Tajo Enters the SQL-on-Hadoop Space”

Like most SQL-on-Hadoop benchmarks, the testing leaves us with more questions than answers.

First, they tested against Hive 0.10 while the shipping version is Hive 0.12 + Tez + YARN. 15-yard penalty and loss of down.

Second, in query 1 they tested 1.7TB scattered across six 3TB spindles per node times 6 nodes. That’s 108TB of raw disk to hold 1.7TB of data. Kinda unrealistic. What we need to see is a Hadoop benchmark of 100TB of data or more: scan all 100TB, then have another query do a massive join of 50TB to 10TB. Customers should not assume real scalability based on easy benchmarks.
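The storage math in this point is easy to check; below is a minimal sketch using the node and spindle counts quoted in the comment (figures are from the comment itself, purely illustrative):

```python
# Raw disk capacity of the benchmark cluster described above:
# 6 nodes, each with six 3TB spindles, holding a 1.7TB data set.
nodes = 6
spindles_per_node = 6
spindle_tb = 3.0
data_tb = 1.7

raw_capacity_tb = nodes * spindles_per_node * spindle_tb
utilization = data_tb / raw_capacity_tb

print(raw_capacity_tb)        # 108.0 -- TB of raw disk
print(round(utilization, 3))  # 0.016 -- the data fills under 2% of the disks
```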

Last, with most of the queries spreading 8GB of data across 384GB of memory (6 nodes × 64GB), all of the test data sits in memory, around 1.3GB per node. This tests code path speed but not disk access, which we know is the slowest component.
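A quick sketch of the memory arithmetic in this point, using the figures quoted in the comment (illustrative only):

```python
# Cluster memory vs. the working set in the benchmark:
# 6 nodes with 64GB RAM each, against roughly 8GB of query results.
nodes = 6
ram_per_node_gb = 64
working_set_gb = 8.0

total_ram_gb = nodes * ram_per_node_gb
per_node_gb = working_set_gb / nodes

print(total_ram_gb)           # 384 -- GB of RAM across the cluster
print(round(per_node_gb, 2))  # 1.33 -- GB per node, easily held in memory
```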

IMO — lacking audited TPC benchmarks, customers should do their own comparisons — at scale.

I see your point about the Hive version to a certain extent, but Hive isn’t the story here and it won’t be until Tez matures. I’d like to see Shark as part of this comparison. Hopefully someone from Gruter will respond regarding your other comments.

Thanks for all the good points and feedback on our test results.
I’ve added inline comments and hope they are helpful.

> First, they tested against Hive 0.10 while the shipping version is Hive 0.12 + Tez + YARN. 15-yard penalty and loss of down.

The test was performed at the end of October last year. At that time, Hive on Tez was still under development, so there was no chance to test it, and the latest stable Hive version in CDH was 0.10. Yes, absolutely, Hive on Tez is a must-test item in our test plan.

> Second, in query 1 they tested 1.7TB scattered across six 3TB spindles per node times 6 nodes. That’s 108TB of raw disk to hold 1.7TB of data. Kinda unrealistic. What we need to see is a Hadoop benchmark of 100TB of data or more: scan all 100TB, then have another query do a massive join of 50TB to 10TB. Customers should not assume real scalability based on easy benchmarks.
> Last, with most of the queries spreading 8GB of data across 384GB of memory (6 nodes × 64GB), all of the test data sits in memory, around 1.3GB per node. This tests code path speed but not disk access, which we know is the slowest component.

This is not in the slides, but as I mentioned in my presentation at the HUG meetup, for each iteration of the test we dropped the OS caches to remove any memory effects from the previous iteration. Regarding the data set size, I agree with your point: like most other similar tests, the data set is still small relative to the hardware capacity. We will share more test results using a bigger data set soon.

> IMO — lacking audited TPC benchmarks, customers should do their own comparisons — at scale.

I understand that internal test results from a vendor, rather than a trusted third party, can be viewed with skepticism. Still, I hope they are helpful at least as a reference for anyone interested in the Tajo project. The test was actually done by one of our clients, SKT, and I introduced it as a case study. Fortunately, Apache Tajo is an open source project, so anyone can set it up and run a benchmark for their own use case — which means anyone can discuss or debate the results further. I believe such discussions are also essential for the Apache Tajo project.

@Nick,
First of all, I’d like to thank you for your post on Tajo.
Regarding Shark, we definitely would like to include it in our next test and share the results.


