Eva Andreasson on Hadoop, the Hadoop Ecosystem, Impala


Bio Eva Andreasson has been working with JVMs, SOA, Cloud, etc. for 10 years. At JRockit she got two patents on GC heuristics and algorithms. She also pioneered Deterministic GC which was productized as JRockit Real Time. After two years as the PM for Zing at Azul Systems, she joined Cloudera in 2012 to help drive the future of distributed data processing through Cloudera's Distribution of Hadoop.

GOTO Aarhus is a premier software development conference designed for developers, team leads, architects, and project managers. At GOTO, the program is created for developers by developers. Our concept has always been to present the latest developments as they become relevant and interesting for the software development community.

That’s a good question. I work as a Product Manager for Cloudera and I live in Silicon Valley. Cloudera is a small startup, it’s very exciting; we work on the best distribution of Hadoop and its ecosystem. But I have a past as a JVM Garbage Collection developer as well: I worked a long time on JRockit and also on Zing, and, whatever your opinion of patents, I happen to hold some in self-learning Garbage Collection too.

Cloudera is a very exciting startup in Silicon Valley. I joined Cloudera to work with a technology called Hadoop. Cloudera is the main distributor of Apache Hadoop and the ecosystem projects around Hadoop. Maybe a good analogy is: as Red Hat is to Linux, Cloudera is to Hadoop.

Good question. You have probably heard about the big data challenge happening all around the world; it has just grown over the last five years. Big data is about volume, velocity and variety: we do more business online, your clothes start sending messages, your home or car generates more events. Traditional systems have a challenge keeping up with the volume and the new data types, especially cost-wise when it comes to fitting new types of data into traditional data models (traditional systems aren’t as flexible), and also when it comes to scale: being able to handle ten times the load, or an unknown amount of data growth over the next three years. Companies started thinking, "we can’t afford to scale up the traditional way, we need a more cost-efficient, scalable model." Hadoop came along at the same time as these problems became very evident, and Hadoop is of course the answer to those problems.

Very good question, I get that a lot: what is Hadoop? Well, Hadoop consists of two projects. One is the Hadoop Distributed File System (HDFS); that is where you actually put your data, stored in little chunks across a whole cluster of commodity hardware in a distributed fashion. MapReduce is the processing framework: a Java framework where you implement the algorithms you want to apply to your data, and when you kick off a MapReduce job, it kicks off gazillions of JVMs that process each piece of data in parallel and aggregate the result back to you. So you scale out linearly by adding nodes, where more little pieces of your data can live and more JVM-based MapReduce processing can happen. That is the simplest way I can explain Hadoop.
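The flow she describes, map over chunks in parallel and then aggregate the results, can be sketched in miniature. This is a toy Python model of the programming paradigm, not the actual Hadoop Java API:

```python
from collections import defaultdict

def map_phase(chunk):
    # Each mapper sees one chunk of the raw data and emits (key, value) pairs.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # The framework groups all values emitted for the same key together.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Each reducer aggregates the values for one key.
    return {key: sum(values) for key, values in grouped.items()}

# The input is split into chunks, much as HDFS splits files into blocks
# that live on different nodes; each chunk could be mapped on its own node.
chunks = ["big data big", "data big deal"]
result = reduce_phase(shuffle([map_phase(c) for c in chunks]))
# result == {"big": 3, "data": 2, "deal": 1}
```

In real Hadoop the map and reduce functions run in separate JVMs on the nodes that hold the data blocks; the toy above only preserves the shape of the computation.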

It’s a distributed file system; the raw data files can be stored there as-is. That is the whole magic: you don’t have to decide on a structure for your data at storage time. Instead, Hadoop gives you the flexibility and ability to apply the structure at query time, meaning when you actually want to do something with your data, you apply the structure, extract the data of relevance from your entire raw data set, and then do whatever processing you want, through MapReduce or other ways. As for the key/value store you mention: there is another component in the Hadoop ecosystem called HBase, which is the key/value store in Hadoop land, but it’s actually based on HDFS, so the storage layer underneath HBase is still the distributed file system.
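Schema-on-read, storing raw files as-is and imposing structure only when you query, can be illustrated like this (a conceptual Python sketch with made-up log data, not Hadoop code):

```python
# Raw log lines are stored exactly as they arrived; no schema was declared
# at storage time, just as HDFS stores raw files.
raw_store = [
    "2013-10-01 alice login",
    "2013-10-01 bob purchase",
    "2013-10-02 alice logout",
]

def query_events(raw_lines, action):
    # The structure (date, user, action) is applied here, at query time.
    results = []
    for line in raw_lines:
        fields = line.split()
        record = {"date": fields[0], "user": fields[1], "action": fields[2]}
        if record["action"] == action:
            results.append(record)
    return results

logins = query_events(raw_store, "login")
# logins == [{"date": "2013-10-01", "user": "alice", "action": "login"}]
```

If tomorrow the logs gain new fields, nothing stored has to change; only the parsing applied at query time does.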

It’s more like get and put over specific keys, so you can scan a lot of data just based on keys and get all the data connected with a key back to, say, your web-serving application. HBase, for instance, is often used for catalogue lookups or web click streams. Say you want to find every click that a certain set of your online customers have made on your web page, to trace which paths are more popular; HBase is very often used to store all the clicks on the website. Now start thinking about what you can do with that data: you can understand where people go, when and why, and where they struggle, to optimize the user experience on your website. Pretty amazing when it comes to personalized experience and marketing.
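The get/put-over-keys access pattern can be sketched with a sorted key/value map. HBase keeps rows sorted by row key, which is what makes scans like "all clicks for one user" cheap; the composite key scheme below is a common illustration, not the HBase client API:

```python
# Row keys are often composites like "user#timestamp" so that one user's
# clicks sort next to each other and can be fetched with one range scan.
store = {}

def put(row_key, value):
    store[row_key] = value

def scan_prefix(prefix):
    # Return all rows whose key starts with the prefix, in key order.
    return [(k, store[k]) for k in sorted(store) if k.startswith(prefix)]

put("alice#001", "/home")
put("alice#002", "/checkout")
put("bob#001", "/home")

alice_clicks = scan_prefix("alice#")
# [("alice#001", "/home"), ("alice#002", "/checkout")]
```

The real HBase client exposes `Get`, `Put` and `Scan` operations with the same flavor, but distributed across region servers backed by HDFS.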

That is the magic, right? The processing is actually brought to your data; it’s executed on the same platform. This switches around the whole concept of how the world worked before, where you brought your data to your processing, so you had different infrastructure pieces you needed to move your data between to achieve your business use case. Now Hadoop brings the workload to your data, so it’s executed right on all those file pieces stored in HDFS; MapReduce executes right on top of the data nodes.

Werner: So the data nodes that store data also get the MapReduce logic on them?

Yes, simply speaking. Let me explain: first of all, there are tools built on top of MapReduce. Once the MapReduce project came around, you could actually process your data in HDFS, but MapReduce still required you to program a MapReduce Java job; you had to be a programmer to do that. Of course, very quickly that opened the door to the question of how to simplify this for categories of users who don’t necessarily come from a programming background but more from a BI or DBA background. So the projects named “Hive” and “Pig” came around to facilitate the SQL option of accessing your data. They are SQL-based query engines, based on MapReduce, that run on Hadoop as well. They translate your traditional SQL query into a MapReduce job, and you can use your standard JDBC or ODBC connectors to connect your favorite BI tool and do your queries that way, over much more data, with different structure, than traditional systems can handle.
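Conceptually, what Hive does is compile a SQL statement into map and reduce steps. A toy version of that translation for one query shape (hypothetical Python, nothing like Hive's actual planner):

```python
from collections import defaultdict

def compile_count_group_by(column_index):
    # "SELECT col, COUNT(*) FROM t GROUP BY col" becomes one map step
    # (emit the grouping column as the key) and one reduce step
    # (count the values per key).
    def map_fn(row):
        return (row[column_index], 1)

    def reduce_fn(pairs):
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return dict(counts)

    return map_fn, reduce_fn

rows = [("alice", "login"), ("bob", "login"), ("alice", "logout")]
map_fn, reduce_fn = compile_count_group_by(0)
result = reduce_fn([map_fn(r) for r in rows])
# result == {"alice": 2, "bob": 1}
```

The BI analyst only ever writes the SQL; the translation into map/reduce jobs, and the scheduling of those jobs across the cluster, is hidden behind the JDBC/ODBC connection.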

Werner: So JDBC, ODBC, that back and forth, is that Hive or Pig...?

Hive. Pig is a more process-oriented SQL tool; I’m keeping it very simple so that everybody who watches this can understand. Hive and Pig on MapReduce are based on JVMs, Java that executes the logic, so there is a bit of startup time, and it’s more batch-oriented: you kick it off and it runs for maybe a few minutes, half an hour, hours, depending on your data size and what you want to do with it. That opened a need for a quicker way; not all use cases are fine with batch-oriented execution, some require more near-realtime or even realtime queries. Once Hive and Pig had become popular and deployed in production, and BI tools like MicroStrategy, Tableau, etc. talked to Hive through JDBC or ODBC, the response time for the BI analyst sitting there was just too long. So the need for a more realtime query tool came along, and that is where Impala came around; it is not based on MapReduce, it is a different workload.

There is a completely different idea behind it, a different paradigm: it doesn’t kick off gazillions of JVM processes to each execute a little piece of processing of your data. Instead it runs natively on HDFS; it has daemon processes on every node that can execute your query as it comes in. So it bypasses the MapReduce framework, but it also utilizes various very low-level optimizations, for instance LLVM. That is an open source package (I think it originates from Intel) where you can actually rearrange the order in which you execute your commands, so it’s almost like an instruction optimizer: you put the execution commands in the order that is most optimized for that query. So it’s a little more advanced than just bypassing the MapReduce workload.
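The code-generation idea she alludes to can be pictured as specializing the query into one straight-line function up front, instead of interpreting a generic plan for every row. A rough Python analogy follows; Impala actually does this at the native-code level with LLVM, this is only the shape of the idea:

```python
def interpret(predicates, row):
    # Generic interpreter: re-examines the predicate descriptions per row.
    for column, op, value in predicates:
        if op == ">" and not row[column] > value:
            return False
        if op == "==" and not row[column] == value:
            return False
    return True

def compile_query(predicates):
    # "Code generation": build one specialized function for this exact
    # query, so the per-row loop carries no interpretation overhead.
    checks = " and ".join(
        f"row[{column!r}] {op} {value!r}" for column, op, value in predicates
    )
    return eval(f"lambda row: {checks}")

rows = [{"clicks": 5, "user": "alice"}, {"clicks": 1, "user": "bob"}]
predicates = [("clicks", ">", 2)]
fast_filter = compile_query(predicates)

# Both give the same answer; the compiled version skips re-reading the plan.
matches = [r for r in rows if fast_filter(r)]
# matches == [{"clicks": 5, "user": "alice"}]
```

The payoff grows with data size: the compilation cost is paid once per query, while the interpretation cost would be paid once per row.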

Werner: So when you say native you mean it’s actually native code; Impala is written in native code?

It’s written in C++ and it still utilizes various Linux libraries, like the LLVM library for instance, so yes, it executes natively on the platform.

I love the Java crowd, by the way; my previous lives were in Java, I still follow the progress of Java and I’m a Java enthusiast. But Hadoop is not entirely based on Java, and the community is not entirely serving the Java crowd; it’s more about data processing, so it doesn’t really matter how things are implemented. We are trying to solve a bigger-picture problem: we are trying to optimize for realtime queries and batch queries, and the MapReduce framework still comes with a lot of benefits. You can have it running overnight and it’s very robust: it handles failover, it restarts little tasks and optimizes where to place them; if there is a replica of your data elsewhere and one node is busy, it starts executing your logic elsewhere to distribute the workload. It’s very smart and very good. So both are of value; one happens to be written in Java and the other in C++. There is no preference, they are for different end use cases.

Yes, we are all about the community. It is very important, because you obviously can’t have all the smart people in one place. So opening it up means having a standard platform created with many, many brains involved and a lot of innovation contributed. And yes, we have a lot of brilliant people working at Cloudera who are leads and main people in the community, but we want to encourage everybody to participate; that is how you build a solid and future-proof platform.

A very common question I get is: now that you bring MapReduce and Impala, and we also recently integrated Solr, which is the most popular and most widely deployed search engine out there (open source, by the way), when you bring all these workloads to the same platform, how do you make resource management more efficient? How do you control all that, so no workload starves the others, since all of them are executing on the same data platform and you don’t have to move your data around? There is a project called “YARN” that was initiated, I think, two years ago (take that with 80% confidence); it was started some years ago, it has been in development for a while, and it’s getting close to being really, really production ready. It’s still in progress, but we are getting there. YARN is a new way of resource scheduling across the platform. Mainly it came around because other workloads were popping up that needed to utilize the same scalable storage and the same thinking about processing a lot of data in a distributed fashion. So YARN is a new way of resource scheduling across a cluster; that is an interesting piece to keep an eye on. [Editor's note: http://www.infoq.com/news/2013/10/hadoop-yarn-ga ]

It’s very closely connected with HDFS and MapReduce. Basically, it changes the concept from MapReduce being the workload on HDFS to MapReduce being one workload among several, one that asks this scheduling layer for resources, scheduling order, priority, queue and time. That is as easy as I can make it; otherwise we need to spend an hour on it.

Werner: Maybe we can spend 3 minutes on it, we can go deep.

In old MapReduce land, when you started a MapReduce job, the job tracker kept track of all the jobs on the cluster, while each job consisted of many tasks, and there was a task tracker involved as well. Instead of having the job tracker make sure that each job's tasks get the resources necessary to execute where the data is located, YARN came around with a new approach of separating the actual resource requests from the job tracking, or the task tracking if you will. So the job process and the resource process are separated, and there is no bottleneck on resource requests anymore. Does that make sense? I try to make it understandable.
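The separation she describes, a global resource allocator with per-job tracking moved out of it, can be sketched as follows. This is a deliberately simplified model (YARN's actual roles are the ResourceManager, per-application ApplicationMasters, and NodeManagers, with asynchronous container negotiation):

```python
class ResourceManager:
    # Only hands out containers (slots of CPU/memory). It knows nothing
    # about individual tasks, so per-task bookkeeping no longer funnels
    # through one global tracker.
    def __init__(self, total_containers):
        self.free = total_containers

    def request(self, n):
        granted = min(n, self.free)
        self.free -= granted
        return granted

    def release(self, n):
        self.free += n

class ApplicationMaster:
    # One per job: tracks that job's own tasks and negotiates with the RM.
    def __init__(self, rm, tasks):
        self.rm = rm
        self.pending = list(tasks)
        self.done = []

    def run(self):
        while self.pending:
            granted = self.rm.request(len(self.pending))
            if granted == 0:
                break  # in reality: wait for containers to free up
            for _ in range(granted):
                self.done.append(self.pending.pop(0))
            self.rm.release(granted)  # containers returned as tasks finish

rm = ResourceManager(total_containers=2)
job = ApplicationMaster(rm, ["task1", "task2", "task3"])
job.run()
# job.done == ["task1", "task2", "task3"]; rm.free back to 2
```

The point of the split is that MapReduce becomes just one ApplicationMaster among many; Impala, Spark or any other workload can negotiate with the same ResourceManager for the same cluster.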

Well, last I counted it was at least between 13 and 20 different projects to keep an eye on; 12 to 14 are already in Cloudera’s distribution, which you can download; it’s a free distro. [Editor's note: http://www.cloudera.com/content/cloudera/en/products/cdh/projects-and-versions.html ]. Some I would mention that are in the distro today: Oozie, a workflow scheduler that easily helps you schedule a workflow if you want to run multiple MapReduce jobs, like a data ETL pipeline, or do regular Hive queries with some MapReducing in between. And then we have ZooKeeper, which is more of a process management component that handles failover: if one HBase server goes down, someone else should take over.

And then you have Hue, which is a really nice user interface. Many people start with Hue to get a feel for what Hadoop is about; it is very nice and interactive, very simple. You can browse your files in HDFS, kick off a MapReduce job, do a Hive query or an Impala query, do some free-text search, all from the same nice little UI; a very cool UI, by the way. And of course there is more to come; there are projects out there related to the Hadoop community. The Hadoop community itself is booming, so little new projects are popping up everywhere, but they are not yet in the distro; they are still maturing, and we are watching where the community is going, bringing new projects in as needed. Most recently we added Accumulo, but that was just yesterday, so I’ll update you on that next time.

Werner's full question: So I think we’ve covered Hadoop and all the goodness that is there. You talked about batch processing in Hadoop, and you mentioned that in one of your previous lives you worked on Garbage Collection. With that experience in mind: with batch processing, Garbage Collection pauses aren’t really a problem, I guess. Is that correct?

That’s right, because of how most MapReduce jobs work: they kick off a JVM that does a little piece of processing and then shuts down, so you never really reach the filling up of the heap that causes a Garbage Collection in the middle of a data processing loop. But there are other projects that suffer more from Garbage Collection pause times, because they are long-running, in-memory workloads, such as HBase and Solr.

HBase and Solr are memory-heavy workloads and they run on Java; they are Java processes running on a JVM. When you have a long-running process on a JVM that generates a lot of dynamic workload, which both HBase and Solr do, you will experience Garbage Collection. It doesn’t have to be a pain; many customers tune their Garbage Collector and it’s fine. But there are cases where the load hits that pain point where fragmentation becomes an issue.
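For reference, the kind of tuning she mentions typically meant JVM flags along these lines in that era. This is an illustrative example of common CMS settings for a long-running HBase region server, not an official recommendation, and the heap size is a made-up figure:

```shell
# Illustrative HBase-era JVM tuning: fixed heap to avoid resizing, CMS for
# short pauses, an earlier occupancy trigger so concurrent collection starts
# before the old generation fills, and GC logging to watch for trouble.
export HBASE_OPTS="-Xms8g -Xmx8g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails"
```

CMS postpones the fragmentation problem she describes next rather than eliminating it, which is exactly why such workloads eventually hit the compaction pauses discussed below.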

Werner: So fragmentation of the heap is that really a problem like a…

That is The Problem with all Garbage Collection in my mind. In my opinion, fragmentation is the root of all the evil that comes with Garbage Collection.

I’m almost allergic now, because I have ten years of Garbage Collection in my past. Copying Garbage Collectors were kind of eliminated quite early on, but you are right on the point that moving data is the problem even in Garbage Collection land, not only in Hadoop land or Big Data land. The move of data is always costly, even in the heap, because you have to make sure that nothing is modified, nothing is mutating, when you move an object. And eventually, when you have a fully fragmented heap, you have to compact the whole heap, because compacting just a portion won’t free up enough memory; that full compaction is a stop-the-world operation, and that is your long Garbage Collection pause right there. And it’s about moving objects together.
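Why fragmentation forces a stop-the-world compaction can be shown with a toy heap model (illustrative Python, not a real collector):

```python
def allocate(heap, size):
    # Objects need contiguous space: scan for a free run of `size` cells.
    # Returns the start index of the hole, or None if no hole is big enough.
    run = 0
    for i, cell in enumerate(heap):
        run = run + 1 if cell is None else 0
        if run == size:
            return i - size + 1
    return None

def compact(heap):
    # Stop-the-world compaction: slide all live objects together.
    # Every object may move, which is why mutation must be stopped.
    live = [cell for cell in heap if cell is not None]
    return live + [None] * (len(heap) - len(live))

# Toy heap: one cell per unit of memory; None means free. Four cells are
# free in total, but every hole is size 1: the heap is fully fragmented.
heap = ["a", None, "b", None, "c", None, "d", None]

assert allocate(heap, 2) is None   # enough total memory, no contiguous hole
heap = compact(heap)               # ["a", "b", "c", "d", None, None, None, None]
assert allocate(heap, 2) == 4      # after compaction the allocation succeeds
```

The failed allocation despite plenty of total free memory is exactly the fragmentation pain point; the `compact` call is the stop-the-world pause.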

Werner: But it’s not actually a Garbage Collection, it’s just the same thing as a Garbage Collection: you basically have to stop the world and gather up objects.

Actually, compaction is a phase of the Garbage Collection; it’s usually just a portion of the Garbage Collection that does the compaction. But when you can’t compact enough, you have to do Garbage Collection after Garbage Collection, and when you see multiple Garbage Collections after each other in your log, that is a sign that you are about to run out of memory: the Garbage Collector is trying to free memory but can’t keep up with your allocation rate. When you have back-to-back Garbage Collections and then finally an OutOfMemoryError is thrown, that is a sign that you couldn’t compact enough to free up space for the new objects coming in. What the Garbage Collector sometimes does in the final phases of trying to free up memory is decide on a full compaction; that is still part of the Garbage Collection work, just a phase, and one of the last ways out to get back on track: let’s stop everything and just compact if possible. And if you can’t even do that, then you get an OutOfMemoryError.

I was really excited about thread-local Garbage Collection. A former coworker (I keep track of my coworkers) published a paper on thread-local Garbage Collection that intrigued me [Editor's note: Eva provides some links: http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/local-gc.pdf and http://www.google.com.br/patents/US8024505 ]. I hadn’t been that intrigued in quite some time; that was maybe two years ago. I don’t know if it has become reality in a commercial JVM yet; maybe it has, maybe someone claims it does, but it’s that paper I’m intrigued by, not marketing pitches about what Garbage Collectors do or don’t do. The second thing is of course the progress of Java 7 and Java 8 and what is happening at the JVM level around support for more dynamic languages, though that’s less about Garbage Collection. I’m looking at the Nashorn project, which has a lot of cool compiler optimizations, code optimizations, code generation optimizations for JavaScript for instance, but it will benefit other dynamic languages as well; it’s pretty impressive what those guys are doing. Garbage Collection-wise, I’ve heard a lot about G1; I haven’t seen that much yet. I’m still hopeful.