Stefan's Blog » Ari Zilka
Big Data Musings From Datameer's CEO

Special “Use Cases” Big Data & Brews: Hortonworks, Pivotal, MapR

We’ve had a number of guests visit us since we’ve kicked off Big Data & Brews and one thing that I always like to ask them about is what kind of use cases they are seeing with their technology. Tune in below to hear what Ari Zilka of Hortonworks, Milind Bhandarkar of Pivotal, and Tomer Shiran and Ted Dunning of MapR had to say.

TRANSCRIPT:

Ari Zilka, Hortonworks

Stefan: What are some of the most exciting use cases you see with your customers, especially around those integration scenarios you just mentioned?

Ari: There are some very mundane use cases that are exciting to the customers. Let me start there, like operational data store collapsing. I find most customers have hundreds to thousands of copies, if you’re talking about a big Fortune 500 company. They will literally have thousands of copies of the same data, for compliance reasons usually. Different teams each have a piece of different pieces of data and they want to silo them from each other to have zero regulatory risk.

Our data lake architecture allows us to basically say, “You know what, you may have 3,000 files in the Hadoop data lake but at least it’s just one data lake.” Sometimes it turns out we can collapse the redundancy down to really just three copies at the lake. Usually you can, because we can do column-level encryption with the columnar store and we can say, “This guy, based on log-in, is allowed to decrypt the column, this guy is not.” We do a lot of format-preserving encryption. You can use the column to do analytics, it’s just not the real value … like a Social Security Number, it’s not the real one. It’s just a unique one for processing.

The mundane use cases are collapsing out all these redundant copies of data, which saves people a lot of money. The sexy use cases are data science. I just spoke to our Head of Data Science before coming to meet you and he found … for example … I can’t say which customer this is with [00:08:00] but in health care he built a new algorithm in Hadoop. He on-boarded all these structured data sources, forget Hadoop for unstructured data. This is all databases and then some Office-style documents like PDFs of X-rays and things like that from hospitals.

Stefan:Maybe doctor notes, who knows?

Ari: You bring all this data in. You do some natural language processing. You do some pharmacy history data analysis, and he found that he could literally write an algorithm that found the relationship between diseases and drugs. These are the drugs people take and these are the diseases they have and this is the time correlation between them, and he started asking questions of this health care provider’s doctor.

He said, “There’s something very strange going on. It seems like anyone who has atherosclerosis, calcification of the arteries … There’s a very strong correlation between them and taking flu medication.” The doctor said, “What are you talking about? That makes no sense.”

They started looking into it and a week later the doctors came back and said, “You know what, the calcification gets really, really bad for someone on the flu. The arteries hardened for a short period of time even worse than when they were hardening in the natural state of that person’s health before and they will soften again a little bit after the flu passes but during the flu the arteries lock down and it’s an immune response and the person is at huge risk of heart attack.”

We figured out together with this health care provider that you could actually say, “If someone has atherosclerosis and they come to you with the flu you need to treat that as an emergency situation, because if you don’t get that flu dealt with they could have a heart attack. More importantly, anyone with atherosclerosis should just get a flu vaccine every year [00:10:00] and not be at risk of catching the flu.”

Then he found something much more interesting, which is HIV positive patients were taking prescription only hyper fluoridated toothpaste in huge volumes and no doctor could tell him why and it turns out that the drug cocktails to treat HIV weaken the teeth. Dentists stepped in and prescribed hyper fluoridated toothpaste.

Stefan:Interesting. Isn’t it wonderful what you find out about it … things they don’t.

+++++++++++

Milind Bhandarkar, Pivotal

Stefan:Now you’re working on this and you’re one of the really really early guys in the Hadoop space, and you’re working on this. Where was that moment when you had to pinch yourself, “I can’t believe people are doing this with software I wrote so many years ago?”

What’s the most amazing use case you saw?

Milind: First, it was not a use case. Once I started doing Hadoop evangelism outside of Yahoo! … by the way, the first Hadoop tutorial delivered anywhere was at ApacheCon in 2008 or 2009, I forget which, but it was in New Orleans.

Stefan:I was there.

Milind: It was sponsored by Cloudera. Christophe, Aaron Kimball, Tom White, all of those actually were …

Stefan:Didn’t we go in the evening? Anyhow …

Milind: The French Quarter thing, let’s push that out. Really, it was at USENIX the year after that. My tutorial proposal got accepted there. Next door there was a Solaris performance tuning tutorial going on with Richard McDougall.

Richard McDougall has written Solaris performance tuning books, all about Solaris. A really great guy. His tutorial had like six people attending it and my tutorial next door had something like 30 people attending. That’s the point where I basically realized …

Stefan:Something is shifting.

Milind: Something is shifting, exactly. USENIX 2009, this was in San Diego. Among the attendees in my tutorial there were three people representing all three different … That was basically, “Okay, what have we done?”

Stefan:Saving the world.

Milind: Saving the world, yeah. Recently actually, my daughter took part in the Synopsys Science Fair, here in the South Bay. I went to drop her off there and I took a look at what all the kids were doing, from 7th grade to 12th.

There was actually a kid from the 7th grade, who did … what was his title? “Effect of number of computers on computation time.” He basically took a MapReduce job and said, “If I run this on three machines, if I run this on ten machines, if I run this on 20 machines, how much does the computation time change?”

He discovered that it goes down for some time and then it basically goes back up. This was all done using Amazon AWS and Hadoop. I actually was tempted to make him an internship offer right there. I don’t know about underage recruiting or anything like that.

+++++++++++

Tomer Shiran, MapR

Stefan: That’s pretty amazing, it’s really international. What a lot of people are doing, is it different, like what people do in the US versus in Europe, use-case wise, or are there buckets?

Tomer:I’d say the US is probably still ahead in terms of the maturity of the customer base, although we have actually a significant number of customers in these other countries. One great use case in Japan is, it’s actually a beverage company, so this is one of the biggest companies, beer and whisky in Japan.

Stefan:Oh nice.

Tomer: They have some pretty cool use cases, so I think you would get the standard kind of marketing use cases that people do with Hadoop. They have all of those, but they also have these really cool vending machines, where they are doing image recognition and they have a video camera that’s looking at you and kind of recommending a beverage to you when you walk up to it.

Stefan:Based on what they used before.

Tomer: I think they look at your image and compare it to other people that had similar characteristics, things like that. So it’s a pretty cool use case.

Stefan: That’s so Minority Report, where you basically, based on your, what was it, retina scan, get the advertisement. I guess we live in the future. That’s amazing.

Tomer:Yeah, we do.

+++++++++++

Tomer Shiran, MapR

Stefan: What are some of the use cases you are seeing with the customers, like what’s your favorite one?

Tomer:My favorite one is actually, it’s not going to be one of the more popular use cases but the Aadhaar project in India actually so.

Stefan: I think that’s just an enormous project.

Tomer: Yeah, it’s a really cool one too and it’s really valuable in terms of what it’s doing in that country. So India has over 1 billion people living in the country, it’s something like 1.25 billion people. And one of the challenges there is that about half of the population doesn’t even have an identity. There is no Social Security Number or anything like that, and that prevents these people from opening bank accounts, it prevents them from getting medical care, it prevents them from getting government services, government aid, things like that. It also encourages a lot of fraud in the overall system, right.

So what the Aadhaar project is doing is it’s basically building the world’s largest biometric database, and the idea is to provide every resident of India an identification so they can get government aid and medical services and open a bank account and do commerce and things like that. And it’s lifting a lot of people out of poverty.

Stefan: It’s fantastic.

Tomer: I think it’s up to about 750 million people already in the database, I think it’s about 10 petabytes of data. So for every person you have the photo of the face, you have the ten fingerprints, the two iris scans. So you have all that information for every one of these people, and it’s not just collecting that information and storing it, it’s also enabling every point of service in India to be able to verify that identity, because now you need the bank and every other service provider to be able to check your identity. So that’s a system that needs to respond within 200 milliseconds at very, very high load in terms of transactions per second. So we are really happy that we are powering that from the backend, from a Hadoop and database standpoint. So that’s one of the projects I’m most excited about.

+++++++++++

Tomer Shiran, MapR

Tomer:I think advertising and marketing are pretty common use cases across the board, and it …

Stefan:And is that more in the ad companies or is it more kind of the traditional big companies trying to understand their customers or?

Tomer: It’s actually both. So you look at some of our customers like the Rubicon Project, which is the largest ad exchange in the US in terms of audience reach. And they are doing 90 billion auctions every day. And each of those auctions has probably a dozen or more bids, so with all these bids we are talking about trillions of events every month that are processed in the cluster, in their MapR environment, and they predict the prices that the auctions are going to go for and all sorts of things like that.

But then if you look at many other customers that we have across telco and retail, these are customers that have tens to hundreds of millions of end users or end customers, and they are doing everything from better ad targeting to churn analysis, all those types of use cases.

Stefan: What kind of product enhancements does that drive for you guys? You touched a little bit on the lower latency requirements, but where do you really see Hadoop as of today? You said it’s expanding a little bit into new real-time-ish production use cases; what other functionality dimensions are driven by those use cases?

Tomer: I think it’s the customers that are doing these things. And I think you mentioned earlier how you see a lot of our customers doing big deployments that are really impactful to their business. When a company wants to do that, they need a set of enterprise-grade dependability characteristics. So they want true high availability, one that self-heals automatically. They want real consistent snapshots, they want disaster recovery across data centers. So we have a vendor now who says, we have those things and we’ve added those things, we’ve caught up with MapR. But there is a difference between building those into the architecture and doing something for a checkbox.

Stefan: Is it like, oh yeah, we love pizza and we just put something together on the fly.

Tomer: So, let’s take the example of snapshots. At MapR we’ve provided snapshots from day one, much like you would see in an enterprise storage system or an enterprise database, the ability to go back in time. Let’s say a user accidentally deleted data, or you had an outage and you wanted to go back to a consistent point in time. It is something that enterprises expect; you wouldn’t buy a storage system or a database if you couldn’t go back and do point-in-time recovery.

MapR is the only Hadoop distribution that provides that from a Hadoop standpoint. And our competitors, they’ve tried to add that to HDFS, and the result is really inconsistent snapshots, or what they sometimes call fuzzy snapshots, but people don’t …

Stefan:That’s a really nice marketing term, by the way.

Tomer:It’s great.

Stefan:It’s a fuzzy snapshot.

Tomer:It’s more or less consistent, it’s sometimes consistent.

Stefan:Let’s hope it is consistent.

Tomer:Let’s hope it is.

Stefan:The whole thing just crashed, let’s hope.

Tomer: And as the Hadoop market has matured over the last year and will continue to mature over the next year or two, people stop buying those arguments. They don’t compromise when they buy a storage system or a database. No, they are not going to compromise when they buy a Hadoop environment.

+++++++++++

Ted Dunning, MapR

Stefan:Let’s come back to the recommenders. What are the kind of use cases you see people using?

Ted: Recommenders are just amazingly ubiquitous lately. A friend of mine, co-contributor Robin O’Neil, just recently showed me that the new Google Maps is almost entirely based on recommenders. There’s way, way too much stuff to show on any map. There’s a massive amount of stuff. You wouldn’t be able to read it. What it does, based on what you’ve done lately, what you’ve clicked on, what you’ve typed in, is select which things it wants to show you. In my talks now, I show what happens if I search for a restaurant by name [00:10:00] near our office. It shows all these restaurants in the same price range, roughly the same cuisine, and then I search for MapR, where our office is. All the restaurants go away and all these high-tech offices show up on the map. It knows which sort of thing I’m doing.

In fact, the demo used to work better, because now it knows that I search for restaurants in that neighborhood. So it starts showing.

Stefan: It makes this the technology in the restaurants.

Ted: That’s right. It’s already learned some aspects of what I like. There are many, many things that it does. It will de-emphasize roads if you’re a bicyclist or you’re on mass transit, and countervailing approaches can be done too; at different scales it might show you more roads.

Stefan: Is there data privacy issue with this?

Ted: There are data privacy issues everywhere and people really, really don’t recognize how ubiquitous they are. Google has a pretty darn good track record and they’ve taken a lot of care. Ultimately, just like search histories, those are sensitive, and even though they’re not legally sensitive, Google is doing the right thing in treating them as very sensitive and being pretty careful. We’ve used Google as a partner with Google Compute Engine and I’ve been very impressed; for instance, the disks that you get on the virtual instances are encrypted by default. In fact, I don’t know how to defeat that. If you just do a toy example, the data at rest is encrypted.

Maybe you could do better key management so it’s always changing or whatever, but at least the zeroth, simplest case is done well. So yeah, I think there are always issues about privacy. There are really subtle things you can do with big data that have a lot of value, but those things can also be used to invade privacy.

Special “How It Works” Big Data & Brews: On Spark, HBase and Project Stinger

We had a few more great chalk talks that I wanted to share in this week’s special “How it Works” Big Data & Brews. Pull up your favorite brew and hear about some of the cool Hadoop open source projects from former Quantifind CTO Erich Nachbar, who shares how Spark works, Michael Stack about HBase, and Hortonworks’ Ari Zilka about Project Stinger.

TRANSCRIPT:

Stefan: Tell me more about Spark. You spend a lot of time with Spark, so help us understand the most important architectural moving pieces. Why do you like Spark? If you’d like, there are markers and a chalkboard, so feel free. So how did you get to Spark, and what is exciting about it?

Erich: I met Matei, who is the main contributor and founder of Spark, at one of the Hadoop meetups where he was showcasing his project. He was a grad student at UC Berkeley and that was his project. At that time it was, and still is, small, somewhere on the order of 15,000 lines of Scala code. He thought, what if I could give people the option of either processing my data similar to Hadoop, where I load something from disk, do the processing and spill it back.

Or it can say, hey, you know what, if you have enough RAM you can also just say, hey, cache this particular dataset in memory and then apply these operations in memory. It works both ways, and what he did to ease the transition is he built everything on top of HDFS as the basis, so you can use any input format as the [00:10:00] source.

Stefan: It’s a Hadoop input format to get E [inaudible 00:10:04].

Erich: Correct so they can process [inaudible 00:10:07] files, but the point is that in our case for example we run this on MapR, but any Hadoop history [inaudible 00:10:15] would actually do and then on top of it is typically …

Stefan: Why are you using MapR?

Erich: Well if you look at the other distributions, MapR I think has a kick-ass file system. You can NFS-mount it on the data scientists’ boxes and just access the data. It’s obviously limited to the speed of the network interface, which is not suitable for large amounts of data, but it’s good enough if the data scientists say, “Hey, I want to just poke at this data, load it in R and then play around with it.”

Stefan: Sorry, I interrupted. You run it via HDFS and then the input format.

Erich: Yeah, and so what you would do is co-locate, in good old Hadoop fashion, the data nodes with the Spark workers. What you would have, and this is weird because I’m mixing physical with logical, is you would say this is a data node, run me a Spark worker in here, and so you would get data locality. With the input format and the IP address it would actually find out, so you would get the same locality advantages that you would have with Hadoop and …

Stefan: Spark then directing the data the Spark will flow directly to [inaudible 00:11:37] of the data node or how is it getting data load covered [inaudible 00:11:42].

Erich: Correct you would install them on that same box.

Stefan: Okay, then you’re accessing HDFS in the same way. Let me think about this, how would you integrate with the name node?

Erich: The name node is always…

Stefan: Is it bypassing just straight going off in the …

Erich: Yeah, but the name node is really just [00:12:00] … it would find out where the blocks are located and then it would schedule the jobs under it. It works very much like what the JobTracker would do.

Stefan: Yeah, okay.

Erich: It’s pretty much the same thing.

Stefan: Then they just implemented a fake JobTracker or …

Erich: It’s still encapsulated. You would only use the file system; you wouldn’t use any of the JobTracker or TaskTracker pieces of that at all.

Stefan: Because I think the [inaudible 00:12:18] itself by Hadoop are not going to be open for this way to access data locality I guess I just take the whole thing.

Erich: I think you can actually get you can get to it I thought so yeah.

Stefan: All right.

Erich: It seems to work, yeah, and you have the SparkMaster, it has a little web UI that …

Stefan: Then you push your jobs to SparkMaster and it distributes to all, the Spark…

Erich: The rest works exactly the same. One of the big differences they have that is very impressive is if you look at …

Erich: One of the cool things that Spark actually does is, when you’re in Hadoop land you have a jar that is your job and it gets pushed out to all the nodes, and if you have a larger job that causes a certain overhead…

Stefan: Yeah.

Erich: What Spark does is, let’s say you have a map operation, so I have a collection. Map, because I’m running a map operation on it. Let’s say this is my record, so there’s this color-coded record, and I’d say record x 2; since it’s functional programming, this result would actually be emitted as the result of that operation and …

Stefan: Any closures.

Erich: What it actually does is serialize only the byte code for that [00:14:00] closure and push that out to the nodes. It was so interactive that for prototyping purposes we actually drove our customer front end through running jobs.

Stefan: Okay.

Erich: Starting a job takes maybe half a second, even less than that, because it’s efficient about what it actually pushes out to the nodes. Then when you have it in memory, you really just run whatever it is over on the nodes.

Stefan: You kept all the data in memory all of the time then?

Erich: Yeah I mean clearly this is if you have a big CPU problem this is great and if you can afford the RAM.

Stefan: Right.

Erich: If you have plenty of …

Stefan: [Inaudible 00:14:38].

Erich: Petabytes of data…

Stefan: It’s all right.

Erich: Exactly.
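For readers who want to see what Erich is describing in code, here is a minimal sketch using Spark’s Java API (assuming Spark 1.x or later with Java 8 lambdas). The HDFS path, the integer records and the doubling operation are hypothetical; the point is the cache-then-map pattern and the small closure that gets serialized out to the workers instead of a whole job jar.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheSketch {
    public static void main(String[] args) {
        // Local master for illustration; on a cluster this would point at the Spark master.
        SparkConf conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read through the Hadoop input-format layer (HDFS, MapR-FS, ...) and pin the
        // dataset in memory so repeated passes avoid re-reading from disk.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/records.txt").cache();

        // These small closures are serialized and shipped to the workers holding the data,
        // rather than pushing a whole job jar for each run.
        JavaRDD<Integer> doubled = lines
                .map(line -> Integer.parseInt(line.trim()))
                .map(x -> x * 2);

        System.out.println("records: " + lines.count() + ", sample: " + doubled.take(5));
        sc.stop();
    }
}
```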

+++++++

Stefan: For the people that don’t know HBase high-level, how is that working? What’s the difference between …

Michael: He wants me to draw on the board. He’s going to regret it.

Stefan: I can fill your glass more up, then throwing it together.

Michael: I don’t know if you’ve read the big table paper. We’re pretty much the same.

Stefan: Most likely people seem distant.

Michael: It’s that rare commodity. It’s a well-written [04:00] paper.

Stefan: It’s actually true.

Michael: It’s actually understandable.

Stefan: It’s actually the whole reputation of HBase fast paper was actually written very well.

Michael: It’s nothing more than a big table, like your Excel table, and then you have rows. Except this goes to like billions, and then what happens is you take this table and you break it into pieces, and then this piece you put on a server of some kind, a region server. I think this is going to go bad.

Stefan: No, it’s good. It’s awesome.

Michael: You can have many of those. Then each one of these regions, there can be many of those. You could have one region, or this could have like hundreds of regions.

Stefan: Per region server.

Michael: Per region server.

Stefan: Do you add more data in between or do you just append like in an HDFS file system? Can you insert, so to say?

Michael: I suppose that’s where HBase comes into play. We add the random read-write to …

Stefan: You can basically update individual rows, and you can add things.

Michael: You know, like small. HDFS or even MapReduce is usually talked about for doing terabytes, being fluid, streaming through loads of stuff. What we add to the family is the random lookup of little bits.

Stefan: They say in general a queen or master server can manage all of this?

Michael: I never heard it called a queen. I think I’m going to call the queen for now. We have a master process [06:00] that coordinates all the region server processes.

Stefan: I assume regions are then replicated between multiple servers.

Michael: The thing is we run actually on HDFS, right? We just write to HDFS.

Stefan: So HDFS is taking care of the replication?

Michael: As you know HDFS it does the replications, so when we write we write to three replicas.
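To make the random read-write point concrete, here is a minimal sketch against the HBase Java client API. The table name, column family and values are made up for illustration, and the configuration is assumed to point at a running cluster’s ZooKeeper quorum.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random write: update a single cell in one row. The row key determines
            // which region (and therefore which region server) owns the write.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Berlin"));
            table.put(put);

            // Random read: fetch that row back by key, with no scan of the whole table.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] city = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}
```

Underneath, the region server persists these edits to HDFS, which is where the three-way replication Michael mentions comes from.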

++++++

Stefan: Tell me a little bit about your work at Hortonworks, what’s the most exciting project that you guys are working on right now?

Ari: The most exciting stuff we’re working on right now is inside Project Stinger. There’s two exciting things. I’m going to erase this diagram or try to.

Stefan: It’s style if you have multiple layers.

Ari: Inside Project Stinger, there’s two really exciting things we’re doing. One is on the storage and access to data layer. It’s ORC files and vectorization. Super exciting and we’ll help anyone …

Stefan: That’s Owen’s child?

Ari: Actually, no…

Stefan: No?

Ari: He never wants to admit it but ORC is called Optimized RC, in reality, it’s Owen’s RC, but he’s not arrogant so he doesn’t buy that.

Stefan: No, he’s not. I’ve known Owen for a while.

Ari: Yeah. On the storage site, it’s ORC files and vectorization on top of those ORC files. Then on the factoring side, on the architecture side, it’s projects Tez and YARN.

Stefan: Now Doug Cutting would come in and say, “Look, you know, I can do that much better with… [00:12:00].

Ari: Sure. At the end of the day, ORC is a format contributed by Microsoft’s super geniuses inside the PDW team. These are actually guys who have spent in some cases 30 plus years in data storage formats for relational workloads. ORC has some things that are just superior to anything else on the Hadoop platform right now.

Stefan: Like?

Ari: Like block level indexes and type-aware indexes. It’s a columnar format. I’m not going to bother to sketch up … maybe I should.

Stefan: Yeah.

Ari: But I mean, if you have a block like this, you really want to store all column one values and then all column two values and then all column three values. Typically, you call this a columnar store. Most columnar stores, all columnar stores, do that. That’s not interesting.

Advantages you get is you can take column one and compress it because now you have more …

Stefan: Most likely better compression.

Ari: Most likely better compression because you have more consistency across value space.

Stefan: Luckily, very frequently Hadoop files are sorted, so even better compression.

Ari: Yeah, exactly. Like you take a web log and this would be IP address. This would be access port, this would be browser type. Browser type is going to be what now? It’s going to be Chrome or what’s it called? Chrome or Safari or Internet Explorer and that’s it.

Stefan: No Internet Explorer anymore.

Ari: We’re columnar and we can actually handle columns where the individual record sizes are quite large. We’re type-aware, which means you tell us if this is an int or some kind of numeric value or a long. You tell us that this is a string [00:14:00] and then we start doing things that are type-aware. We are SQL and Java type compliant, which is superior to anything else, but then we have an index as part of each block.

Stefan: Even now you get the block much faster?

Ari: Yeah.

Stefan: How big is the block size?

Ari: It’s configurable but a good block size is like a gigabyte.

Stefan: Okay. Well then it makes sense to have an indexing file.

Ari: For example a string index would be a dictionary. We dictionary encode strings and we write all the unique strings up in the index. We compress them down and then we write an integer look-up.

Stefan: Yeah, makes sense.

Ari: If you have things like URLs, they repeat a lot. The URLs are actually going to just be URL 1, URL 17, URL 22, URL 33. Integers we’re going to have min, max, average, things like that.

Stefan: Per block?

Ari: Per block. We’re going to have start date, end date.

Stefan: Okay. It makes it small and really fast.

Ari: We find it. We do it and then what you can do then is you can basically use the index to skip a block. When you’re doing filters and aggregations, you want to basically say, “There’s a where clause. There’s a query predicate. I can apply the query predicate to the index.”

Stefan: Right. You don’t even need to …

Ari: Right. I don’t even need to hydrate the rows or columns at all.

Stefan: That’s the slow part, the deserialization, the reflection.

Ari: Yeah. This is row one, this is row two, this is row three. Vectorization replaces the classic Java loop: with my connection, I get a result set; while the result set has next, read resultset.field1, field2, field3, next result. That while loop kills performance because you’re iterating through a result set and you’re paging data from various large RAM pools into processor L1 cache.

What we’re doing in [00:16:00] vectorization is saying, “Leave this block dehydrated, flattened, unmarshalled as a block. You can look inside the index as much as you need to. When you do a pass across the block, vectorize the query predicate.” Turn the query predicate into scalars and then pass it across this as a mask. You’re basically looking at a giant set of ones and zeros.

Stefan: You basically just overlaying …

Ari: You’re saying, “I’m looking for the following pattern, 1011.” It says, “There is 1011.” Then you say, “Oh. Well, I want 1011 this way.” It says, “Okay. I see here right here. It’s row 23.” Then you pull out row 23.

Stefan: Then you’ve even done these with [inaudible 00:16:44]?

Ari: I’m moving a gigabyte at a time through the data. Obviously L1 caches are on the order of megabytes. The move of a gigabyte through a one-megabyte or 16-megabyte L1 cache will take you hundreds of clock cycles or a few thousand clock cycles, which means it will be done in under a second. We literally found a single laptop could manage a terabyte search in under two seconds with vectorization.

The ORC file is tied to the vectorization and the ORC file is tied to a lot of stuff we intend to provide to technologies like Datameer on top of us which is this global level index. If we take these indexes, you can use them now to show people ontologies about their data.

I have a column. I know its name. I know its type and I know its value range. I can bring the dictionary forward into Datameer. I can bring the integer min/max and date time-stamped values into Datameer and that becomes dimension data or ontological data that’s very interesting to the end user. Even though the system doesn’t know what it is, it’s very interesting to the end user.

I can use it to speed up searches because I can skip blocks. [00:18:00] Really what we’re talking about is take that index, centralize it to the entire table space or data set if you will, and then proffer that up to anyone who wants it. Now you can build systems that actually service queries without ever looking at a record at all.

By the way, you paid no price to compute it except on ingest. You laid it out per block.
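As a rough sketch of the ingest side Ari describes, here is what writing such a columnar, type-aware file can look like with the ORC core Java API (the standalone library later split out of Hive). The web-log schema and values are assumptions for illustration; the dictionaries and min/max statistics Ari mentions are maintained by the writer as batches are added.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWeblogWriterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Type-aware schema: the writer knows which columns are strings and which are
        // numeric, so it can dictionary-encode the strings and keep per-column statistics.
        TypeDescription schema =
                TypeDescription.fromString("struct<ip:string,port:int,browser:string>");

        Writer writer = OrcFile.createWriter(new Path("weblog.orc"),
                OrcFile.writerOptions(conf).setSchema(schema));

        // Rows are appended in column-oriented batches; the resulting file carries
        // indexes (dictionaries, min/max) that readers can use to skip whole blocks.
        VectorizedRowBatch batch = schema.createRowBatch();
        BytesColumnVector ip = (BytesColumnVector) batch.cols[0];
        LongColumnVector port = (LongColumnVector) batch.cols[1];
        BytesColumnVector browser = (BytesColumnVector) batch.cols[2];

        for (int i = 0; i < 1000; i++) {
            int row = batch.size++;
            ip.setVal(row, ("10.0.0." + (i % 255)).getBytes());
            port.vector[row] = 80;
            browser.setVal(row, (i % 2 == 0 ? "Chrome" : "Safari").getBytes());
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        if (batch.size != 0) {
            writer.addRowBatch(batch);
        }
        writer.close();
    }
}
```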

Stefan: That was going to be my question, how does that impact write performance?

Ari: You’re packing a gigabyte block. You’re consuming some extra memory and you’re type-aware so you’re marshalling the types on write down into disk. Our write performance for an ORC file needs to get better but right now we have tuned it so it’s faster I think to write an ORC file than an RC file for example.

Stefan: That wasn’t great. It’s always a question of who you are performing for… [ 00:18:58].

Ari: Well, we’ve benchmarked ORC file writing against anything we can get our hands on. We find the best things out there. We’re competitive with. Well, I don’t even want to name specifics but I don’t see a problem. I see a problem in performance relative to an absolute number I’d like us to get to so we can write tens of megabytes a second across a cluster. I want to write hundreds of megabytes a second but no one is writing that fast right now.

Big Data & Brews: Ari Zilka, CTO of Hortonworks on Larger Hadoop Ecosystem & Vision

TRANSCRIPT:

Stefan: Welcome back to Big Data & Brews with Ari from Hortonworks. Ari, we’ve got a community question.

The community question, if I can rephrase it slightly, what are you bringing to the table as a new CTO of Hortonworks? What’s really your focus?

Ari: What I bring to the table as Hortonworks’ CTO is the combination of the inbound-outbound focus. I think our CTO level focus in the past was more on the future road map of the core technology, so working with the academia, working with start-ups, working with big established software and hardware vendors on where can Hadoop go over the next 10 years.

I’m focused a lot less on that and a lot more on what are people trying to do with it now and what are gaps, because I see Hadoop as … It’s interesting I met with a customer who called folks like you and me Hadoopers, people who are inside the circle of trust, if you will, and who understand all the terminology and it all makes sense to us. We live and breathe it every day and ideally we’ve used it … for a while in anger.

You have the Hadoopers and then you have mainstream enterprise and when Hadoopers talk Hadoop there’s this death by a thousand paper cuts, things that are just understood. They are painful but they are lived with and accepted. When you show those to the mainstream they say, “That’s unacceptable. I won’t even adopt it.”

Stefan: Like the Writable stuff and the Serializable interface. Or no way to override that communication protocol?

Ari: Right, or security is inherited from file-system-level, POSIX-level concepts, constructs, and that’s all you’ve got.

Stefan: Good luck with that.

Ari: You don’t actually have ACLs, you just have user, group and other-

Stefan: Right [00:02:00], and if you’re in health care and there’s legal requirements for you to do that, who cares?

We talked … we went really deep, so help me a little bit more to paint a picture of your overall ecosystem. What are you guys doing? What’s really cool about your platform?

Ari: From an ecosystem perspective that’s a great question. What we do that we feel is unique is we build a platform for others to build data applications on. We are the data management platform.

I like to think of us as Amazon EC2, if you will. We want to get it right in terms of what services you need to assemble platforms, so we treat ISVs and end users equally, as opposed to other platforms which say, for example, Hadoop is fine except for the implementation. It’s too slow, so I will build a faster implementation that is purely an IT sale. Hadoop is Hadoop, get what you want from Apache and run it on my new runtime.

There are folks who I compare to a BEA / JBoss, where the difference is only the container but they all implement the same standard. Then there are folks who say, I derived from Hadoop core, so I will build a database company or an analytics company on top of Hadoop core technology and you will consume my packaged product, which happens to use things like HDFS, maybe MapReduce, maybe not, maybe USI … all the different components from the community project ecosystem to build a particular product like a database.

You essentially have database vendors, you have container optimizations for IT, and then you have Hortonworks, which is trying to build a general-purpose data platform for both ISVs and end users to do analysis and build tools to do analysis on. That [00:04:00] leads to a difference in the ecosystem around us versus everyone else. First of all, we have a lot of the big vendors, instead of competing with us, aligning with us.

Stefan: That would be Microsoft?

Ari: Microsoft, Teradata, SAP, Red Hat, Rackspace are all already aligned with us. Folks like Oracle and IBM work with us every day because customers pick the combination of their superior databases with our superior Hadoop platform and say, “I want Hadoop from Hortonworks and relational from Oracle and you must work together.” We work together fine.

The products are integrated. You go out to market and you get this view that perhaps they have a particular alignment, like IBM ships its own Hadoop or Oracle ships something inside the big data appliance and that’s all they support. They support anything, but then you have vendors who have premier integrations, such as Teradata or SAP or Microsoft. Microsoft’s cloud, [inaudible 00:05:00] cloud, is built on top of Hortonworks.

SAP HANA bridging to Hadoop which is a really sexy type of capability. You have terracotta-style low latency fast access to structured data through HANA but you have Hortonworks working as a data lake underneath HANA preparing, cleansing, and materializing data into HANA as fast as you need it to. Sometimes the other way around HANA’s preparing it to load into Hadoop for long term storage but you have this two-way flow of data where HANA’s the low latency layer and we don’t have to solve that problem, meanwhile, Hadoop is a data lake underneath it.

Then you have Teradata where you have them handling big data at low latency and random access and us handling big data at medium latency and non-random access but batch access and we put the two together and say, “Okay, well, again we become the data lake for Teradata, slightly different from HANA to SAP.” Where with Teradata we’re saying, “You’ve built [00:06:00] an ecosystem of tools around Teradata, keep that in place but you want to grow your analytics capabilities without growing your entire warehousing footprints.”

Let’s bring those warehouse images on to Hortonworks data lake for long term retention and joining with new interesting analytics and new data sets and new work loads and do new analytics in the lake and do existing infrastructure, leave it in place.

Stefan: What are some of the most exciting use cases you see with your customers, especially around those integration scenarios you just mentioned?

Ari: There are some very mundane use cases that are exciting to the customers. Let me start there, like operational data store collapsing. I find most customers have hundreds to thousands of copies, if you’re talking about a big Fortune 500 company. They will literally have thousands of copies of the same data, for compliance reasons usually. Different teams each have a piece of different pieces of data and they want to silo them from each other to have zero regulatory risk.

Our data lake architecture allows us to basically say, “You know what, you may have 3,000 files in the Hadoop data lake but at least it’s just one data lake.” Sometimes it turns out we can collapse the redundancy down to really just three copies at the lake. Usually you can, because we can do column-level encryption with the columnar store and we can say, “This guy, based on log-in, is allowed to decrypt the column, this guy is not.” We do a lot of format-preserving encryption. You can use the column to do analytics, it’s just not the real value … like a Social Security Number, it’s not the real one. It’s just a unique one for processing.

The mundane use cases are collapsing out all these redundant copies of data, which saves people a lot of money. The sexy use cases are data science. I just spoke to our Head of Data Science before coming to meet you and he found … for example … I can’t say which customer this is with [00:08:00] but in health care he built a new algorithm in Hadoop. He on-boarded all these structured data sources, forget Hadoop for unstructured data. This is all databases and then some Office-style documents like PDFs of X-rays and things like that from hospitals.

Stefan: Maybe doctor notes, who knows?

Ari: You bring all this data in. You do some natural language processing. You do some pharmacy history data analysis, and he found that he could literally write an algorithm that found the relationship between diseases and drugs. These are the drugs people take and these are the diseases they have and this is the time correlation between them, and he started asking questions of this health care provider’s doctor.

He said, “There’s something very strange going on. It seems like anyone who has atherosclerosis, calcification of the arteries … There’s a very strong correlation between them and taking flu medication.” The doctor said, “What are you talking about? That makes no sense.”

They started looking into it and a week later the doctors came back and said, “You know what, the calcification gets really, really bad for someone on the flu. The arteries hardened for a short period of time even worse than when they were hardening in the natural state of that person’s health before and they will soften again a little bit after the flu passes but during the flu the arteries lock down and it’s an immune response and the person is at huge risk of heart attack.”

We figured out together with this health care provider that you could actually say, “If someone has atherosclerosis and they come to you with the flu you need to treat that as an emergency situation, because if you don’t get that flu dealt with they could have a heart attack. More importantly, anyone with atherosclerosis should just get a flu vaccine every year [00:10:00] and not be at risk of catching the flu.”

Then he found something much more interesting, which is HIV positive patients were taking prescription only hyper fluoridated toothpaste in huge volumes and no doctor could tell him why and it turns out that the drug cocktails to treat HIV weaken the teeth. Dentists stepped in and prescribed hyper fluoridated toothpaste.

Stefan: Interesting. Isn’t it wonderful what you find out about it … things they don’t.

What are … Let’s shift … That was great.

Let’s shift gears a little bit here again. So outside of the Hortonworks platform and all the stuff you see, what are the most exciting technologies you’re seeing out there right now?

Ari: The most exciting thing -

Stefan: You know, if you go on GitHub, what is like, “Oh?” What is so … what are you listening or subscribing to on GitHub? Or anything like this … what’s really cool?

Ari: The stuff I’m paying attention to right now is around streaming, so Storm, Samza, Continuuity, and -

Stefan: Kafka?

Ari: Kafka for sure.

I really like these micro-batches and transactional consumption of streaming events. Being able to consume events from streams hundreds at a time in a pseudo-transactional fashion, so we’re not paying that silly XA price anymore, while people are still getting reasonable reliability in their consumption, or at least stable use of data.
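A minimal sketch of that consumption pattern with the modern Kafka Java consumer (which post-dates this conversation) might look like the following; the broker address, topic and processing step are hypothetical, and the point is simply that offsets are committed only after each small batch has been processed, giving at-least-once, pseudo-transactional behavior without XA.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MicroBatchConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "events-batch");
        props.put("enable.auto.commit", "false"); // commit offsets ourselves, after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                // Each poll returns a small batch of events, hundreds at a time.
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record.value());
                }
                // Commit only after the whole batch succeeded; on failure the batch is
                // simply re-read, which is the "reasonable reliability" trade-off.
                consumer.commitSync();
            }
        }
    }

    private static void process(String event) {
        System.out.println(event); // stand-in for whatever the pipeline actually does
    }
}
```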

The other thing I really like is machine learning, graph processing. I really think that there is … First and foremost there’s a missing API layer to Hadoop. One of our competitors wrote a blog … I think it was last week … that asserted that there are only two workloads that matter in Hadoop, Spark and SQL, which is just totally wrong. Totally wrong. Hopefully for obvious reasons, but we’ve got to onboard data, you’ve got to manage data, so there’s a whole data [00:12:00] management layer in and of itself that Spark is not for, nor is SQL: classic ETL or ELT type workloads.

Separate from that there are non-iterative analyses, just batch workloads, like joining data sets. I don’t need to do that in Spark, joining a table to another table or doing a customer master analysis, for example, find the customer across channels, a 360-degree view. What does remain for something like Spark is machine learning and graph processing.

I think the world is on the cusp of a breakthrough in scaled-out machine learning. The dirty little secret of Hadoop is that machine learning is really done on giant in-memory nodes for the most part. There’s SAS and R, but then SAS historically was done on big in-memory servers, and R is done on big in-memory servers.

Stefan: You need the shared memory right?

Ari: Yeah.

Stefan: That’s a problem.

Ari: You want to load your whole data set into memory and then iterate across it because you’re constantly doing things. Let’s say you want to do churn analysis. I lost customers … I’m Amazon and I lost these customers and I retained those customers across a one-year boundary. What is the difference between them? What are the factors? What are the characteristics of lost and retained customers so that in the future I can look at some of them and say, “He’s trending toward a loss by the end of this year?” Is it his age? Is it his buying patterns? The department he buys in? The number of visits? What do I have to worry about to retain and grow my customer base?

That churn analysis, you typically will take 500,000 people into a corpus and start examining them, A versus B type testing, and you want to do that in a machine learning fashion, in memory, because you’re going to say, “Is it field 1, AGE? Is it field 2, LOCALE? Is it field 3, SOCIO-ECONOMICS? Is it field 4, PAYMENT TYPE? Is it field 5, THIS GUY ALWAYS GETS GROUND, THAT GUY ALWAYS GETS FEDEX?” [00:14:00]

At some point, someone did analysis, for example, that led to the creation of Amazon Prime, that said, “I can have a bunch of people pre-pay for their shipping, fund everyone else’s shipping and keep these people happy because they get everything in two days even though they pay more for it, you know, and I won’t lose money. I’ll end up making money on the whole thing.” So, that’s a business analyst. That’s a one-time very heavyweight process.

You really need to be able to do things like that in an iterative like, “I want to discover what are the relationships between two groups? And what is the right segmentation between groups?” That’s done in an iterative fashion typically done in memory because I’m constantly revisiting the same data over and over. I think that something like Spark is starting to crack the nut on, “How do I load all of that volume of data across a cluster’s RAM and then start crunching on the patterns at scale?” That’s where everyone wants to go.
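As a hedged illustration of that iterative, in-memory pattern, a churn model could be sketched with Spark’s RDD-based MLlib API roughly as follows; the input path, feature layout and choice of logistic regression are assumptions for illustration, not Ari’s actual pipeline.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class ChurnModelSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("churn-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input: label (1 = churned, 0 = retained) followed by numeric
        // features such as age, visit count, payment type code and basket size.
        JavaRDD<LabeledPoint> training = sc.textFile("hdfs:///data/churn.csv")
                .map(line -> {
                    String[] f = line.split(",");
                    double label = Double.parseDouble(f[0]);
                    double[] features = new double[f.length - 1];
                    for (int i = 1; i < f.length; i++) {
                        features[i - 1] = Double.parseDouble(f[i]);
                    }
                    return new LabeledPoint(label, Vectors.dense(features));
                })
                .cache(); // keep the corpus in cluster RAM for the iterative solver

        // Each optimizer iteration revisits the cached data, which is exactly the
        // "constantly revisiting the same data over and over" pattern described above.
        LogisticRegressionModel model =
                new LogisticRegressionWithLBFGS().setNumClasses(2).run(training.rdd());

        System.out.println("feature weights: " + model.weights());
        sc.stop();
    }
}
```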

I really don’t want to do … I talked to my customers about segments of one. I really don’t want to do the classic Nielsen thing where I pick a thousand families and give them a set-top box and say, “The whole country will watch what the thousand people say they will watch and my brilliant statisticians get it right.” I don’t even want to say, “You know what, forget Nielsen, I’ll go to Direct TV and watch what everyone watches … 30 Million people watch. I want to know what everyone watches.”

I want to say, “In my ideal world, sticking with the TV analogy for a second … In my ideal world commercials become a modular time slot in a show and all the commercials that are relevant to me are computed by Hadoop and loaded into my DVR and they just play when that slot appears.”

I know you are on Facebook looking at cameras that your friends are using. I know you’re on subaru.com looking at cars. I know you are on bestbuy.com considering different turntables and so I’m just going to play those commercials for you.

Stefan: Based on your retina [00:16:00] ID of re-targeting.

Ari: That’s okay but based on the account owner at least in the house and maybe the device owner like this is installed in my teenagers room but I can literally get -

Stefan: I was almost going there.

Ari: I can literally get down to a segment of one that’s ideal to me. First of all, a marketer will pay me way more for that eyeball than a generic eyeball.

Stefan: More targeting, right?

Ari: Yes. Secondly, it’s more relevant to me. I tend not to skip commercials if you get the segmentation right for me. I’m like, “Woah, what is that? I want to hear about that?” It’s possible nowadays to get that right.

The things that excite me are things that unlock the ability to look at volumes of … like groups of data in the billions range, so that we can start to find very discreet segments and target much, much better. In fact, my health care customers says, “You should work more with me than anyone else because we are doing good work. We are saving people’s lives.”

Stefan: For a lower price, I’m sure.

Ari: But then I turned it back on him in saying, “Actually, what you’re asking me to do is build the recommendation engine in health care. You’re asking me to tell doctors, people who have this disease should take that care path. People who take this medication should also take that medication.” It’s the exact same science as retail or travel.

Stefan: It’s just a little bit more complex to follow the different formats and the doctor knows safety for them.

Maybe to round this up what is … There’s obviously a lot of noise in the market still and people trying to get their feet wet, what’s the right approach to start with big data in general at Hortonworks? Where do I start? Obviously, maybe not the type of the data scientist that tries to make -

Ari: Everyone starts there though.

Stefan: Yeah, it’s kind of weird, right?

Ari: It is.

Stefan: It’s sounds interesting, “Science, ohhh.”

What’s … from your experience [00:18:00], what’s the first you would recommend?

Ari: There’s a forking to that answer, two pronged paths. If you’re a developer and want to pick up Hadoop, which a lot do … I’ve literally gotten emails from people who I used to work with saying, “I can add value to my resume if I know Hadoop.”

Stefan: Surprise.

Ari: The answer there is the Hortonworks Sandbox. It is literally a virtual machine image at Hortonworks.com. You go to the sandbox, Hortonworks.com/sandbox, click on that link, download it and you can start to do machine learning tutorials … There’s a bunch of snapped-in tutorials, machine learning, basic SQL -

Stefan: Database is pre-installed?

Ari: PIG. Datameer can actually build a sandbox and we should do that together.

Stefan: We have that with you guys.

Ari: We do.

I thought we did.

People can actually wire up a Hadoop cluster, load some … There is sample data in it … they could bring their own data into it. It’s going to run on their laptop or some kind of desktop-class machine, and then wire up things like Datameer and actually start visualizing and prove to themselves that A.) they can wrap their heads around this problem domain, but B.) instead of battling with their leadership that Hadoop is something we could be doing, they can show people real value.

All my customers where the developer has brought in Hadoop, where that developer has become a hero it’s because they actually went to EC2 or they went to our Sandbox, stood up a cluster, loaded some safe data into that cluster and showed some value before they called a bunch of people.

The other path though is the data lake paths. If you want to go at scale … you’ve already convinced yourself or you don’t care about the individual API’s programming, interface, user interface at a sandbox level, what we see people doing is basically saying, “Let me land a cluster with 10 to 100 nodes into a data center and let me pull some critical data sets onto that cluster and create a lake where people can come to that cluster [00:20:00] and start playing with that data sets.”

Either retain the data set that used to be too expensive to retain, retain it longer, or retain a finer [inaudible 00:20:10] version of it. Sometimes you go from an OLTP data store, which is detailed, to a warehouse and lose all the detail as you do some kind of sampling or some kind of process to lower the volume of the data. You may drop columns; most people tend to drop columns and sample rows.

Here you say, “I have customers with like, 8,000 columns. Can you guys turn it into 15 columns?” So, load the 8,000 columns version in Hadoop and let people start exploring it. Hadoop has back up. Hadoop has retention. Hadoop has an archive. Then lets some scientists and analysts onto those data sets and typically go for the 360 degree view of the customer or the cross-channel analysis, what’s the customer doing across of my lines of business and find your highest dollar value customers. That is the lowest hanging fruit. It has turned us into the archives that can then feed the 360 analysis.

Stefan: That’s the most prominent use case we see all the time with our product.

Ari: Interesting. Good to know.

Stefan: Great. Thank you very much. That was really exciting. Thanks for the beer.

I had a really interesting discussion with Hortonworks’ CTO Ari Zilka about Project Stinger, ORC files, YARN and Tez. The transcript of our talk is included below the video with a bunch of useful hyperlinks so you can go find out more.

Stefan: Can you introduce yourself? What’s your history? What are you doing beside drinking Czech-style beer?

Ari: Sure. Should I actually drink the beer?

Stefan: Yeah, please. I will have a Porter today.

[00:01:18] Ari: Okay. Anchor Steam, which interestingly enough is bottled over in South of Market in San Francisco about two blocks from the company I founded, Terracotta.

Stefan: Cool. That’s a good intro into your history. What did you do before Hortonworks?

Ari: I’m CTO at Hortonworks now. I started out at Hortonworks as Chief Products Officer, transitioned into the CTO recently. Always did part of the CTO role. I don’t know what a Chief Products Officer is, but basically we used to divide CTO into outbound and inbound. I was outbound, focused on customers. Now, I’m both outbound and [00:02:00] inbound focused on product and roadmap and features.

Before Hortonworks, I did the CTO role for Terracotta, the exact same thing, outbound and inbound. When I say outbound, yes, I’ve done talks. I’ve stood alongside Rod Johnson at SpringOne and things like that and got the Duke’s Choice Awards from James Gosling at JavaOne for what the team did at Terracotta but more importantly is I spend most of my time with customers.

I’m always doing what, it’s not a word, but our CEO calls it “solution architecting”. Architecting is a not a word but I don’t know what else to call it, basically, solutioning with customers. Give me your problem domain, give me what you thought about, what you’ve researched so far, and we’ll go to a white board first. We’ll sketch it all out. We’ll deal with all of your corner cases, edge cases, complexities, volume, variety, velocity, even though I hate the three Vs and anything marketing spiel like that. Go through all of that stuff, nail it down then start building proofs of concept, project plans to prove to ourselves this architecture will work. I’ll checkpoint with you. That’s the outbound side of my role.

The inbound side of my role is come back into the organization, engineering and product management representing all that myriad set of use cases and say, “Hive needs to do this next. Nobody is using Pig. Everybody is using Pig.” Things like that.

Stefan: Well, cheers and congratulations on your new role.

Ari: Cheers, thank you.

Stefan: Well, title … Let me double-click on the history. Terracotta was a pretty cool back end for kind of a distributed environment in the Java world [00:03:44]. I happen to know.

Ari: Cool. Actually, as I go around at Hortonworks, I find most people know Terracotta. I just wish more people had written checks for the software.

Stefan: Let’s start a little bit there [00:04:00]. I think you were also a CTO at Walmart?

Ari: Chief Architect …

Stefan: Chief Architect.

Ari: … At Walmart.com.

Stefan: Okay. You’re in the Bay Area quite a while?

Ari: Yeah. I went to …

Stefan: Did you grow up here?

Ari: No, I went to Cal and never left.

Stefan: Well, it’s hard to leave, right?

Ari: Yes, very.

Stefan: What did Terracotta do before we jump into the Big Data? I mean this is Big Data, Terracotta, yeah?

Ari: Terracotta is big fast data is what we used to call it.

Stefan: Yeah, in distributed environment?

Ari: Well, I don’t know if I’m allowed to say it, but I’m not at Terracotta anymore so I don’t care: PayPal, for example, paypal.com, is powered by Terracotta. I think that’s 40 terabytes of purchase histories in memory to do fraud detection. Without going into detail, it is big data. It’s big in-memory data; that’s why their products now are called BigMemory.

Essentially what Terracotta does is a two-tier application level cache that a developer is in charge of. A developer uses it, wants objects from some data store and doesn’t really want to know when they’re getting them from the data store or when they’re getting them from local memory. Then they want to deal with the fact that their application is actually deployed to multiple instances. They don’t want to deal with consistency across threads and across, I call it space and time.

My data is changing. I need to deal with the freshness of that data, the correctness of that data, and I need to deal with the latency of access to that data. I’ll basically stick a Terracotta server in front of my data store and then I’ll wire all my applications onto Terracotta. “In front of” is sort of a misnomer because you’ll read around the database and put data into Terracotta. You won’t read through Terracotta [00:06:00], but you get a cache here. You get a shared cache down here. This is actually scaled out, partitioned and replicated, RAID zero plus one in software, in memory, and it offloads this guy tremendously. So anything I change here is visible here, is visible here. I have dials, a continuum, to be able to set consistency levels: read consistency, read-write consistency, XA compliance and all that kind of stuff.

What this allowed the average application to do is to store terabytes in memory at 1 millisecond latency access time, worst case 10 milliseconds down to Terracotta.
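To make the read-around, cache-aside pattern Ari is describing concrete, here is a minimal sketch in plain Java. It is not Terracotta’s actual API; the class and method names are illustrative only, and a real deployment adds the shared, partitioned L2 tier across application instances that he sketches on the whiteboard.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Illustrative cache-aside wrapper: read around the database, put results into the cache.
    class CacheAsideStore<K, V> {
        private final Map<K, V> localCache = new ConcurrentHashMap<>(); // L1: in-process memory
        private final Function<K, V> dataStore;                         // e.g. a database lookup

        CacheAsideStore(Function<K, V> dataStore) {
            this.dataStore = dataStore;
        }

        V get(K key) {
            // Serve from local memory when possible; otherwise read the database
            // and populate the cache so later reads skip the data store entirely.
            return localCache.computeIfAbsent(key, dataStore);
        }

        void put(K key, V value) {
            // Writes go to the data store first (omitted here), then refresh the cache entry
            // so other readers on this instance see the fresh value.
            localCache.put(key, value);
        }
    }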

Stefan: You already said objects, which made me really excited because, as a big friend of SQL, obviously I did a bunch of Spring applications with Hibernate and Ehcache and those kinds of things. You guys are the distributed version of that, but you always store Java objects?

Ari: Yes.

Stefan: Okay. How did you deal with the whole serialization? Did you do reflection on … I mean how did you compress the objects?

Ari: There are two incarnations of Terracotta. The first incarnation didn’t do serialization at all, actually: we disassembled applications at the bytecode layer and found when bytecodes were editing field-level values, so we had a zero-marshalling system. We weaved ourselves into an application and watched it make changes. You would grab a lock, meaning a synchronized barrier, and that would start a journal in your local thread. We’d keep track of everything you changed. When you released the lock, we’d flush the changes. We were memory-model consistent but transparent, with pseudo-no-marshalling.

Ari: Yeah. We knew it. We just thought our tools could help them find their thread-safety issues. It was just too hard for them to clean up their apps. We went to a straight [00:08:00] serialization-on-put, deserialization-on-get kind of model, with the caveat that you want to not use Java serialization. It’s space-inefficient. It’s time-inefficient.

Secondly, you had an opportunity to optimize that. You could store a deserialized cached form in the application. You didn’t deserialize on every get. You deserialized the first time you did a get on this node, this node or that one.
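As a rough illustration of that “serialize on put, deserialize once per node on get” idea, here is a sketch only, not Terracotta’s implementation; it uses plain Java serialization purely for brevity, which Ari notes you would not do in practice.

    import java.io.*;

    // Sketch: keep the serialized bytes as the shared form, and memoize the deserialized
    // object per JVM so repeated gets on the same node pay the deserialization cost once.
    class CachedEntry<V extends Serializable> {
        private final byte[] bytes;          // canonical serialized form shared across nodes
        private transient volatile V local;  // lazily deserialized, per-JVM cached copy

        CachedEntry(V value) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
                oos.writeObject(value);      // Java serialization used here only for brevity
            }
            this.bytes = out.toByteArray();
        }

        @SuppressWarnings("unchecked")
        V get() throws IOException, ClassNotFoundException {
            V v = local;
            if (v == null) {
                try (ObjectInputStream ois =
                         new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                    v = (V) ois.readObject();
                }
                local = v;                   // first get on this node deserializes; later gets do not
            }
            return v;
        }
    }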

Stefan: Did you have your own serialization interface then if you didn’t use a Java one?

Ari: No [00:08:34] …

Stefan: You override the serializer?

Ari: Yeah.

Stefan: Yeah. I did this once for Hadoop. It wasn’t very popular because, well, back then we discussed with Doug, “Should we have Writables or should we have Serializables?” I was a big fan of, “Hey, Java has Serializable. We just have to override the serializer.” Obviously if you just use the Java serialization system, it’s incredibly slow, right?

Ari: Yeah, but it’s incredibly intuitive. The Java devs know what to do with transients… [00:09:01].

Stefan: It is. Exactly. Yeah, and then we ended up with Writables, and then nobody ever … that was one of the biggest, or still is the biggest, problems in Hadoop. People are like, “Oh, I have that string Writable object. Let me put that into a local variable and get access to it a little later.” And they’re like, “Why did the string change?” Because obviously the object is recycled all the time. Anyhow, but interesting. Good.
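The Writable-reuse gotcha Stefan is describing looks like this in practice: Hadoop recycles the same Text instance behind the values iterator, so you have to copy it if you want to keep it. A minimal reducer sketch; the class and variable names are illustrative.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hadoop reuses the Text object handed out by the values iterator, so storing the
    // reference gives you N copies of whatever it last contained.
    public class CopyValuesReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<Text> kept = new ArrayList<>();
            for (Text value : values) {
                // kept.add(value);        // BUG: the same recycled object gets added every time
                kept.add(new Text(value)); // FIX: take a defensive copy of the current contents
            }
            for (Text v : kept) {
                context.write(key, v);
            }
        }
    }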

Ari: What’s interesting about all of this, it’s kind of funny, because we were working on Project Stinger at Hortonworks and trying to make Hive much faster. One of the things we came across was that the Java system class loader is god-awful slow. The Hadoop core jars are so big that it takes somewhere between half a second and two and a half seconds just to start up a JVM with Hadoop, proxy classes and everything wired in.

We had teams start to write our own class serializers, and we called it [00:10:00] a Hadoop shared object, so we could pass classes around and share them in a cluster, recycle them, but load them faster into the system. We got down to like a 40-millisecond JVM startup time, with the only catch that you had to override the system class loader, which I told the team, “Hey, I’ve played this game before. [Crosstalk 00:10:18]”
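For the curious, overriding the system class loader on a stock JVM works roughly like this. This is a sketch of the mechanism only, not Hortonworks’ actual implementation; the class name and the shared-cache helper are hypothetical. The JVM is pointed at the class with -Djava.system.class.loader and requires a public constructor that takes the parent loader.

    // Launch with: java -Djava.system.class.loader=FastSystemClassLoader ...
    // The JVM instantiates this class reflectively and requires exactly this constructor.
    public class FastSystemClassLoader extends ClassLoader {

        public FastSystemClassLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            // A real implementation would fetch pre-packaged class bytes from a shared,
            // cluster-wide cache here instead of scanning the huge Hadoop jars on disk.
            byte[] bytes = loadFromSharedCache(name); // hypothetical helper
            if (bytes == null) {
                throw new ClassNotFoundException(name);
            }
            return defineClass(name, bytes, 0, bytes.length);
        }

        private byte[] loadFromSharedCache(String name) {
            return null; // placeholder: wire in your own fast class-byte store
        }
    }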

Stefan: Yeah, been there, done that.

Ari: Yeah. No one wants to override the system class loader.

Stefan: Yeah. It’s a little sketchy, but on the other hand, you have your Hadoop classes, and usually in that JVM you don’t want anything else, right? But yeah, I had a lot of fun with class loading when I worked for JBoss. Good old times, right?

Ari: Yeah.

Stefan: Overriding and making sure that Log4j in different versions, in different EJB jars …

Ari: Jarjar.

Stefan: Yeah. All that good stuff. Great. Tell me a little bit about your work at Hortonworks, what’s the most exciting project that you guys are working on right now?

Ari: The most exciting stuff we’re working on right now is inside Project Stinger. There’s two exciting things. I’m going to erase this diagram or try to.

Stefan: It’s in style if you have multiple layers.

Ari: Inside Project Stinger, there are two really exciting things we’re doing. One is on the storage and data-access layer: it’s ORC files and vectorization. Super exciting, and we’ll help anyone …

Stefan: That’s Owen’s child?

Ari: Actually, no…

Stefan: No?

Ari: He never wants to admit it, but ORC is called Optimized RC; in reality, it’s Owen’s RC, but he’s not arrogant, so he doesn’t buy that.

Stefan: No, he’s not. I’ve known Owen for a while.

Ari: Yeah. On the storage side, it’s ORC files and vectorization on top of those ORC files. Then on the factoring side, the architecture side, it’s the Tez and YARN projects.

Stefan: Now Doug Cutting would come in and say, “Look, you know, I can do that much better with… [00:12:00].

Ari: Sure. At the end of the day, ORC is a format contributed by Microsoft’s super geniuses inside the PDW team. These are actually guys who have spent in some cases 30 plus years in data storage formats for relational workloads. ORC has some things that are just superior to anything else on the Hadoop platform right now.

Stefan: Like?

Ari: Like block level indexes and type-aware indexes. It’s a columnar format. I’m not going to bother to sketch up … maybe I should.

Stefan: Yeah.

Ari: But I mean, if you have a block like this, you really want to store all the column one values, then all the column two values, and then all the column three values. Typically, you call this a columnar store. There are some advantages you get, and most columnar stores, all columnar stores, do that. That’s not interesting.

Advantages you get is you can take column one and compress it because now you have more …

Stefan: Most likely better compression.

Ari: Most likely better compression because you have more consistency across value space.

Stefan: Luckily, very frequently Hadoop files are sorted, so even better compression.

Ari: Yeah, exactly. Like, you take a web log and this would be IP address, this would be access port, this would be browser type. Browser type is going to be what now? It’s going to be Chrome or, what’s it called, Chrome or Safari or Internet Explorer, and that’s it.

Stefan: No Internet Explorer anymore.

Ari: So we’re columnar, and we can actually handle columns where the individual record sizes are quite large. We’re type-aware, which means you tell us if this is an int or some kind of numeric value or a long. You tell us that this is a string [00:14:00], and then we start doing things that are type-aware. We are SQL- and Java-type compliant, which is superior to anything else, but then we have an index as part of each block.

Stefan: So then you get to the block much faster?

Ari: Yeah.

Stefan: How big is the block size?

Ari: It’s configurable but a good block size is like a gigabyte.

Stefan: Okay. Well, then it makes sense to have an indexing file.

Ari: For example, a string index would be a dictionary. We dictionary-encode strings and write all the unique strings up in the index. We compress them down and then we write an integer look-up.

Stefan: Yeah, makes sense.

Ari: If you have things like URLs, they repeat a lot. The URLs are actually going to just be URL 1, URL 17, URL 22, URL 33. For integers, we’re going to have min, max, average, things like that.

Stefan: Per block?

Ari: Per block. We’re going to have start date, end date.

Stefan: Okay. It makes it small and really fast.
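To make the per-block index idea concrete, here is a toy sketch, not ORC’s actual on-disk layout: it builds a dictionary for a string column and min/max stats for a numeric column as rows land in a block.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Toy per-block index: dictionary-encode a string column, keep min/max for a numeric column.
    class BlockIndex {
        final Map<String, Integer> dictionary = new LinkedHashMap<>(); // unique strings -> code
        final List<Integer> encodedUrls = new ArrayList<>();           // the column stored as ints
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;               // stats for the numeric column

        void addRow(String url, long port) {
            // Repeated URLs become small integer codes ("URL 1, URL 17, ..." in Ari's words).
            int code = dictionary.computeIfAbsent(url, k -> dictionary.size());
            encodedUrls.add(code);
            min = Math.min(min, port);
            max = Math.max(max, port);
        }
    }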

Ari: We find it, we do it, and then what you can do is basically use the index to skip a block. When you’re doing filters and aggregations, you want to basically say, “There’s a where clause. There’s a query predicate. I can apply the query predicate to the index.”

Stefan: Right. You don’t even need to …

Ari: Right. I don’t even need to hydrate the rows or columns at all.
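Skipping blocks then amounts to testing the query predicate against each block’s stats before touching any rows. A sketch, assuming per-block min/max stats like the BlockIndex toy above:

    import java.util.ArrayList;
    import java.util.List;

    // Apply a "port = X" style predicate to the block stats and only hydrate the blocks
    // whose value range could possibly contain a match.
    class BlockSkipper {
        static List<BlockIndex> candidates(List<BlockIndex> blocks, long wanted) {
            List<BlockIndex> hits = new ArrayList<>();
            for (BlockIndex b : blocks) {
                if (wanted >= b.min && wanted <= b.max) {
                    hits.add(b);   // might contain the value: read and scan this block
                }
                // otherwise the whole gigabyte block is skipped without deserializing a row
            }
            return hits;
        }
    }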

Stefan: That’s the slow part, the deserialization, the reflection.

Ari: Yeah. This is row one, this is row two, this is row three. What vectorization replaces is the classic Java loop: with my connection, I get a result set. While resultSet.hasNext(), then resultSet.field1, .field2, .field3, next result. That while loop kills performance because you’re iterating through a result set and you’re paging data from various large RAM pools into the processor’s L1 cache.

What we’re doing in [00:16:00] vectorization is saying, “Leave this block dehydrated, flattened, unmarshalled, as a block. You can look inside the index as much as you need to. When you do a pass across the block, vectorize the query predicate.” Turn the query predicate into scalars and then pass it across this as a mask. You’re basically looking at a giant set of ones and zeros.

Stefan: You’re basically just overlaying …

Ari: You’re saying, “I’m looking for the following pattern, 1011.” It says, “There is 1011.” Then you say, “Oh, well, I want 1011 this way.” It says, “Okay. I see it right here. It’s row 23.” Then you pull out row 23.
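Vectorized evaluation, as described, turns the per-row while loop into one tight pass over a column batch that produces the set of surviving row positions. A simplified sketch of that shape; Hive’s real implementation works on VectorizedRowBatch objects, which this does not attempt to reproduce.

    // Evaluate "column > threshold" over a whole batch at once, producing the matching row
    // positions instead of pulling field1/field2/field3 out of a ResultSet one row at a time.
    class VectorizedFilter {
        static int filterGreaterThan(long[] column, int rows, long threshold, int[] selected) {
            int n = 0;
            for (int i = 0; i < rows; i++) {
                // A branch-light inner loop over a contiguous array keeps the data in L1 cache.
                if (column[i] > threshold) {
                    selected[n++] = i;   // row positions that survive the predicate
                }
            }
            return n;                    // later operators only look at these n rows
        }
    }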

Stefan: Then you’ve even done these with [inaudible 00:16:44]?

Ari: I’m moving a gigabyte at a time through the data. Obviously L1 caches are on the order of megabytes. Moving a gigabyte through a one-megabyte or 16-megabyte L1 cache will take you hundreds of clock cycles or a few thousand clock cycles, which means it will be done in under a second. We literally found a single laptop could manage a terabyte search in under two seconds with vectorization.

The ORC file is tied to the vectorization and the ORC file is tied to a lot of stuff we intend to provide to technologies like Datameer on top of us which is this global level index. If we take these indexes, you can use them now to show people ontologies about their data.

I have a column. I know its name, I know its type and I know its value range. I can bring the dictionary forward into Datameer. I can bring the integer min/max and date time-stamp values into Datameer, and that becomes dimension data or ontological data that’s very interesting to the end user. Even though the system doesn’t know what it is, it’s very interesting to the end user.

I can use it to speed up searches because I can skip blocks. [00:18:00] Really, what we’re talking about is taking that index, centralizing it across the entire table space, or data set if you will, and then proffering it up to anyone who wants it. Now you can build systems that actually service queries without ever looking at a record at all.

By the way, you paid no price to compute it except on ingest. You laid it out per block.
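In the same spirit, a centralized copy of those per-block stats can answer some questions outright. A toy sketch, reusing the BlockIndex stats from above, that services a min/max question without reading a single record:

    import java.util.List;

    // Answer "what is the overall min/max port?" purely from the per-block stats.
    class StatsOnlyQuery {
        static long[] minMax(List<BlockIndex> blockStats) {
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            for (BlockIndex b : blockStats) {
                min = Math.min(min, b.min);
                max = Math.max(max, b.max);
            }
            return new long[] {min, max}; // computed at ingest time, essentially free at query time
        }
    }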

Stefan: That was going to be my question: how is that impacting write performance?

Ari: You’re packing a gigabyte block. You’re consuming some extra memory, and you’re type-aware, so you’re marshalling the types on the write down to disk. Our write performance for an ORC file needs to get better, but right now we have tuned it so it’s faster, I think, to write an ORC file than an RC file, for example.

Stefan: That wasn’t great. It’s always a question of who you are performing for… [00:18:58].

Ari: Well, we’ve benchmarked ORC file writing against anything we can get our hands on. We find the best things out there, and we’re competitive with them. Well, I don’t even want to name specifics, but I don’t see a problem there. I see a problem in performance relative to an absolute number I’d like us to get to: we can write tens of megabytes a second across a cluster, and I want to write hundreds of megabytes a second, but no one is writing that fast right now.

Stefan: Let’s zoom out a little bit from this. This was really great and helpful. Let’s zoom out a little bit. That will be part of Stinger?

Ari: This is out in public GA.

Stefan: Okay. Good. Basically everybody can write against that then?

Ari: Yeah. In fact, that’s where YARN and Tez come in. I’ll let you zoom out in a second, but no one knows about ORC except us and our customers, our paid customers and the open-source followers of the Hortonworks platform [00:20:00], as something that’s part of the Hadoop community solution space. What I’d like the whole world to realize is that since ORC is in the open domain, truly open, in Apache, gifted away, we need to build all the tooling around it. People need to be able to ingest into ORC, which is not obvious.

Obviously I don’t want to write one record into ORC. I want to write a whole block into ORC. People need block writers for ORC that buffer up and guarantee delivery, perhaps through Storm and Kafka or things like that, but you need a buffering block writer for ORC. Then you need to be able to assemble tools that consume ORC data efficiently. Not just Hive itself, but what if I want to write my own system that has nothing to do with SQL but is still completely dependent on ORC, vectorization and the block-level index? I’d like to skip blocks and have query predicates pushed down, to project my query onto my data as Datameer or as a custom application. How do I do that? The answer is Tez and YARN and all this other stuff.
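One way to picture the “buffer up and write whole blocks” point is the batch-oriented writer in the standalone ORC core library (org.apache.orc). That library post-dates this conversation, so treat its use here as an assumption, and the schema and file path are placeholders.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    // Rows are buffered into a columnar batch and flushed a batch at a time,
    // never one record at a time.
    public class OrcBatchWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            TypeDescription schema = TypeDescription.fromString("struct<ip:string,port:int>");
            Writer writer = OrcFile.createWriter(new Path("weblog.orc"),
                    OrcFile.writerOptions(conf).setSchema(schema));

            VectorizedRowBatch batch = schema.createRowBatch();
            BytesColumnVector ip = (BytesColumnVector) batch.cols[0];
            LongColumnVector port = (LongColumnVector) batch.cols[1];

            for (int i = 0; i < 10_000; i++) {
                int row = batch.size++;
                ip.setVal(row, ("10.0.0." + (i % 255)).getBytes(StandardCharsets.UTF_8));
                port.vector[row] = (i % 2 == 0) ? 80 : 443;
                if (batch.size == batch.getMaxSize()) {
                    writer.addRowBatch(batch); // flush a full batch to the file
                    batch.reset();
                }
            }
            if (batch.size != 0) {
                writer.addRowBatch(batch);     // flush the final partial batch
            }
            writer.close();
        }
    }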

Stefan: We do a bunch of the indexing stuff. We already do that in our system, but not at that low level, right? Since we own data ingestion, we already do all this. I’m sure we have really cool stuff coming up in the next version.

Anyhow, well, cheers for that. Cheers on that.

Ari: Thank you.

Stefan: Let’s take a quick break and then we’ll continue with the next session, talking a little bit more about the Hortonworks ecosystem and what else you guys are doing.