Stefan’s Blog » Big Data Musings From Datameer’s CEO
http://www.datameer.com/ceoblog

Big Data & Brews: What’s in Store for the Future of Big Data?
http://www.datameer.com/ceoblog/big-data-brews-whats-store-future-big-data/
Tue, 12 Aug 2014

One of my favorite questions to ask my guests is what they think will be the next big technology in big data. These are some of the smartest minds in the industry and it’s always interesting to hear their thoughts on the topic.

I thought it would be cool to compile a few snapshots of our past guests, including Hortonworks founding CEO and CTO, Eric Baldeschwieler, data science consultant Antonio Piccolboni and Justin Borgman, CEO of Hadapt.

Tune in below and leave some comments to let me know your thoughts.

TRANSCRIPT:

Eric Baldeschwieler, Founding CEO & CTO, Hortonworks

Stefan: What’s the future? You know? Where do we go? I mean, you talked a little bit about Storm and Spark.

Eric: Mm-hmm (affirmative).

Stefan: Obviously, but where do you see Hadoop maybe five years from now even?

Eric: Well, there’s kind of two … I’m spending a lot of time right now thinking about the future of Hadoop, and there’s two megatrends that I’m really noodling on. There’s a whole list of features that I could give you …

Stefan: Yeah.

Eric: … but that’s probably another talk. One megatrend is how are Cloud and Hadoop going to converge? I think that’s … there’s a 20-minute segment right there.

Stefan: Yeah.

Eric: I think that’s really interesting. If you look at it, Amazon and Google are two mature proprietary systems that show the two ways it could go. Amazon is Cloud-first, and people are having a lot of success running Hadoop on it. Google built an HPC infrastructure with a real focus on supporting things like MapReduce, and that had an HDFS-like storage infrastructure first, and now they do Cloud-like things on top of it, right? They run all their services in, effectively, a Hadoop-like system, or at least an HPC-scheduler-like system.

So, how are these open-source ecosystems going to converge: OpenStack, Hadoop, and all of the various projects in there? I think that’s really wide open.

Stefan: Mm-hmm (affirmative).

Eric: Right? I mean, right now neither project does what the other set of projects need, but IT managers don’t want both.

Stefan: Yeah.

Eric: Right? They want one common place to store all the data, and one common way to compute all the data. One common way to allocate resources to projects.

Stefan: Right. They want a plug in the wall, where they just put in … this is my storage and compute, and it’s a utility.

Eric: Exactly. So they think that that thing is going to be called [00:02:00] OpenStack, but Hadoop is actually getting deployed in a lot more places and at a lot more scale.

Stefan: Than OpenStack.

Eric: Than OpenStack, so how’s that story going to end?

Stefan: Right.

Eric: I have no idea.

Stefan: Yeah.

Eric: There’s a lot of speculation you can do there. The other real megatrend is when we started Hortonworks, we talked about how important it was that the community not fragment. That there be one distribution of Hadoop.

Stefan: Yeah.

Eric: That’s a noble goal, but someone was following me around at a conference the other day and saying, “Admit it! Hadoop, the Hadoop community is fragmented. The Hadoop community is fragmented.” We got into this long argument and ultimately I said, “Well, so what?”

Stefan: Yeah.

Eric: Right? I think, yes, in some ways the Hadoop community, we can argue about how much it’s this way, and how long it’s going to last, but I think the Hadoop community is kind of going into a Unix decade.

Stefan: Yeah.

Eric: If you look at the Unix ecosystem, the Unix APIs came out pretty early. There was the AT&T Unix version and then there was the Berkeley Unix version, and then there was every vendor’s Unix version, and one can argue that this was a terrible thing. That Unix evolved much more slowly than it might have if there had been one.

Stefan: Right. Well, it’s an evolution.

Eric: Yeah, you can argue that, too, and that everybody was slowed down because, as a vendor, if you wanted to write an application for Unix, you had to write it for everyone. You could look at it that way and you could look at the SQL ecosystem and say the same thing. Wouldn’t it be terrific if all the SQLs were the same, because then all the people that write SQL apps would have less work to do?

Or, you could turn around and say, “Well, wait a second, look at those huge ecosystems, right?” If you look at the Unix ecosystem, Unix went from an unknown thing to the default …

Stefan: Multi-billion market and, you know, a lot of technology and innovation are in different areas.

Eric: … and the defaults are the ecosystems on which the systems’ infrastructures are built during that “Unix decade.”

Stefan: Right.

Eric: So I think Hadoop’s going to see the same thing. I don’t know. I’m, of course, a big fan of Apache Hadoop and hope that everybody does continue to base all of their work on that, but whether or not they do, the APIs of Hadoop are being supported by more and more vendors, and more and more products, and more and more distros, be they pure or not pure, all the time and, as a result, I think what’s really interesting, over the next few years, is what are people going to do with Hadoop?

Stefan: Right.

Eric: Right? What is that ecosystem that’s forming above Hadoop? If that does really well, that just drives more of all the Hadoops, and that creates more and more opportunity.

Stefan: Great.

Eric: So yeah, that’s very exciting to watch and see.

+++++++++++

Antonio Piccolboni, Data Science Consultant

Stefan: So outside of RHadoop, maybe as a final question, what are other open source projects that you see that you’re really excited about, or that you maybe consult on in corporations, that you think have a great future?

Antonio: I think Javascript has a great future.

Stefan: Yeah?

Antonio: I think when I… I got a…

Stefan: Did you look into Rhino and those kinds of platforms that run JavaScript server-side? You know Rhino? It’s a funda… what?

Antonio: I looked into node-

Stefan: Okay, node.js, right? Yes.

Antonio: I think Rhino’s something [I hate too 00:18:57] much to do this, but I don’t think I have time to… yeah… to…

Stefan: We’re using this actually to push some JavaScript code back down to the server. It’s kind of similar to what node.js and those kinds of things do. But yeah, so we basically run D3 visualization on the server side with Rhino, and Rhino’s basically a Mozilla… I think a Mozilla project.

Antonio: Yeah. If you look at the trajectory, Java started on appliances, I mean before it was a language for applets.

Stefan: Mmhmm (affirmative).

Antonio: Then I don’t know why they moved it… they needed a language for applets at Sun, so it became a language for applets. It completely, totally bombed, failed drastically. I don’t think I have seen an applet in five years or something.

Stefan: Oh, oh.

Antonio: But then it became a server-side language, and now there’s all of Hadoop… Phoenix is based on Java. I think JavaScript is following a similar trajectory. Of course, I don’t have the crystal ball, but if those are getting better-

Stefan: But you have R.

Antonio: I can try to mould it.

Stefan: Oh yeah.

Antonio: Two examples. We’ll see. It’s already moving onto the server side. It has interesting tools. It has a lot of people working on the interpreters, so that you can write crappy code and it’s going to run fast anyway. So I think…

Stefan: There’s a lot of investment put into JavaScript optimization by companies like Google and Apple and…

Antonio: I think some of the brightest minds in languages these days are employed in one of those three groups that’re doing javascript. This [confidence 00:20:28] like matter markets where… that are based on javascript. They’re up-front about it. “We run javascript on server and client side because it’s simpler.”

Stefan: Yeah, that makes sense.

Antonio: It is fast enough for us.

+++++++++++

Justin Borgman, Hadapt

Stefan: Anyhow, Justin, what’s coming next in the market? What do you think? What’s the future in big data? Obviously it’s SQL, sounds like.

Justin: Yes.

Stefan: But what do you think are some of the trends that you’re keeping a close eye on? Is it in memory stuff? Do we need to do more on the file system area? Is it more kind of the access? What do you think’s the next thing?

Justin: Yeah, great question. I think, certainly, all of those are interesting areas, and memory certainly has its place. I think some of the things that we’re most interested in keeping an eye on is this notion of you sort of have Hadoop. Again, I’ll draw. My bad handwriting.

Stefan: Oh, you didn’t see mine yet.

Justin: You’ve got HDFS in here, and very often people are using this as a landing environment, a data reservoir, a data pick-your-favorite-word, right?

Stefan: Yeah.

Justin: Ultimately, they’re doing some ETL, and they use Hadoop as effectively fancy ETL, and then they push it into a database, a traditional database, maybe that’s Teradata or what-have-you. We’re increasingly focused on what is driving people to skip this step and leave that data in here, and what’s missing that prevents that, because that’s the ultimate future we believe in. Certainly, we founded the company on that: doing all of your analytics in one place, in this data reservoir.

Some interesting things there we think are making ETL a thing of the past, to a certain degree.

Stefan: Oh, we’re on the same page here.

Justin: Yeah, exactly. That’s one area we keep an eye on things that we invest in from an IP perspective to try to make that easier, make that more accessible. Certainly, our vision is effectively bringing database technology into Hadoop. That’s sort of what Hadapt is all about. Continuing to watch that, also watching what the rest of the ecosystem is doing from a maturity around resource management with YARN, but also security.

Increasingly, as Hadoop goes into production at major enterprise customers, there are the kinds of things that aren’t always the sexy things engineers get excited about, but you have to build them anyway. It’s also something we’ve been paid to do.

Stefan: Yeah, go to?

Justin: We see a lot.

Stefan: Yeah. To come back to that observation [00:08:00], because we see that too: Hadoop becomes the melting pot for all the data, but then if there’s something valuable inside, people are still pushing it into databases. Why do you think this is? Is it because you have a BI system sitting on top of the database that’s working well and maybe deeply integrated, or are people just feeling more comfortable having data clean and structured?

Because you could argue, “Hey, I just land all the data in there, and instead of cleaning it and transforming the data, I just do a view of the data, kind of the view concept from databases, where my view is the cleaning of the data.” The advantage would be that if I change my mind about what clean data means, I would just change the view, right? Aren’t we at the point that we have enough storage and compute to handle everything as a view?

Justin: Right. I certainly believe that’s coming, and hopefully not too far away, but I think the challenge, still, for a lot of people is that these existing legacy platforms have so much functionality already built into them that Hadoop hasn’t been able to duplicate yet. To your point, BI tools are one common way that people want to interact with that data, and these tools don’t work perfectly seamlessly with Hadoop yet today. Things like doing deletes or updates, for that matter.

Stefan: And you guys do that, deletes and updates?

Justin: Not yet, but very shortly. That’s something we’re working on, and again something that we think we’ll be able to implement much more quickly, given our DBMS component of our architecture than open source vendors that are trying to reinvent the wheel, to your point earlier.
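The “view” idea Stefan raises above, land the raw data once and define “clean” as a view over it rather than materializing an ETL’d copy, can be sketched in a few lines. This is a toy illustration, not Hadapt or Datameer code; the record format and cleaning rules are invented.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CleaningView {
    // Raw landed records: stored once, never rewritten.
    static final List<String> RAW = List.of("  Alice ", "BOB", "", " carol");

    // The "view": a cleaning rule applied lazily at read time,
    // like a database view over the raw table.
    static List<String> read(List<String> raw, Function<String, String> clean) {
        return raw.stream()
                  .map(clean)
                  .filter(s -> !s.isEmpty())  // drop records that clean away to nothing
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Version 1 of "clean": trim whitespace only.
        System.out.println(read(RAW, String::trim));
        // Changed our mind about what clean means: also lower-case.
        // No re-ETL of stored data; only the view definition changes.
        System.out.println(read(RAW, s -> s.trim().toLowerCase()));
    }
}
```

The point of the sketch is the last two lines: redefining “clean” is a one-line change to the view, with the landed data untouched.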

+++++++++++

Bonus #BigDataBrews: Eric Baldeschwieler Talks About The Future of Hadoop
http://www.datameer.com/ceoblog/bonus-bigdatabrews-eric-baldeschwieler-talks-about-the-future-of-hadoop/
Tue, 18 Feb 2014

We usually just do two 20-minute episodes of Big Data & Brews, but the conversation was so interesting with Eric that I wanted to be sure to share this too. Here are two “mega-trends” Eric said he’s paying attention to when it comes to Hadoop:

TRANSCRIPT:

Stefan: What’s the future? You know? Where do we go? I mean, you talked a little bit about Storm and Spark.

Eric: Mm-hmm (affirmative).

Stefan: Obviously, but where do you see Hadoop maybe five years from now even?

Eric: Well, there’s kind of two … I’m spending a lot of time right now thinking about the future of Hadoop, and there’s two megatrends that I’m really noodling on. There’s a whole list of features that I could give you …

Stefan: Yeah.

Eric: … but that’s probably another talk. One megatrend is how are Cloud and Hadoop going to converge? I think that’s … there’s a 20-minute segment right there.

Stefan: Yeah.

Eric: I think that’s really interesting. If you look at it, Amazon and Google are two mature proprietary systems that show the two ways it could go. Amazon is Cloud-first, and people are having a lot of success running Hadoop on it. Google built an HPC infrastructure with a real focus on supporting things like MapReduce, and that had an HDFS-like storage infrastructure first, and now they do Cloud-like things on top of it, right? They run all their services in, effectively, a Hadoop-like system, or at least an HPC-scheduler-like system.

So, how are these open-source ecosystems going to converge: OpenStack, Hadoop, and all of the various projects in there? I think that’s really wide open.

Stefan: Mm-hmm (affirmative).

Eric: Right? I mean, right now neither project does what the other set of projects need, but IT managers don’t want both.

Stefan: Yeah.

Eric: Right? They want one common place to store all the data, and one common way to compute all the data. One common way to allocate resources to projects.

Stefan: Right. They want a plug in the wall, where they just put in … this is my storage and compute, and it’s a utility.

Eric: Exactly. So they think that that thing is going to be called [00:02:00] OpenStack, but Hadoop is actually getting deployed in a lot more places and at a lot more scale.

Stefan: Than OpenStack.

Eric: Than OpenStack, so how’s that story going to end?

Stefan: Right.

Eric: I have no idea.

Stefan: Yeah.

Eric: There’s a lot of speculation you can do there. The other real megatrend is when we started Hortonworks, we talked about how important it was that the community not fragment. That there be one distribution of Hadoop.

Stefan: Yeah.

Eric: That’s a noble goal, but someone was following me around at a conference the other day and saying, “Admit it! Hadoop, the Hadoop community is fragmented. The Hadoop community is fragmented.” We got into this long argument and ultimately I said, “Well, so what?”

Stefan: Yeah.

Eric: Right? I think, yes, in some ways the Hadoop community, we can argue about how much it’s this way, and how long it’s going to last, but I think the Hadoop community is kind of going into a Unix decade.

Stefan: Yeah.

Eric: If you look at the Unix ecosystem, the Unix APIs came out pretty early. There was the AT&T Unix version and then there was the Berkeley Unix version, and then there was every vendor’s Unix version, and one can argue that this was a terrible thing. That Unix evolved much more slowly than it might have if there had been one.

Stefan: Right. Well, it’s an evolution.

Eric: Yeah, you can argue that, too, and that everybody was slowed down because, as a vendor, if you wanted to write an application for Unix, you had to write it for everyone. You could look at it that way and you could look at the SQL ecosystem and say the same thing. Wouldn’t it be terrific if all the SQLs were the same, because then all the people that write SQL apps would have less work to do?

Or, you could turn around and say, “Well, wait a second, look at those huge ecosystems, right?” If you look at the Unix ecosystem, Unix went from an unknown thing to the default …

Stefan: Multi-billion market [00:04:00] and, you know, a lot of technology and innovation are in different areas.

Eric: … and the defaults are the ecosystems on which the systems’ infrastructures are built during that “Unix decade.”

Stefan: Right.

Eric: So I think Hadoop’s going to see the same thing. I don’t know. I’m, of course, a big fan of Apache Hadoop and hope that everybody does continue to base all of their work on that, but whether or not they do, the APIs of Hadoop are being supported by more and more vendors, and more and more products, and more and more distros, be they pure or not pure, all the time and, as a result, I think what’s really interesting, over the next few years, is what are people going to do with Hadoop?

Stefan: Right.

Eric: Right? What is that ecosystem that’s forming above Hadoop? If that does really well, that just drives more of all the Hadoops, and that creates more and more opportunity.

+++++++++++

Big Data & Brews: Eric Baldeschwieler on the History of Hadoop & The Beginnings of YARN
http://www.datameer.com/ceoblog/big-data-brews-eric-baldeschwieler-on-the-history-of-hadoop-the-beginnings-of-yarn/
Tue, 11 Feb 2014

Given its opening day at O’Reilly’s Strata Conference, today seems an appropriate day to share this second part of my discussion with Eric Baldeschwieler, who shares the rest of his story of how Hadoop came to be within Yahoo.

TRANSCRIPT

Stefan: I really have a tough job. Sitting here, drinking…

Eric: It’s good to bring your passions together.

Stefan: Yeah. Good. Absolutely.

Welcome back to Big Data & Brews, with Eric14.

Eric: Hello again.

Stefan: As promised, we had a few more beers and we shared a few more laughs here … well, a few more sips. We stopped a little bit where the history of Hadoop got really interesting. You said your team worked on your own system, and then you kind of got convinced or talked into adopting that very, very early version of Hadoop that ran under the Apache license. Let’s double-click a little bit on this. I’m certainly curious about all the conversations and discussions that you guys had. Inktomi was all C++ based, right?

Eric: Absolutely.

Stefan: You guys were all hardcore, low-level, deep down there. Now there’s those guys coming along, saying, “Hey, why don’t we …” or maybe those hippies, the Stallman hippies … “Hey, why don’t we do Java? Everything is good, we have garbage collection.” That’s one of the never-ending stories in conversations and fights in the Hadoop land. It’s like, “Why aren’t you doing C++?” What’s maybe your experience with that as well?

Eric: Yeah, sure. Everybody on the original team at Yahoo was a C++ coder, so it wasn’t a decision that we took without some consideration.

Stefan: (laughs) They had to learn a new programming language, basically?

Eric: They did. The day I came into the staff meeting and said, “Guys, Raymie and I have been talking about it, and he’s convinced me. We’re going to go with Doug Cutting’s project. We’re going to take all of our learnings and bring them into what was Nutch. [00:02:00] We’re going to create a new project, he’s going to split it out, and we’re going to just commit to that.”

Stefan: Did they throw things?

Eric: Very long faces. I would say it took them about six months before people started to see that this had been a good decision. For about six months, I was the least popular person on the floor.

But yeah, so why Java? First off, it was a bit coincidental. The thing that really mattered to us in the short term was that we were adopting an existing project. That just completely changed the dynamic of the internal conversation about whether our company could contribute to open source. Because if you’ve ever tried to convince a company to make a major investment in an open-source project, it’s hard.

I’ve watched a lot of companies that have made the decision that they’re going to compete in the Hadoop ecosystem … I guess we should have stopped at five beers. A number of companies that have decided they’re going to compete in the Hadoop ecosystem … not just use it, but actually have Hadoop-based products … haven’t yet figured out how to contribute to an open-source project. So for a company whose business was something else entirely, it was a long, drawn-out process.

Stefan: I think I remember Doug Cutting had to actually commit all the patches you guys did for the first, what, year or something?

Eric: It took us 18 months to really get to a fully normalized situation. For the first 18 months, Doug was doing basically all the commits. Then we got Owen as the second committer, and they shared it. That was a full-time job for them during that period. So yeah, that was pretty nutty. But in the end, we got there.

I didn’t appreciate it at the time, but it was a really revolutionary decision that Yahoo did that, so I give those guys a lot of [00:04:00] kudos for backing us. The open source decision was really hard, so adopting an external project meant that it’s a much easier decision to say we’re going to improve something that exists, versus to say we’re going to take our own artifact and move it into open source. It’s just much harder for a company to do that. As our initial project, it made it much easier. I think we’d still be debating with the legal team-

Stefan: (laughs)

Eric: … if we’d built Juggernaut to completion and then tried to open-source it. We wouldn’t have succeeded yet. But that’s legal.

In terms of Java, part of it is, given that what exists was in Java, we kind of inherited that. In retrospect, did it make sense? I think it made sense for a number of reasons. The first is just that Java is so much more productive. There’s so much better tooling, in terms of debuggers. You don’t have the garbage collection issues. You just have far fewer bugs to begin with.

Then you’ve got all the free tooling to do all kinds of code analysis, all kinds of things. Yes, similar tools exist for C, but they’re just … All the new academic work happens in Java because it’s so much easier to do academic work in. It’s just a much richer set of tools. It’s a much easier language in which to write correct code, which meant that the first version of Hadoop was correct much sooner.

That’s not to be ignored, because in this game, agility matters. Getting it right, and getting it working, and getting it out into people’s hands is a huge piece of what’s needed to make open source succeed. The fact that Hadoop was working, and succeeding, and visibly improving mattered a heck of a lot more than its ultimate performance in its first … even today, but certainly in its first few years of life.

I think Java [00:06:00] really bootstrapped that. It gave us a lot of leverage in terms of the amount of correct working code we could get in people’s hands quickly. The fact that it was slower than it might have been if it was handcrafted C++ code didn’t matter as much that it was there and it worked today. That factor can’t be ignored, because in an open source project, when you always have new people coming in, the learning curve and the code-correctness curve issue never goes away.

If you’re going to have a project with hundreds of contributors that’s doing complicated stuff, Java’s a real advantage. Just code transparency and all those analytics tools and everything else, we really leveraged all that.

Then you get to the argument about whether … that doesn’t matter. Ultimately, C++ could be better. There’s people that’ll stand up and say, “That’s just not right. Java can produce as good code as C++.” There’s a case. We’ve been doing a bunch of work on Stinger, Hive, where I’ve looked at one part. The Impala crowd is doing LLVM, really tight C++ code, and the other crowd is doing vectorized code and letting the git do its thing. Really …

Stefan: Same thing.

Eric: It’s the same thing. The performance difference isn’t that noticeable. Java has one real Achilles’ heel, which is that it’s much harder to do dense memory management. HBase, for example, or the NameNode, is a place where the amount of data you can put in memory really matters. Then you start having to invest a lot more to use memory effectively in Java. That kind of negates a lot of the advantages of Java. But that’s kind of third order. First, you need to get there. [00:08:00] Then, at some point it becomes as complex as C++, but it’s still not worse than C++.
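The dense-memory problem Eric describes has a standard workaround in NameNode-style Java code: pack fixed-width records into parallel primitive arrays instead of allocating one heap object per record, trading per-object header and reference overhead for a handful of large arrays. A minimal sketch, with invented names (this is not actual HBase or NameNode code):

```java
public class PackedBlockTable {
    // Parallel primitive arrays: one slot per record, no per-record objects.
    private final long[] blockIds;
    private final int[] sizesKb;
    private int count = 0;

    public PackedBlockTable(int capacity) {
        blockIds = new long[capacity];
        sizesKb = new int[capacity];
    }

    public void add(long blockId, int sizeKb) {
        blockIds[count] = blockId;
        sizesKb[count] = sizeKb;
        count++;
    }

    /** Linear scan for brevity; a real table would keep the ids sorted or hashed. */
    public int sizeKbOf(long blockId) {
        for (int i = 0; i < count; i++) {
            if (blockIds[i] == blockId) return sizesKb[i];
        }
        return -1;  // not found
    }

    public static void main(String[] args) {
        PackedBlockTable table = new PackedBlockTable(1_000_000);
        table.add(42L, 128);
        table.add(43L, 256);
        System.out.println(table.sizeKbOf(43L));
    }
}
```

The cost is exactly Eric’s point: the code starts to look as manual as C++, since you manage capacity, layout, and lookups yourself instead of leaning on the collections library.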

I think in the context of trying to build a big open-source ecosystem, I think Java has been a huge advantage for Hadoop, and all the people that have leapt up and … Every year, there’s been a challenger to Hadoop, and every year, these things have not proven relevant. I’m pretty bullish. I actually think a lot of folks, they have reasons why … Over time, will pieces of the Hadoop ecosystem be recoded in C? Maybe, sure. There’s things in JNI in the main Hadoop code line now, and there will probably be more in the future.

Stefan: Compression, for example, right?

Eric: Yeah, compression, that’s an easy one. System calls in Java are terrible.

Stefan: Yeah. I remember the first version of Hadoop where df was hard-coded. That was one of the main reasons why Hadoop worked awesome on Linux machines. As soon as you run it on Windows, df is of course a completely different command. The way it was implemented, it called df and then it parsed the results. If you actually go to a different Unix version where df has a different syntax, you’re screwed.

The first time I saw that implementation, I’m like, “Wow. You’re really just executing a command and parsing the text that’s coming back to make a decision like how much space can I use of this hard drive?” Obviously, things got better over time.
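Stefan’s df story is worth making concrete: early Hadoop learned free disk space by shelling out to df and parsing the text that came back, so any platform where the columns differed broke it. A rough sketch of that style of parsing, against a canned sample (the numbers and column layout are assumptions based on GNU df -k output, not the actual Hadoop DF class):

```java
public class DfParser {
    // Sample GNU `df -k` output with hypothetical numbers.
    public static final String SAMPLE =
        "Filesystem     1K-blocks      Used Available Use% Mounted on\n" +
        "/dev/sda1      488245288 123456789 364788499  26% /\n";

    /** Returns the "Available" kilobytes from the first data row. */
    public static long parseAvailableKb(String dfOutput) {
        String[] lines = dfOutput.split("\n");
        if (lines.length < 2) {
            throw new IllegalArgumentException("unexpected df output");
        }
        // Skip the header, split the data row on runs of whitespace.
        String[] fields = lines[1].trim().split("\\s+");
        // GNU df column order: Filesystem, 1K-blocks, Used, Available, Use%, Mounted on.
        // On a Unix whose df prints different columns, index 3 is simply wrong:
        // exactly the fragility Stefan describes.
        return Long.parseLong(fields[3]);
    }

    public static void main(String[] args) {
        System.out.println("Available KB: " + parseAvailableKb(SAMPLE));
    }
}
```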

Eric: Yeah, and I think Hadoop is getting to the point where in the future, it’s going to start driving its requirements into Java. That’ll be fun to see how those things co-evolve.

No, I mean, there are limits to Java. I wouldn’t argue that it’s the right thing [00:10:00] for everything. But I would argue that agility really matters in this game because, yes, you can statically code what’s in Hadoop today better in C++, but what you want to be doing is not recoding the inner loop for a 2X performance gain. What you want to be doing is taking all your learning from the last n years and building a new algorithm, new data structure, new approach that’s qualitatively better. Java’s a better language to do that next version in.

I think Java’s going to continue to be the major language of Hadoop, or at least the JVM will be the major platform for Hadoop for the immediate future.

Stefan: As you started building Hadoop then with your team, how was the adoption path within Yahoo, and what kind of services and projects moved onto it? Was it some people that were right away on it, and some people that came later in the game? I hear there was like a camp that really liked a database that starts with O, and then there was you guys, and …

Eric: No, not really an Oracle camp. There’s … I mean, how many 20-minute segments do we have to talk about this?

Stefan: (laughs) Well, we’ll make more. That’s for sure already.

Eric: Adoption was interesting. A company like Yahoo, there’s so many different organizations and teams doing so many different things. The first year or two of Hadoop, when we were just getting off the ground, we kept discovering other projects that asserted that they were also solving similar problems. They wanted to figure out why we shouldn’t abandon Hadoop and adopt their C++ version.

Stefan: (laughs) Of course.

Eric: That they were using to manage data processing for their ad log pipeline, or something like that. So for the first year [00:12:00], it was really just focusing down and saying, “Look, none of these other things that we’re discovering internally solve the search problem. Our goal is to write something that can work at internet scale. There doesn’t exist such a thing. That’s why we’re building this.” We just had to sort of stay focused.

The next thing that happened was those guys that I told you had been coming into my office and saying, “I want to do more research. You’ve got the crawl data. You’ve got the search logs. If I could get that, I could do better science and I could make more money for Yahoo.”

We kind of got to the point where we could make Hadoop work when we started … when Doug took it out of Nutch and put it into the Hadoop project … it was working on about 20 nodes. We got to the point where it was working on about 100 nodes. Then we realized that to get it to the point where it could work on 1,000 nodes was going to take us another 18 months or so.

With all these other parties saying that they had competing systems … It’s hard on a company to not have a product for two years. So we said, “Let’s put out a Hadoop cluster for the science teams. That’ll be a great proof of concept. It’ll show the rest of Yahoo that we’re building something valuable. That’ll keep our managers happy, guarantee funding. Good thing.” (laughing) We weren’t thinking of it as our mission. We were thinking of it as a way to stop Arcotti from coming into my office every month and demanding …

Stefan: Demanding results.

Eric: Yeah, and give the science team something to work with. But mainly, it was just to show that we were producing value, because our goal was to rebuild the web crawl infrastructure with it. But that was, like I said, just a long ways out.

So we put it out in the science hands. What we expected was they’d get some interesting, basically, research results. Maybe they would get data results, where they would [00:14:00] work on the data, and come up with a new spelling correction dictionary, which would then get put into a production system.

But what happened was, they did all that. They were very excited, because basically we blew their minds. Their productivity went up orders of magnitude, because instead of spending their day trying to find data around Yahoo and then figure out what subset they could get onto whatever storage they had … Half their time was spent doing IT work, basically, finding and moving data and not doing research. When they did the research, they would always have to do it on a tiny subset of the data. Now they could just have all the data, move it once to one place, share it, and do their work. They were just much more productive.

Because they had so much more compute resource, they could write their code in Java and be much more productive. There was this explosion of research results. More importantly, they started to build prototypes of production systems. They said, “Look, we want to take the ad logs and process them every 15 minutes to come up with a model of what people are interested in, and put that back into production every 15 minutes. If we do that, Yahoo will make more money.”

That changed the game. All of a sudden, this created a completely unanticipated virtuous cycle where teams were demanding that we build and support Hadoop clusters for them, so that they could build production applications, not in search at all. It just started to grow. I found myself running a Hadoop service with ultimately thousands of customers inside Yahoo and 40,000 nodes.

But yeah, early days, it was not … People had to think differently to use it. There’s this guy Larry Heck who ran the search and advertising science team, who saw what it could do. Originally, there was this guy Arcotti and a couple of other scientists who used it and got good results. Larry saw this and said [00:16:00], “I’m going to make everyone on the science team use Hadoop.” We started to maintain metrics, of just how many people on his team, what percentage of every subgroup was using Hadoop. He started taking that to his staff meeting and asking people, “Why aren’t you using Hadoop?” This just caused it to explode.

Yeah, it was this huge unanticipated success with the science teams. We thought of data science as something that was just going to produce research results, as a side project. Ultimately, it became a pipeline of new applications. It drove most of the use of Hadoop in Yahoo.

Stefan: What was, over that period of time … I’m sure it changed, but just your first thought, [00:16:43] … what was your most requested feature at that period of time?

Eric: Oh gosh. More data? (laughs) It was mostly just more scale, more data, and then obviously more speed. The focus was never on more features. It was really always on just more stability, more scale, and more performance. Really, our first Hadoop t-shirts from the summits had Yahoo scaling Hadoop on them, because that was it. It wasn’t about new APIs. I think the APIs were sufficient for a lot of work in the early days. I give the Google guys a lot of credit for that. Doug built something that worked, in terms of the APIs.

It was really much more a matter of taking everything we’d learned building internet-scale systems. We improved the performance by orders of magnitude. We improved the scale by orders of magnitude. That’s a lot of engineering, a lot of learning.

Stefan: Let’s switch gears here a little bit. What are the really cool things going on right now, in today’s world? Are you really excited about the Hadoop ecosystem, or about what’s going on in the Hadoop backend?

Eric: There’s a tremendous amount of stuff going on, [00:18:00] of course. Obviously, the transition to YARN is very exciting. That has been a journey we started … I’m afraid to guess when we really started that. 2009, 2010?

Stefan: Can you explain this a little bit more, for the folks?

Eric: Sure. The original Hadoop version …

Stefan: Here, if you want to …

Eric: (laughs) I don’t think there’s a lot of drawing. There’s a lot better diagrams you’ll find on the web. But the original Hadoop version is built with basically two systems. I haven’t used a chalkboard since I TA’d.

Stefan: Isn’t it fun? (laughing)

Eric: Yeah, it’s taking me back to ’95. HDFS and MapReduce are the two basic layers of Hadoop. Each one of these runs, basically you have a [inaudible 00:18:55] node. HDFS handles your storage, MapReduce handles your compute layer.

The problem with the MapReduce layer is that it assumes one model of programming. It’s a very powerful model of programming. Look at all the things people have done with Hadoop. But there are of course many other possible ways that you could approach distributed computation or just cluster-sharing.

A lot of what people do with Hadoop is actually very simple things that you could do on a cloud. You just need to launch a process and do some computation on the data. MapReduce wasn’t designed for that. It certainly wasn’t designed for running all these new frameworks that are emerging. Even worse than that, because every time you change MapReduce you have to change all the daemons across all the nodes in your cluster, even just evolving MapReduce has been relatively slow, because you have to do it carefully.

We had this thought, which was, what if we break it into three layers, basically? [00:20:00] HDFS, YARN … which technically stands for Yet Another Resource Negotiator, but really we just wanted a name with a “y” in it so that people knew it came from Yahoo (laughing) … and then we could have multiple frameworks on top. You can have MapReduce. You could have, let’s say … MPI was one we thought about a lot in the early days, although that hasn’t been realized yet. You could have lots of others.

Today, you could draw up a long list. Storm is really interesting, Spark is coming along. Both of those are happening at Yahoo today. People are looking at web app containers like Tomcat and figuring out how to run those in the cluster. Just all kinds of long-running services, and HBase, that’s a fun one to run in the cluster.

Anyway, the idea is lots and lots of different frameworks. Every user can choose a different kind of compute model. MapReduce basically was doing two things. It was doing resource management choosing …

Stefan: Right, the hard-coded model, for its own purpose.

Eric: Right, exactly. The resource management is deciding what resources you get on what nodes. And then the actual logic of the MapReduce: where does the user code run, what happens when a node fails, et cetera. So we broke that up into call it a user component and a system component. One of the reasons we did this was not even so that you could run lots of frameworks. It’s so that you could evolve MapReduce more quickly.

Stefan: And you have maybe multiple versions. It’s almost kind of a class loader or virtualization framework.

Eric: Exactly. Now, if you believe you can improve MapReduce, that just becomes something that you can run a test version of from your desktop and see whether [00:22:00] the new one or the old one is better. Whereas before, you had to go and change the clusters, schedule downtime, lots of pain. It will make all the existing work much more agile. It’ll also hopefully untap a huge amount of innovation. Watching that innovation is going to be really interesting over the next few years.
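The split Eric describes can be caricatured in a few lines: a generic resource manager that only hands out containers and knows nothing about any framework, plus pluggable per-application masters that supply the compute logic. This is a conceptual sketch only, with made-up class and method names, not the actual YARN API:

```python
class ResourceManager:
    """System component: tracks cluster capacity and hands it out.
    Knows nothing about MapReduce, Storm, Spark, or any framework."""
    def __init__(self, total_slots):
        self.free = total_slots

    def allocate(self, n):
        # Grant as many slots as are available, up to the request.
        granted = min(n, self.free)
        self.free -= granted
        return granted

    def release(self, n):
        self.free += n


class AppMaster:
    """User component: framework-specific logic, swappable per job.
    Two versions of 'MapReduce' could run side by side this way."""
    def __init__(self, name, slots_needed):
        self.name = name
        self.slots_needed = slots_needed

    def run(self, rm):
        got = rm.allocate(self.slots_needed)
        result = f"{self.name} ran on {got} slots"
        rm.release(got)  # hand the resources back when done
        return result


rm = ResourceManager(total_slots=100)
print(AppMaster("mapreduce-v2-test", 30).run(rm))
print(AppMaster("storm-topology", 20).run(rm))
```

The point of the separation is exactly what Eric says: the resource manager stays stable and shared, while anyone can drop in a new or experimental application master without touching the cluster daemons.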

Stefan: Eric, thank you very much for joining me for a drink here at Big Data & Brews. Hope to see you back soon.

Big Data & Brews: Eric Baldeschwieler on the History of Hadoop
Tue, 04 Feb 2014

Eric Baldeschwieler is an influential figure in the Big Data and Hadoop community. I was honored to have him in to chat and hear his view on the history of Hadoop.

TRANSCRIPT

STEFAN: Welcome to Big Data and Brews, today with E-14 [00:00:10], Eric Baldeschwieler. Welcome.

ERIC: Thank you.

STEFAN: Can you introduce yourself and- nice, cold drink you brought with you.

ERIC: Sure. I’m Eric Baldeschwieler. I’ve been working with Hadoop since its inception. Before that I was building search engines for Yahoo and Inktomi, so I’ve been working with big data since ’96, by some reckoning.

STEFAN: So it was Inktomi before… and the team joined Yahoo after the acquisition.

ERIC: That’s right, Inktomi was acquired by Yahoo in 2003. And the beer! This was in my fridge.

STEFAN: We’re both German, so…

ERIC: That’s right, there’s that cultural heritage here. This is actually a California beer. There’s a lot of great beer in California.

STEFAN: I do agree.

ERIC: This one is from… I’m having a senior moment. It’s from the south of us near San Luis Obispo, Paso Robles.

STEFAN: So, a microbrewery?

ERIC: I found it in my fridge.

STEFAN: Nice.

ERIC: I actually drink it fairly regularly, obviously.

STEFAN: Okay. Well, then let’s do it. So, the first question that I have, when we hang out somewhere at one of those events [00:01:41], “Why E-14?” What’s the background there? That was your e-mail address at Yahoo, I assume?

ERIC: Eric-14 is a label that’s been with me for a long time, and actually goes all the way back to my sister’s grade-school, where there were two Karen B.’s. There was a Karen-6 [00:02:00] and a Karen-14, because Baldeschwieler has 14 letters in it.

STEFAN: Okay.

ERIC: So when I had to choose an e-mail address in college, it seemed the obvious choice.

STEFAN: It’s very obvious. Let’s see, so that’s two different kinds? I have a double-barrel ale…

STEFAN: Cheers! [00:02:36]. Yeah, that’s a good, strong beer. It’s not one of those American beers that comes in a can and is brewed from, maybe corn or something.

ERIC: I have this hypothesis that they never stopped brewing beer in California during Prohibition.

STEFAN: Yes?

ERIC: If you go out of the cities there’s all these really good microbreweries, so I think they just never stopped.

STEFAN: Yeah, that makes sense. So, help me a little bit with your history. What did you do, maybe before Inktomi and how did your career go there [00:03:20] and, I guess, most famously to run the whole Hadoop team and really build this up there. So, help us to get there, from a technology perspective.

ERIC: All right. I came to the Valley… God, ’87, from Carnegie-Mellon and did some… worked at a video startup called DigitalFX, so built lots of systems. Then went off to school, came back, worked at Electronic Arts on video games.

STEFAN: Oh, cool! I didn’t know that.

ERIC: That was fun. We made a flying game on the 3DO which was a platform that was very [00:04:00] interesting for a short period of time.

STEFAN: And it’s all hardcore C++.

ERIC: Mm-hm.

STEFAN: The good old stuff.

ERIC: That’s right. The fun thing about those applications was that you really had to figure out how to use the entire machine. Your app didn’t fit comfortably in the amount of memory and the amount of storage and the amount of CPU you had, so you had to understand the machine. That’s one of the things I look for to this day when I hire people for this kind of work: you want people who haven’t just written in Java or written in C and plugged things together. You want people who have had to struggle with something that doesn’t fit in the amount of resources they have, and have had to really learn how to program as a result.

STEFAN: Yeah. One of my favorite papers is, “Why you as a Java developer should learn assembly.” Did you ever read this thing?

ERIC: No. It makes sense to me, you should.

STEFAN: Yeah, right? To really understand the whole [HOP-son 00:04:58] switches and memory management and all that kind of fun stuff. It was a really interesting paper. They basically say, “You don’t need to write everything in assembly, but if you really understand the concepts, then a lot of stuff, including garbage collection and what-have-you, really makes sense.”

ERIC: It’s always fun to just ask people, “How does a function call, how is it implemented,” or, “How does garbage collection work?”

STEFAN: Right.

ERIC: Right. If you’re not ready to work in system infrastructure, if you don’t understand those things…

STEFAN: Yeah. Okay. So, the flight simulator, and then…?

ERIC: Then back to… then I hitch-hiked around Europe for a couple of years.

STEFAN: Oh, that’s cool! A couple years, even? What was your, beside Germany, of course, what was your favorite?

ERIC: Actually, my grandparents are Swiss, so I managed to get a job at the ETH in Zürich.

STEFAN: Oh, cool. Beautiful city.

ERIC: So that was a home base. Not only is it [00:06:00] a beautiful city, but you can take the train in an hour and you can be in Italy, France, Germany, the world changes very quickly from there, so it was a great place to explore Europe from. So, did that, and then back to Berkeley for a couple of years, and then into Inktomi.

STEFAN: Okay.

ERIC: Which, it was just an amazing time to be looking for a job, of course, because the dot-com thing was just really starting.

STEFAN: Yeah.

ERIC: Eric Brewer, who was one of the founders of Inktomi, was my advisor. I looked at a number of other places, but ultimately I decided I’d be nuts not to join this thing.

STEFAN: Right.

ERIC: So really, it was at the beginning of a sort of search engine revolution that happened throughout the dot-com era. Inktomi, kind of, is one of these companies that was huge, and then got very small near the end.

STEFAN: Yeah.

ERIC: But throughout that period we were just building and rebuilding and rebuilding this search engine. My part of it was actually building the content system. How do you crawl every document in the world and tear it apart and index it so that it can be found by the runtime search engine? So, that’s building a big data system, if you think about it. I ran a team from ’97 to 2005, both at Inktomi and then after the acquisition at Yahoo, that did that, and by the end… the ‘marketing number’ was we had a hundred billion document crawl. Put more carefully, we knew of a hundred billion URLs. That was how Google was putting it, so we did the same thing. We were crawling tens of billions of documents and managing hundred-terabyte data sets, things of that sort, which back in 2003, ’04, ’05 were really big numbers.

STEFAN: Yeah, definitely. What do you mean with the content system? Was that part of the crawling or the [post 00:07:55] processing?

ERIC: All of that, of discovering and fetching every document [00:08:00] in the Web, i.e., crawling it. Then, okay, you have a pile of tens of billions of documents, how do you turn that into something that a search engine can use? That means building a web map so you can do page rank and all those things, so taking every document, tearing it apart into sets of terms, then sorting all that information by the terms, sorting it by the document that the link is referring to. Basically, doing petabyte scale sorts. We built a series of infrastructures to do that, and just storing and managing all that data. We built, rebuilt, and rebuilt that system about four times by 2005.

STEFAN: Wow.
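The pipeline Eric describes, tearing each document into terms and then sorting everything by term, is essentially a distributed inverted-index build. Here is a toy single-machine sketch of that map-sort-group pattern, illustrative only; the real system did this across a thousand nodes at petabyte scale:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each doc into (term, doc_id) pairs, sort by term, then
    group -- the same shape as the petabyte-scale sorts described
    above, shrunk to one machine."""
    pairs = []
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            pairs.append((term, doc_id))
    pairs.sort()  # the "big sort" step
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return dict(index)

docs = {1: "big data systems", 2: "big search systems"}
print(build_inverted_index(docs))
```

A web map for PageRank falls out of the same pattern: emit (target_url, source_url) link pairs instead of (term, doc_id), then sort and group by target.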

ERIC: Then in 2005, actually what happened was, we were thinking of re-architecting it again, and the science team started coming and banging on our door… they didn’t stop at my door, they kept coming, they were giving me a lot of advice (laughter). They said they would love to have a system like the ones in these papers they were reading from Google, because they wanted to do research on all the documents that we were crawling. We’d built these thousand-computer clusters, one of them to crawl the web, one of them to do the page rank calculations, one of them to actually build the final indexes that the search engine used. But all of that hardware was dedicated to that one purpose…

STEFAN: One thing, yeah.

ERIC: …and they couldn’t use it. We were looking to re-architect all that anyway, so that was the point where we said, “Let’s build the system based on the MapReduce interfaces and the HDFS interfaces. Well, the Google File System interfaces.” We knew we could do it because we’d built four of these before. We already had a thousand-node system that was effectively running MapReduce, but the APIs suggested by the Google guys were better. We looked at that and said, “Okay, if we rebuild it this way, then we’ll be able to [00:10:00] reuse that software in many more applications.”

STEFAN: Before we go into more of that really cool infrastructure, I want to… [I will click 00:10:08] a little bit on the challenges around search engines, because I worked in that area as well and was always like, “Boy, how do you get to such a big crawl?” There are all those little tricks, HTTP keep-alive, and DNS caching was a really big problem for us, especially in the Nutch days, because we basically didn’t go to one host and download all its documents; we took URL by URL, so we had to do a DNS lookup for every URL. How did you guys manage to crawl that much data?

ERIC: Some of those problems get easier at scale. Some of them get harder. So, you would partition the crawl by host, so things like DNS-caching were not a big deal. You obviously do need to do it, but the biggest problem we had was being nice to the world. I learned a lesson back in, around ’99, when one of my engineers running a test crawl on his laptop took down Microsoft.

STEFAN: (laughter) That was you?

ERIC: It was just one… but you think about it, how many websites, especially back then, were prepared to have somebody open up and start to fetch thousands and thousands of pages simultaneously. They just weren’t architected for it. Today I don’t think you could take down Microsoft from your laptop by just asking for its pages, but back then you could take down any site in the world that way.

STEFAN: Right.

ERIC: Just figuring out how to be polite. Then you have the reverse problem, which is…

STEFAN: Right, you’re not fast enough.

ERIC: …well, I need to get a billion, I need to get millions and millions of pages out of this website. How do I sequence that? So yeah, there’s a lot of tricks. You partition it where you could, then you had to keep a list of active hosts. The other one that really was a pain was [00:12:00]-

STEFAN: And active host means open connections?

ERIC: Yeah. You had to keep a list of the hosts that were big enough that you couldn’t just take the… by default you’d just take all the URLs and just randomize them and you’re done. That’s a strategy that Nutch probably used, [hash by the URL 00:12:15]…

STEFAN: But then we had the DNS lookup problem that slowed us down.

ERIC: Which, yeah, there’s that problem, and there’s the problem that some hosts are just too big for that and you’re still going to violate politeness if you do that. For those big hosts you would need to come up with a strategy. Some of them were just like Yahoo, they have a lot of pages, but what really got us in trouble were things like affiliate networks. If you look at Amazon, they keep your cookie in the URL, and that means you can crawl the whole site as a unique URL every time; it’ll just dish you out a new one every visit. We’d go back and look at the crawl logs and realize, “My gosh, we have a billion pages from one website. What is going on?” Then we’d have to go back and figure out rules. You could just ban the site, that was the first solution. There are better solutions.

STEFAN: Did you build a cookie management system?

ERIC: Mmm (affirmative). We did a little bit of that. After I moved off the Hadoop project they got very sophisticated. They actually built a crawler that used Mozilla as its core.
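The two crawl problems discussed above, per-host politeness and session-style URLs that make one site look like a billion unique pages, can be sketched in a few lines. This is a minimal illustration, not Yahoo’s actual crawler; the parameter names in `SESSION_PARAMS` and the one-second delay are assumptions for the example:

```python
import time
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Query parameters that often carry session/affiliate state -- an
# illustrative list, not the real rules any production crawler used.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "affiliate_id"}

def normalize(url):
    """Drop session-style query params so the 'same' page doesn't
    look like a new unique URL on every visit."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

class PoliteScheduler:
    """Partition fetches by host and enforce a minimum delay per
    host, so the crawler never hammers one site."""
    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_fetch = {}  # host -> timestamp of last fetch

    def ready(self, url, now=None):
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        if now - self.last_fetch.get(host, float("-inf")) >= self.delay:
            self.last_fetch[host] = now
            return True
        return False
```

Partitioning the URL frontier by host is what makes both tricks cheap: DNS caching, keep-alive connections, and the politeness clock all become per-host local state instead of global coordination.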

STEFAN: Oh, okay.

ERIC: So they were building the whole page [00:13:27].

STEFAN: To execute the JavaScript… oh, cool.

ERIC: Because, again, in the early days, a page was a page, but at this point, what is a page? Unless you can… is it Flash? Is it… who knows what, right? There’s just rendering it out so you can understand what a human is seeing gets to be a hard problem.

STEFAN: Now I know why Chrome is that high up on the whole stats for the [process 00:13:50] because Google is just using Chrome on them. Sorry, I interrupted you. You were going into more of the infrastructure side before I sidetracked you with all that cool [00:14:00] search engine stuff.

ERIC: Sure. That was a fun journey, of course, but by 2005 we looked at this and it became clear that there was an opportunity to re-architect our system. We had a number of different problems that we were trying to solve. One was to make the data accessible to scientists; one was to build a larger-scale version of the crawl. It always needed to get bigger every year, and the clustering systems that we had worked at a thousand nodes, but they were starting to spend more time handling error recovery than actually running the code, because as you get to that scale, things fail all the time. If you’re not very clever about how you handle failure, what you think of as an exception case is the dominant case in the code. Every time a node would fail, the system would go down for a few hours to rebuild…

STEFAN: Ooh, okay.

ERIC: …so nodes were failing frequently enough that the system was spending about half its time rebuilding. That was still okay, but if we doubled it again, you wouldn’t see any improvement at that size. Not unless you started spending more on your hardware, which you don’t want to do, since it was a multi-million dollar budget as it was. We were looking at all that. Hadoop solved that problem; it was clear how it could solve that problem. It had a built-in failure design that was more sophisticated than what we were doing. We also had this sort of science recruitment problem, which was that, to build the best search engine and advertising systems in the world, we needed to hire great people. At that point in time, nobody thought of Yahoo as a place that did big data science, so we asked, “How are we going to put ourselves on the map?” We could build a different system than what Google had done and publish that design, but they had more systems researchers and more staff, [00:16:00] and we’d be a second system. It’s not going to get a lot of attention. But if we take their inspiration and open-source it, that will put us in the conversation in a completely different way.

[16:16] That was kind of a decision that we made as a company, was that we want to actually put Yahoo on the map by building an open-source big data system. We knew we could do it because we built several. We started out with the Google papers, and we then actually went through this whole design process of, “How are we going to refine this, make it better, put our own touch on it?” We changed and changed and changed and we got to a point where we had a completely different design than what the Google papers suggested. Then we looked at that and said, “Well, what’s going to happen, we’re going to open-source this thing, we want it to be adopted, what’s the best guarantee we have of getting it adopted?” It’s to conform to a familiar template. These papers are out there, they’re a template that everybody understands, so we took the design all the way back down to being very similar.

Another thing we decided really early on: if we clone these papers… clone is probably not a word that people would like to hear. But if we use these APIs as our inspiration, then one of two things happens. Either our open-source project wins, everyone adopts it, and then there’s a huge benefit to us because the world is contributing the infrastructure we use, we get to hire people who already know how to use our infrastructure, et cetera. Or we lose, and somebody else builds an even better open-source version of these papers. Because somebody will, and then what happens? Then we have a very easy job of porting all of our infrastructure to the dominant paradigm, and again we win. Open-sourcing seemed like a really good bet. Conforming to those papers seemed like a really good bet, [00:18:00] although it was interesting, the sort of nimbyism… not, nimbyism’s the wrong word. Not invented here, NIH. Even years after the Hadoop project was just going gangbusters, people would stand up in big town-hall meetings in Yahoo and say, “I can’t believe that Yahoo’s not innovating, because Hadoop is just a copy of the Google papers. When are you going to do something original?” It’s like, well, you know (laughter). Great ideas happen elsewhere… I forget who said that, but it’s a valid quote. What’s important is not proving that you have a clever idea; what’s important is executing.

The choice of open-sourcing a MapReduce infrastructure did tremendous good to Yahoo. In the end, we managed to hire and retain a great team, although Yahoo was going through some challenging times, shall we say. Much better than that, we really put Yahoo on the map in terms of hiring scientists. They built a really world-class science organization that drove a lot of innovation across all of Yahoo. The story gets to Hadoop, actually, in another step, which is, having made this decision, we staffed a team out of the people that had built the last four crawling systems and started building a prototype called Juggernaut. But then at the same time, Raymie Stata, who was at the time the chief architect of search and advertising, he later became the CTO of Yahoo, had hired Doug Cutting. Doug Cutting had built this prototype of MapReduce and the Google File System and the Nutch system. You’re probably… you were already using that at this point…

STEFAN: Oh, yeah, I was working with him since 2003 on that.

ERIC: That puts you in a very rarefied crowd.

STEFAN: One of three, yeah.

ERIC: But yeah, so you were one of the [00:20:00] very few users of Hadoop before it was Hadoop. So Raymie started suggesting that I adopt Hadoop as the foundation for this open-source system we were going to build. This was something that we discussed and debated for about six months before I finally concluded, okay, this makes sense, we’re going to do it. It’s funny, we should get back to the Java / C++ thing in a second, because that was one of the things, but we were looking at this and going, “Our design is much better than Doug’s implementation. We know how to build these systems, we’ve built four of them, why should we adopt this thing that is really just a prototype?”

STEFAN: Right.

ERIC: But then we looked at it, and it’s like, “Okay. It’s a prototype, but it’s a prototype in Apache, and it’s built by someone who’s built successful open-source projects before, and who can teach us a hell of a lot about that.” There’s this huge, jarring transition we made where I took the team that we’ve built to build such a system, threw out our prototype, and adopted Hadoop, and then started the process of understanding it and refining it.

STEFAN: So, before we go into more of that history of Hadoop at Yahoo, and I certainly want to know more about whole C++ and Java discussion, I’m sure that will be fun, we will make a break, have a few more beers…

ERIC: A few more beers (laughter)

STEFAN: In between, so we…

ERIC: So we’ll be [00:21:36] singing when we come back!

STEFAN: Exactly! Well, that’s the idea, I want to hear all the songs about all the interesting things that happened behind the curtain. We’ll be right back.