How involved should Data Engineers be in learning Machine Learning Algorithms?

For the past few years, Data Scientist has been one of the hottest jobs in IT. A huge part of what Data Scientists do is selecting Machine Learning Algorithms for projects like SmartHomes and SmartCars. What about Data Engineers? Should they know Machine Learning Algorithms as well? Find out in this episode of Big Data Big Questions.

Welcome back. So, today, we’re going to talk a little bit about algorithms, right? So, you know, put your math hat on, and let’s dive into this question today. And so, today’s question, it’s one I get a lot. It’s about the role of a data engineer in machine-learning. And basically, it is… You know, I’ve taken this question from a couple of different sources that I’ve seen, where they’ve asked, you know, “Should data engineers know machine learning algorithms?” And kind of where some of that falls into is, you know, what is the role of the data engineer, and what is the role of the data scientist? And so, really, this question, for me, is really simple. I’m going to go off of my experience and kind of share with you what I’ve done around machine-learning algorithms and how I’ve approached it in my career as a data engineer, software engineer, you know, Hadoop administrator.

There’s a couple of different ways to look at it, but basically, the way that I’ve approached it is I haven’t really learned them. And when I say, “Learned them,” or, “Know them,” I mean I’m not going to be the one making a recommendation on which algorithm to use. So, you know, the way I look at it is you should be familiar with them, especially as far as, like, what’s involved in the package. So, are you using Mahout? You know, what are the algorithms in there, what are the algorithms in your workflow? And then, all the other libraries too.

So, if you’re evaluating other libraries…so, maybe you guys are looking to…you know, maybe you haven’t used Spark and you want to look at the MLlib library that’s there, and you’re kind of going back and forth through those, you want to understand from a basic, very high level, you know, what those algorithms are, and for sure, what algorithms you’re using in your environment, so you can make an educated recommendation saying, “Hey, you know, I think we should move this. Let’s still have the data scientists involved, and have them, you know, look and make sure that the algorithms that we’re going to be using from those packages are going to fall in line with what we’re really using,” because that’s one of the things too, you’ll find that they will differ a little bit, so, you know, what we’re using in Mahout may not be exactly the same, you know, version in, you know, MADlib, or, you know, MLlib.

And so, just be able to understand kind of for sure what’s in your workflow. Be familiar with them too. Another thing that I did…so, like I said, be familiar with them from a high level, but don’t be the one making a recommendation. I actually did, you know, pick one, so I would say, you know, be familiar with them, but pick one that you really want to…you know, really want to understand and learn. I picked Singular Value Decomposition, because that’s something that we used a lot in our workflow, and so, I was just kind of…had a natural curiosity for it, and it…you know, it had a really cool story too around it. So, you know, I found some stories around it, you know, it was made really popular with the Netflix Prize. So, back…Netflix had a challenge for…you know, to, “Beat our data scientists with your algorithm.” And so, SVD was used to, you know, do some of the sorting there, and it was kind of made famous from that perspective, and so, you know, I was familiar with them all, but I made sure that I understood one, just for natural curiosity.
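For anyone curious about what Singular Value Decomposition does, here is a minimal sketch in Python. It is an illustration only, not the code from any production workflow or Netflix Prize entry. To stay short and dependency-free it uses power iteration, a simpler technique than a full SVD, which approximates only the largest singular value and its two vectors. The function name `leading_singular_triplet` and the tiny ratings table are made up for the example.

```python
import math


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value of `matrix` and its vectors.

    Uses power iteration on A^T A. This finds one singular triplet only;
    it is not a full SVD. Assumes the matrix is non-zero and the starting
    vector is not orthogonal to the leading right singular vector.
    """
    rows, cols = len(matrix), len(matrix[0])
    # Start from a unit vector with equal weight on every column.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T u, then renormalize so v stays a unit vector.
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The length of A v is the singular value; its direction is u.
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v


# Hypothetical ratings: rows are users, columns are movies.
ratings = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
]

u, sigma, v = leading_singular_triplet(ratings)

# Rank-1 approximation: sigma * u[i] * v[j] is a rough estimate of each rating.
approx = [[sigma * u[i] * v[j] for j in range(3)] for i in range(3)]
for row in approx:
    print([round(x, 2) for x in row])
```

The point of the sketch is the shape of the idea: a big table of ratings gets summarized by a few numbers per user and per movie. In a real workflow you would lean on the SVD implementation in whatever library your data scientists have already chosen.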

Now, if you are looking to, you know, at some point, make a jump, right, to data scientist, if you’re a data engineer, and at some point down the road, you’d like to be…you know, “I want to be the data scientist. I want to say, ‘Hey, this is the algorithm we should use.’” You know, maybe you just want to be a data scientist because, you know, for a couple years running, it’s been the…you know, the sexiest career, you know, in IT for a while, and so, if that’s kind of your approach, you know, definitely start to know them.

Obviously, learn the ones that are in your environment first, because that’s going to be the easiest, because you’re going to have the access to, you know, why you’re using it, how you’re using it, and you have access to the data scientists too, to kind of, you know, take you under their wing, to some extent, and, you know, show you the ins and outs of why you’re using what you did and, you know, kind of why you didn’t use other ones too. For an aspiring data scientist, then yes, for sure, you want to jump in and, you know, start to understand and start to know them. But for a data engineer, I don’t think you have to learn the algorithms, right? I think you have to be familiar with them, I think, you know, for natural curiosity, you know, maybe learn one or two.

But really, our role is not to recommend and say, “Hey, you know, these are the algorithms I think we should use,” or even, like, to pick packages and say, “Hey, these packages here, we’re going to…you know, we’re going to standardize on that and that’s the only thing we’re going to use.” That’s…you know, that’s not really our role, right?

If you have any questions, make sure you submit them to Big Data, Big Questions. You can do it from the website, go to Twitter and use the hashtag #BigDataBigQuestions, or use the comment section there, however you want to get in touch with me and get those questions answered. Also, make sure you subscribe so that you never miss any of this Big Data, Big Questions goodness, and so that you can always, you know, learn more. Thanks again, folks!

The best time to plant a tree was 20 years ago, the second best time to plant one is today.

Okay, so what does this old Chinese proverb have to do with learning Hadoop? Let’s break down the proverb. Trees are awesome when they are huge and provide a ton of shade or have large branches for tire swings. However, to enjoy a tree like this, it has to have been planted a long time ago. I live in a new neighborhood, so I’m out of luck.

Learning a new technology is like planting a tree. Everyone wants to enjoy the shade or be an expert without having to put in the time.

Hadoop & Spark are hot topics right now in the Dev/IT space. Many companies are looking for experts in Hadoop. Truth be told, there aren’t many out there. The technology is new and evolving daily. Just check out this blog post I wrote over a year ago about the popular frameworks; the number of new projects has doubled since that post was written.

Reasons to learn Hadoop today

So why don’t you become an expert in the Hadoop space? The best time to start is today. Sign up for my newsletter to learn how you can become a Hadoop expert.

You can command a higher salary (average 140K/year).

You can get in on the ground floor of the Big Data movement.

You can contribute to the “Big Data” frameworks through the opensource community.

Chance to work with enormous data sets. Well, it is BIG data…

Opportunity to change the world with data.

Huge community support.

Work with some of the biggest companies on the planet: Facebook, Verizon, Netflix, MLB, and more.

Cutting edge technology that is constantly evolving.

Internet of Things

Get to play with Hadoop’s friends Sqoop, Kafka, Pig, Hive, HBase, and many more.

Ever wondered: How did Google get started? What is it like to work at Google from day one? How did Google build an empire in its first 10 years? If so, then pick up a copy of In the Plex: How Google Thinks, Works, and Shapes Our Lives.

Overview

In The Plex: How Google Thinks, Works, and Shapes Our Lives begins with Google starting as a thesis project for Larry Page and Sergey Brin while they were PhD students at Stanford. The founders had great access to resources and talent while working on the project at Stanford, but they soon realized that for their search engine to grow they would have to move from a research project to a company. The founders were not concerned with making money; it was more that the cost of crawling the web and storing that data was too big for a PhD project. The book covers Google from inception to 2011, a span of a little more than a decade. During this time Google starts out as a small start-up renting a garage and focusing solely on Search. In the late 90’s Search was not a profitable business. Since Search was not profitable, Search Engines gave poor results until Google came around. Google’s only real competition in the early years came from Excite, but Excite’s growth was capped because the parent company did not believe Search would be profitable. In the end it turns out Search can be very profitable, and I mean profitable in the tens of billions. In The Plex covers AdSense, Gmail, YouTube, and other Google technologies as well.

Googly Culture

Are they Googly? The culture of Google is modeled on a college campus: employees work like students, and their manager acts more as a professor than a traditional manager. In a typical week an employee works 80% of the time on their assigned project but is allowed to spend the other 20% on a project of their choosing. From the start the founders insisted on hiring research-minded Computer Scientists from elite Computer Science universities; this, combined with the 20% rule, led to many innovations at Google. Many innovations developed at Google were published, but by the time the results were released Google already had a huge lead on its competitors.

Map Reduce

One of the biggest innovations that came out of Google in the early years was the Map Reduce project. The Map Reduce paper was published by Jeffrey Dean and Sanjay Ghemawat in 2004. Map Reduce basically gives Google the ability to process large data sets in a relatively short amount of time by splitting the work across many machines. The paper was the inspiration behind the open-source Apache Hadoop technology used by Yahoo, Facebook, and many others. Many companies that deal with large amounts of data use a Map Reduce related product.
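To make the idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using the word-count example that the Map Reduce paper itself uses. It is an illustration only: it is not Google's implementation or the Hadoop API, it runs in one Python process with no distribution or fault tolerance, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    # In a real cluster each document (or chunk) would be mapped on a different machine.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big questions", "big data"]))
```

Because each call to `map_phase` looks at only one document and each call to `reduce_phase` looks at only one key, both steps can be spread over many machines, which is where the speed on large data sets comes from.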

Should you read it?

Google was a pioneer in big data before it was called Big Data. So much of the big data buzz today is built around the innovations and practices developed by Google. Regardless of your job title, In The Plex will be very beneficial. In The Plex gave me a practical view of how big data and machine learning can benefit the user. Companies are leveraging big data analytics for multiple purposes; by reading In the Plex you can see where it all started.

Tell me what you thought of the book, or maybe you have a similar book that you think I should read. Just post in the comments below.