Training a New Generation of Data Scientists

Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.

Why is Cloudera offering data science training?

The primary bottleneck on the success of Hadoop is the number of people who are capable of using it effectively to solve business problems. Addressing that bottleneck with training has always been a very large part of our mission here at Cloudera, and we are very fortunate to have one of the best training teams anywhere. So far, we have trained over 15,000 Hadoop developers and administrators, and our courses and certification exams are available all over the world.

Right now, one of the biggest barriers to the widespread adoption of Hadoop is the supply of data scientists, the peculiar blend of software engineer and statistician that is capable of turning data into awesome. We’ve started to see data science courses develop at universities like Columbia, the University of Washington, and UC Berkeley (taught by Cloudera co-founder Jeff Hammerbacher). While these courses provide excellent instruction to a new generation of data scientists, that instruction is necessarily limited to the students enrolled at those institutions, and the need for data science training is much broader and much more immediate.

Earlier this year, Jeff and I started working with Cloudera’s training team to distill our experiences at Facebook and Google into a course that would teach the fundamentals of data science: everything from the pragmatic application of machine learning and statistics to business problems to the data ingest and preparation that is so critical in our work. We hope that by sharing our experience and showing how we take advantage of Hadoop to solve problems, we can help address the shortage of data scientists.

What are the goals of the class? What do you expect students to get out of it?

First, we want the course to cover the lifecycle of a data science project, from data acquisition and preparation through model development, production deployment, and evaluation. If you want to sample from a grab bag of methods in machine learning and statistics, there are lots of courses to choose from; we wanted our course to teach students how to build data products.

Second, we want our students to understand that data science isn’t nearly as difficult as it is made out to be. It does involve some new tools and a different way of thinking about problems, but it doesn’t require any skills that can’t be taught to a motivated student and then improved upon with practice.

Third, we want data scientists to understand that they are force multipliers within an organization, and that everything they do should be oriented towards making everyone (decision makers, suppliers, and customers) more effective at using data to make decisions.

Why is the focus on building recommender systems?

Recommender systems are an ideal way to learn about data science with Hadoop, if only because of how simply and clearly a recommendation engine can demonstrate the unreasonable effectiveness of data. But that isn’t the only reason we wanted to build the course around recommendation engines:

The mathematics of recommenders is simple to understand. We wanted the course to be approachable by people who hadn’t taken a math class in quite some time, and other kinds of problems (e.g., building classifiers for predicting ad clicks or fraudulent transactions) require at least a little bit of calculus. We didn’t want the course to get bogged down in the technical aspects of the modeling problem and lose focus on the practical techniques that data scientists need to use in their day-to-day work.

The skills required to build a recommender generalize well to other problems. We felt that the process of building a recommendation engine perfectly illustrated all of the steps involved in creating a data product. No matter what you do in your career as a data scientist (and you will do a little bit of everything, from creating dashboards, to advancing the state of the art in machine learning, to reconciling what your customers say they do in surveys with what they actually do in transaction logs), what you learn in this course will serve you well.
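
To give a sense of how little mathematics a basic recommender needs, here is a minimal sketch in Python of item-based collaborative filtering with cosine similarity. It is an illustration only, not material from the course and not how Mahout implements its recommenders; the toy ratings and the function names (item_vectors, cosine, recommend) are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: user -> {item: rating}. Not taken from the course.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2},
    "carol": {"B": 5, "C": 1, "D": 4},
}


def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    vectors = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors[item][user] = rating
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by a similarity-weighted sum
    of the ratings the user has already given."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, vec in vectors.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(vec, vectors[item]) * rating for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

With this toy data, recommend("bob", ratings) ranks item C ahead of item D, because C was rated by a user whose tastes overlap with Bob's on both of the items he has rated. Nothing here goes beyond sums, products, and a square root, which is the point: the hard parts of a real system tend to be data preparation, scale, and evaluation rather than the formula.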

What are the prerequisites for the course?

The course is appropriate for software engineers, statisticians, and business analysts who are familiar with basic Hadoop commands, Hive, and a scripting language like Python, Perl, or Ruby (the labs in the course use Python). There isn’t any Java programming in the course, but we do discuss and make use of Mahout’s command-line tools to create recommendations. We will also show you how to use R to visualize data and perform simple data analysis tasks.

Data science is an interdisciplinary field, which means that there will be parts of the course that will be more or less familiar to you depending on your background and experience. We also want to emphasize the importance of communication and teamwork in data science: there will be some labs where you guide other students, and others where you may need help from students who have more experience. This is very much by design; no single person is an expert at every aspect of data science, and learning how to work as part of a multidisciplinary team is crucial.

How will the certification program work?

Certifying data scientists is difficult because the real mark of a practicing data scientist is the ability to create data products, which a written exam alone cannot measure.

Cloudera is going to do something new for our data science certification program: we will be combining a written exam that ensures students have a basic set of skills and knowledge with a hands-on exam that is designed to measure both technical ability and the capacity to develop creative approaches to building data products. You won’t be required to take the data science course in order to take the data science certification exam, but it will certainly help. We will be announcing more details about the certification exam process in January, after we’ve had our first cohort of students go through the data science course.

When can I take the course?

I’m so glad you asked. I have the pleasure of teaching the first course in the Bay Area myself, Nov. 14-16, 2012, and our training team will offer the second course in New York on Dec. 12-14, 2012. We will be teaching the course in additional locations based on demand; you can keep an eye on the schedule of public training courses here, and we’re always happy to do onsite training classes that are optimized for the needs of your team. We look forward to seeing you in class.

2 responses on “Training a New Generation of Data Scientists”

Do you think you will be doing your trainings online anytime soon? I think the courses you offer are great, but being a resident of India, it is prohibitively expensive to be able to attend any of your classes. Do you have any tie-ups within India that deliver your trainings?