Big Data, 'Breaking Bad' And Orange Juice

MailChimp's chief data scientist has a unique way of explaining data science to the masses. If you know a little Excel, you're ready to learn.


What's the best way to teach data science to people who lack sufficient training in analytics, computer science, modeling and statistics? A little humor can't hurt, perhaps, particularly when delving into potentially mind-numbing topics like algorithms and programming.

John Foreman, chief data scientist for MailChimp, an email marketing service provider, is doing his part to demystify the dark art of data science by speaking and writing extensively on the subject. His data science blog features an eclectic blend of real-world -- if a bit unorthodox -- examples of big data in action, including fictionalized accounts of the Breaking Bad-style revenue and production dilemmas that drug dealers may encounter.

It's all in fun, of course, and not all of Foreman's data science examples are nefarious by nature.

"The blog was supposed to be like Breaking Bad for data science, so all the examples were from a drug dealer's perspective," he said in a phone interview with InformationWeek.

In his upcoming book, Data Smart: Using Data Science to Transform Information into Insight, Foreman ditches the druggie references -- a decision designed to please his publisher and parents -- and focuses on more agreeable examples, such as how to devise an optimization model to keep bottled orange juice tasting just as sweet throughout the year. (Foreman knows this problem well. Prior to his MailChimp days, he was a management consultant who did analytics work for Coca-Cola, maker of Simply Orange, a not-from-concentrate juice.)
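To give a rough sense of what a blending optimization of this kind involves, here is a minimal Python sketch. It is not taken from Foreman's book, and every name and number in it is hypothetical: the three batches, the "sweetness index" and the per-litre costs are invented for illustration. It uses a brute-force grid search over blend fractions. A production model would more likely use linear programming.

```python
from itertools import product

# Hypothetical juice batches: (name, sweetness index, cost per litre).
# These figures are illustrative only and do not come from the article.
BATCHES = [
    ("early_season", 10.0, 0.50),
    ("mid_season", 12.0, 0.80),
    ("late_season", 14.0, 0.90),
]


def cheapest_blend(batches, target, tolerance=0.1, steps=100):
    """Grid-search three blend fractions that sum to 1.

    Returns (cost, fractions, sweetness) for the cheapest blend whose
    sweetness is within `tolerance` of `target`, or None if no blend
    on the grid qualifies.
    """
    best = None
    for i, j in product(range(steps + 1), repeat=2):
        k = steps - i - j
        if k < 0:
            continue  # fractions would exceed 100 percent
        fractions = (i / steps, j / steps, k / steps)
        sweetness = sum(f * b[1] for f, b in zip(fractions, batches))
        if abs(sweetness - target) > tolerance:
            continue  # blend is too sweet or not sweet enough
        cost = sum(f * b[2] for f, b in zip(fractions, batches))
        if best is None or cost < best[0]:
            best = (cost, fractions, sweetness)
    return best


if __name__ == "__main__":
    result = cheapest_blend(BATCHES, target=12.0)
    if result is None:
        print("No blend on the grid hits the target sweetness.")
    else:
        cost, fractions, sweetness = result
        for (name, _, _), share in zip(BATCHES, fractions):
            print(f"{name}: {share:.0%}")
        print(f"sweetness {sweetness:.2f}, cost per litre {cost:.3f}")
```

The same idea carries over to a spreadsheet: each candidate blend is a row, the sweetness and cost are computed columns, and the answer is the cheapest row that stays within the sweetness tolerance.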

Foreman's message is this: Don't fear data science. In fact, with a little effort, you might even learn it yourself.

"I think of data science as taking raw data and turning it into something you can make business decisions off of. But big data just means you're doing math with a lot of data," said Foreman.

His data science book, available in October, falls somewhere between rigorous, highly technical textbooks chock-full of mathematical equations and lightweight overviews that don't teach algorithms or the process of building data science models.

If you're not a coder, fear not. A spreadsheet is all you need to get started, he claimed.

"The cool thing about a spreadsheet is that you see every step. They're really unsexy, but that's fine because the book is not supposed to be sexy. It's not supposed to mystify people," said Foreman. "We do spreadsheets to show people that data science is tool-agnostic."

After covering eight core data science techniques, the book introduces readers to the programming language R, which is commonly used by data scientists to develop predictive models.

"I say, 'Hey, we're going to start at the very beginning with R, and everything you just did in the previous chapters, we're going to replicate it in R. You're going to see how easy it is, and why everyone does it in code as opposed to a spreadsheet,'" said Foreman.

He added: "You actually build the model yourself, by hand, on a simple example with clear explanations. That way people can feel comfortable, and not just plug in their data and pray that it's going to work."

The book is designed to teach business folks who may feel left behind by the big data juggernaut.

"It seems like there are a lot of people who don't want to raise their hands, speak up, and say, 'Hey, I'm kind of afraid to learn how to code, and learn how to do math at the same time,'" Foreman said. "They feel embarrassed to say that they're left behind."

