mebassett programming

Tuesday, October 23, 2012

I had an opportunity to speak at the IE Group's Big Data Innovation Summit in Dublin last Friday. There was a good mix of people from several different industries - I think I learned more from the audience than they did from me. In any case, I thought I'd post my notes and slides here, in case someone finds them useful or interesting. These are my raw notes, however, so I apologize for any grammatical or spelling errors.

Adventures in Functional Big Data:

How Data Sciences drives non-technology companies to make some cutting-edge technology decisions.

Main idea: Functional programming (when combined with research experience and core business knowledge) can extract value from big data
and resolve the technical challenges by turning the data into an interactive model.

The massive amounts of data available to today's businesses afford us
many opportunities to optimize our business processes and reveal new ways
to create value. But doing so presents a number of research and
technology challenges. I want to tell a story to explain these
problems, and how a combination of functional programming, research
experience, and core business knowledge can be combined to resolve them
and make one's business more effective.

Hi everyone. I'm Matthew. I'm the Senior Data Scientist with Universal
Pictures International. I make machine-learning models to help our
business run better.

I want to talk to you today about our adventures learning from big data.
In particular, I want to tell you about the problems we have in getting
the company on board with big data, and how I've used some unorthodox
technologies (functional programming, e.g., Lisp and Racket) to bring
business managers closer to the data.

To start off, I want to explain the relationship between "big data" and
"the business", and then tell you a story to explain the problems these
two have in getting along, and how functional programming helped bring
them together.

"Big data" and "data science" are still new, vaguely defined terms. In
fact, without some context of your business, "big data" and "data
science" are just interesting academic ideas involving statistics and
computing. It's important to keep in mind that it's the business domain
that makes it come alive.

Now, if you're a technology company, like Google or Zynga, your "big data"
likely comes from web and mobile statistics, A/B testing, et cetera. Your
"data scientist" are combing through it, looking for usage patterns and trends
to exploit, so you can improve your existing technology products, or
create new ones.

But not all of us are technology companies.

Our company isn't quite like that. We're not trying to make a better
product; we're trying to learn which of our products the market wants and when it wants them. We don't have web and mobile statistics.
Instead, we have a variety of different data coming from hundreds of
[dirty] sources. Once we have it all cleaned up, we're hoping it can
tell us how to make better business decisions. I expect that some of
your businesses are similar.

But there's a problem here! There isn't a lot of "experiential
overlap" between the folks who know and run the business, and the folks
[like me] who know and can interpret the data.

Indeed, the directors and senior execs at our company are, of course,
intelligent people who know the business inside and out. But when it
comes to the statistics and mathematics behind "big data", well, they
think "the standard deviation" could be the title of an upcoming film.

Even so, they're aware that this data is out there, and that it holds
these useful nuggets of value. It's a vast dark forest, like Mirkwood
from The Hobbit. It's too big, mysterious, and dangerous for them to
venture into.

But they know that there are these wizards, these Data Scientists, who
can help them navigate the Data Mirkwood and find the treasure. The
challenge here is for the Data Science Wizards to use all their sorcery,
Hadoop and machine learning algorithms and the like, to shed some light
on the Data Mirkwood and present it in a way that makes sense to the
business. We'll see soon how one extra bit of magic, functional
programming, can help them do that job. But before that, how does this
relationship usually play out?

Well, not like an adventure at all. It's a mess!

On the one side of the equation we've got these execs and directors who
have been hearing about all this "big data" and "machine learning"
stuff, and reading articles about how "data scientists" are "sexy".
They're smelling ways to maximize value and want to put it to use.

Our best example of this is our Pointy-Haired Boss; let's call him Paul. Paul practically lives on data. But he doesn't go into the
Data Mirkwood himself. He just asks for reports from it. And he's
always asking for reports. He wants his Excel spreadsheets on this and
his PDFs for that. There's a lot of redundancy, and none of these bits
of data are very intelligent. He doesn't want to get too deep into the
Data Mirkwood. He just wants something relevant that makes sense.

On the other hand, you've got the analysts trying to prepare all these
reports. Just when they have a process for generating Excel
spreadsheets for one thing, Paul the Pointy-Haired Boss asks for a PDF on
another. They're busy jumping around like data bunnies in our big data,
trying to clean up this data source and that data source to keep up with
it all.

It's all fairly superficial stuff - the data bunnies are cleaning up and
reporting pieces of the large set of data, but they don't want to see
the whole thing. And they certainly can't see large scale patterns and
trends that are emerging from it. Paul the Pointy-Haired Boss doesn't
even ask; he doesn't know or care how Bayesian clustering algorithms
would help him make better decisions, and the bunnies don't know how to
do that anyway.

This is where our Data Scientist comes in. We'll call him Duncan.

Duncan the Data Scientist is this mystical wizard. He knows the Data
Mirkwood well; he's got deep sorcery to make it come alive. He can
Hadoop and MapReduce and machine-learn the data backwards and forwards.
But he also knows the business, and can direct Paul the Pointy-Haired
boss to the right questions.

Duncan the Data Scientist is tasked with guiding Paul the Pointy-Haired
Boss through Data Mirkwood. Paul needs to know the trends and patterns
hiding in Data Mirkwood, but he lacks the computational and statistical
know-how to even ask the right questions. Paul knows the business, but
Duncan knows the statistics.

What does Duncan need to be successful at this?

With a CV like this, I have to say that Duncan is sounding less like a
scientist and more like a unicorn! In reality, Duncan has the maths,
stats, and programming background to dig deeper into Data Mirkwood than
do the data bunnies, but his knowledge of the business isn't nearly as
deep as Paul's.

So how can Duncan get Paul into Data Mirkwood? How can he get Paul into
the data in a way that he can understand?

Duncan needs to create a usable interface over the data for Paul.
By "usable interface" I don't mean "data visualization". I mean
something that Paul can interact with, a method he can use to pose
questions within the context of his business. That is, Duncan needs to
use the data to create a place where Paul can ask "what does the data
say about this aspect of my business?" Duncan needs to create something like a "business-domain-specific language" for querying the data.

At my company, the interface we built was a simulation. We
clean up our big data, and we continually feed it through a variety of
regression and classification algorithms. These algorithms then tune
various parameters of our simulation, and enable us to ask questions
like "what happens if this product flops?" or "when will there be room in
the market for this product to survive?". The nature of your usable interface
would be specific to your business, naturally, and would come from a
collaboration between your Duncans and your Pauls.
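To make that concrete, here's a toy sketch of what such an interface might look like in Racket. All of the names, numbers, and the renormalisation rule are invented for illustration; this is not our actual model:

```racket
#lang racket

;; Toy market model: each title has a predicted opening-week share.
;; In a real system these numbers would be tuned by the regression
;; and classification layer, not typed in by hand.
(define predicted-shares
  (hash "Film A" 0.40 "Film B" 0.35 "Film C" 0.25))

;; "What happens if this title flops?" -- drop it and renormalise the
;; remaining titles' shares, so Paul sees who picks up the audience.
(define (what-if-flops title)
  (define rest (hash-remove predicted-shares title))
  (define total (apply + (hash-values rest)))
  (for/hash ([(t s) (in-hash rest)])
    (values t (/ s total))))

(what-if-flops "Film C")  ; remaining shares, rescaled to sum to 1
```

The point isn't the arithmetic; it's that Paul poses the question in business terms ("what if Film C flops?") and never has to touch the data plumbing underneath.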

But building such an interface is no walk in the park either. Say Paul
and Duncan have a good working relationship. They understand each
other, and Duncan is able to translate Paul's concerns into technical
questions to ask about the data. Duncan still has to turn these
questions into programming experiments, and then into a usable interface
for Paul. He's got to do his "data science/machine learning"
programming, and somehow feed that into his "application programming".

Typically, when one is programming in big data, the languages at one's
disposal fall into a sort of grid:

On the horizontal axis we have expressive languages that allow Duncan
to express his statistical and mathematical analysis of the data. But
these don't scale well. R is fine for small data sets, but it chokes a
bit when you try to feed it petabytes of data.

On the vertical axis you have your high-powered technologies. These
are the sorts of things you'd use for MapReduce over your Hadoop
cluster. These are the languages you use to take Paul's questions and
turn them into massively parallel computations that find patterns in the
data.

There's a gap: few languages can do both. Often, Duncan finds himself
prototyping and experimenting with solutions on the horizontal axis,
but then has to re-write them on the vertical axis.

And this is where functional programming comes in. Functional
programming languages like Lisp/Racket fill that void:

Functional languages, like Racket, Clojure, Haskell, Scala, et cetera,
are incredibly expressive. They differ from other languages in that
you don't tell the computer, step by step, what to do.

For instance, in a functional language, you don't do things like "set
the variable x to 5. Add 3 to x and store it in y".

Rather, in functional languages, everything is an expression that
evaluates to a value. You describe programs by writing down a
mathematical transformation from your initial state to your desired state.
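A minimal sketch of the contrast in Racket (not from the slides):

```racket
#lang racket

;; Imperative flavour: step-by-step mutation of a variable.
(define x 0)
(set! x 5)               ; "set the variable x to 5"
(define y (+ x 3))       ; "add 3 to x and store it in y" -> 8

;; Functional flavour: the whole computation is one expression
;; that evaluates to a value. No mutation anywhere.
(define y2 (let ([x 5]) (+ x 3)))  ; evaluates to 8
```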

To do your data analysis, you'd describe the mathematics and statistics
behind it, defining your input data and how the output should look.
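For example (a hypothetical sketch, not our actual analysis code), standardising a column of numbers to z-scores is written as a description of the output in terms of the input, with no mutable state:

```racket
#lang racket

;; Describe the statistics as plain functions over the data.
(define (mean xs) (/ (apply + xs) (length xs)))

(define (std-dev xs)
  (define m (mean xs))
  (sqrt (/ (apply + (map (λ (x) (expt (- x m) 2)) xs))
           (length xs))))

;; The output is defined as a transformation of the input column.
(define (z-scores xs)
  (define m (mean xs))
  (define s (std-dev xs))
  (map (λ (x) (/ (- x m) s)) xs))

(z-scores '(2 4 4 4 5 5 7 9))  ; mean is 5, std-dev is 2
```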

To build Paul's interface, you'd describe the information you want from
him, how to turn that into an answer from the data, and how to return
that answer to him.

Racket is a modern functional programming language, and is incredibly
expressive. It allows Duncan to express his machine learning
algorithms and his statistical analysis, but it's also powerful enough
for him to launch his parallel computations over a large cluster (e.g.
Amazon EC2).
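As one small illustration of the parallel side (a toy sketch, not our production setup): futures give Racket shared-memory parallelism on a single machine; a real cluster job would use distributed places or similar instead.

```racket
#lang racket
(require racket/future racket/list)

;; Split a data set in half and sum the squares of each half in
;; parallel, then combine the partial results.
(define data (range 1 1001))

(define (sum-squares xs)
  (for/sum ([x (in-list xs)]) (* x x)))

(define-values (left right) (split-at data 500))

(define f (future (λ () (sum-squares right))))   ; runs in parallel
(define total (+ (sum-squares left) (touch f)))  ; join and combine
```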

But functional programming itself, and languages like Racket, aren't
actually that new. Even the holy grail of "big data", MapReduce, the
algorithm that makes any of our analysis possible, is based on ideas
from functional programming. The name comes from a common idiom in
functional languages - to map over a data set and reduce it to something
simpler.
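In Racket the idiom is a one-liner; MapReduce distributes this same shape of computation across a cluster:

```racket
#lang racket

(define data '(1 2 3 4 5))

;; map: transform every element; foldl: reduce the results to one value.
(foldl + 0 (map (λ (x) (* x x)) data))  ; sum of squares -> 55
```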

The benefit of using Racket is that Duncan can do everything in the same
stack: he can experiment and find solutions in Racket. He can do his
large-scale computations over the data in Racket. And he can build an
interface in Racket. Paul gets access to Duncan's work more quickly.

Gone are the data bunnies. Paul is working much more closely inside
Data Mirkwood. By using Racket, Duncan has built an interface on top of
the data and given Paul access to much more sophisticated insights,
and everyone's happy.

We use functional programming, especially Racket, on a day-to-day basis
at my company. Our simulation and machine learning algorithms are
running on it now. I'd be happy to answer any questions you may have
about it.

Post-talk thoughts

The above talk is slightly incoherent, and doesn't do a good job delivering on its promises. In particular, I could have done the following better:

It doesn't explain how functional programming is useful for building these interactive models. I just state that it's better, without evidence or examples. I could have talked more about our specific setup, and highlighted where judicious use of Racket made our life easier. But maybe that's best left for another talk or blog post.

Additionally, the audience of the talk (and the panel discussion afterwards) was very interested in the talent-shortage problem that I alluded to. Another speaker, Jean-Christophe Desplat, Director at ICHEC, offered a solution in his talk: hire a team of talented people who can work together. I fully agree. That might make a good talk/blog, too.

Finally, I should have discussed the analogy between a "usable interface on the data" and a "business-domain-specific language for querying the data". That's really the core bit of the talk, and I barely mentioned it!