User Centric Design with Big Data

How do I know? Well, it is simple: almost everyone evaluates situations in the world using metrics that do not represent their goals with high fidelity.

For example, Peter Thiel is a great businessman and smart guy but like everyone else, his metrics are broken. His thesis is that “innovation is dead”.

“If you look outside the computer and the internet, there has been 40 years of stagnation,” said Thiel, who pointed to one of his favorite examples: the dearth of innovation in transportation. “We are no longer moving faster,” Thiel noted. Transportation speeds, which accelerated across history, peaked with the debut of the Concorde in 1976. One decade after 9/11, Thiel says, we are back to the travel speeds of the 1960s.

Is going faster and faster a good measure of progress? Is there a point where transportation is fast enough? It is clear that there is no technological barrier to building faster planes, but society has made it clear that it does not care to invest in that area to gain the extra speed. Maybe another metric, like miles / passenger / (joule of energy used), is more relevant. Or maybe that one is broken too.

Progress != Growth:

Most people associate progress with growth, but GDP growth by itself is not a good long-term goal because it cannot go on forever. If growth is not sustainable then we should not go after it past a certain point. I do not know the right metric to tell how sustainable a unit of GDP growth is, but I do know that a sustainability component is required to fix the metric.

Why this matters a lot

Creating metrics that reflect your goals (as a person, company, country, ..) is important because people and organizations optimize their activity to metrics. If you are a politician who is judged by whether GDP goes up, you will pursue policies that try to increase GDP. If you are a public company that is judged by short-term earnings growth, then you will put a lot of energy into optimizing that.

Fixing metrics is simple but hard

Fixing metrics is very hard in practice but it is conceptually simple because the reason for broken metrics is usually easy to identify.

Top three reasons why most metrics are broken:

The metric is venerable. It used to make sense, but the world changed and it is no longer hi-fi.

The metric is too simple. The world is complicated and goals are similarly complex. Simple metrics usually leave out important factors. People like simple metrics so they get popular and gain momentum.

The metric looks for keys under the lamp post…rather than down the street in the dark where you dropped them. This is related to being too simple, but complex metrics can also have this failing. Some goals are hard to represent with high fidelity using metrics. But that does not stop people from creating metrics to measure those goals. Those metrics are usually chosen for convenience rather than fidelity. An imperfect metric is fine as long as people are aware of its problems and use it accordingly.

Even after you figure out that your metrics are broken, it is really hard to fix them. A hi-fi metric provides real insight into the world, and that is always a challenge. You may even conclude in some cases that there is no simple collection of metrics for a given goal. But fixing your metrics (or your understanding of your metrics) is crucial, because failure follows a bad metric around diligently.

Map-reduce is great. It has made it possible to process insane amounts of data on commodity hardware. However, it is a very low-level programming abstraction, and too low-level for most problems that analysts and “data scientists” encounter.

M-R is the assembly programming of big data. It is vital as the base level of the stack. Just as assembly is unproductive for general programming compared to python, ruby or <your-favorite-high-level-language>, M-R is too low level for doing significant analysis work.

PIG and Cascading (and other languages that build on top of M-R) are built with language constructs that match what analysts need to do (a rough R sketch of these operations follows the list):

load complex data

join multiple data sets

filter rows

project out columns

aggregate based on columns

apply functions to aggregates
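To make that concrete, here is a rough sketch of those operations in R (the file names and columns are made up for illustration; in practice Pig or Cascading would express the same steps over much larger data):

sales <- read.csv("sales.csv")                                # load data (hypothetical files)
users <- read.csv("users.csv")
joined <- merge(sales, users, by = "user_id")                 # join multiple data sets
recent <- subset(joined, year >= 2010)                        # filter rows
slim <- recent[, c("user_id", "region", "amount")]            # project out columns
totals <- aggregate(amount ~ region, data = slim, FUN = sum)  # aggregate based on a column
totals$amount_k <- totals$amount / 1000                       # apply a function to the aggregate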

Very few non-trivial analysis problems map effortlessly onto the map-reduce model. Most problems will require many M-R stages. This can make for brittle code that is hard to maintain. It might seem like you are saving effort by keeping the stack simple and using raw M-R or streaming through python, but productivity will usually suffer.

Core Ideas:

multivariate modeling is challenging

pair plots make it easy to get a quick understanding of each variable and the relationships between them

Multivariate analysis and modeling can be really challenging. Getting the job done well requires you to know your data really well. People often use the metaphor that you know something well if you “know it like the back of your hand”. However, we look at our hands every day but probably could not recall the details of where each freckle or wrinkle is. You want to know your data in a much more detailed way.

One very valuable first step when working with a new multivariate data set is to look at the relationships between each pair of variables. There are a number of ways to do this in R and I often prefer to use two different scatter plot matrix methods to get a feel for the relationships between the variables.

Here is an example using the mtcars dataset in R.

# Keep the first seven columns of mtcars: mpg, cyl, disp, hp, drat, wt, qsec
df <- mtcars[, c(1, 2, 3, 4, 5, 6, 7)]

Scenario(s):

getting to know your numerical data

predictive modeling (feature selection, technique choice,…)

psych::pairs.panels

why use it?

you can see points with an ellipse superimposed in the lower region

you can see the data distribution on the diagonal for each variable

you can see the correlation values in the upper region

works with categorical data

library(psych)
pairs.panels(df)

corrgram::corrgram

why use it?

pie chart in the lower region gives a quick visual view of correlations
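A minimal invocation looks something like the following (this particular choice of panel functions is just one reasonable option among several the package offers):

library(corrgram)
corrgram(df, lower.panel = panel.pie, upper.panel = panel.pts)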

Based on these plots it is easy to see some important high-level relationships between the variables.

mpg is strongly inversely proportional to:

cyl : number of cylinders

disp: engine displacement

hp: horsepower

wt: vehicle weight

mpg is positively correlated with:

drat: rear axle ratio

qsec: time to drive a 1/4 mile

rear axle ratio and weight do not have a strong relationship with the 1/4-mile time. This means that if you want to predict the 1/4-mile time, you would not want to use these as unconditional predictors. In fact, it might cause you to start looking for interactions between the variables so you can do conditional modeling (sketched just after these observations).

rear axle ratio is inversely proportional to wt, hp, disp and cyl. I know nothing about cars, but now I know that heavier, more powerful cars tend to have a smaller rear axle ratio.
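As an illustrative sketch (not part of the original analysis), one way to probe for such an interaction in R is to compare a main-effects model of the 1/4-mile time against one that adds a weight-by-axle-ratio interaction:

m_main <- lm(qsec ~ wt + drat, data = mtcars)  # main effects only
m_int <- lm(qsec ~ wt * drat, data = mtcars)   # adds the wt:drat interaction term
anova(m_main, m_int)                           # does the interaction improve the fit?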

There is also a lot of great basic summary info here:

A distribution plot for each variable

The min and max of each variable

This still only provides a very superficial understanding of the data, but this is a good start. There are lots of different options and ways to use both packages, so you can adapt how you use these functions for your own style and preferences.

I’ve been a big fan of ggplot2 for a long time but plyr has been in my toolkit for less than a year and it is now one of my most-used R packages. It is how aggregate/*apply would have been if they were awesome.

In five lines this code computes the cumulative distribution functions of all of the variables in the iris data set and creates a colored, faceted plot to visualize the data.
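A sketch of one way to do this with plyr and ggplot2 (using base R's stack to get the data into long form; the original five-liner may have differed):

library(plyr)
library(ggplot2)
long <- stack(iris[, 1:4])                                          # long form: values, ind
cdf <- ddply(long, .(ind), transform, ecd = ecdf(values)(values))   # empirical CDF per variable
ggplot(cdf, aes(x = values, y = ecd, color = ind)) +
  geom_line() +
  facet_wrap(~ ind, scales = "free_x")                              # colored, faceted plot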

It also provides perspective on the long history of publication bias and how it impacts results.

My own feeling is that a lot of this comes from the definition of publication. In many cases people are not toiling away on their experiments in isolation; they are blogging, tweeting, etc. about their work. So it is possible that the publication phase is arriving earlier in the process because of pre-print archives, blogs and the like. Patents, tenure, new drugs and higher search ad PPCs are not the only rewards for conducting scientific experiments that yield interesting results. As Facebook/Twitter illustrate clearly, the social rewards for sharing your ideas and getting feedback are enormous for most people.

As someone who has conducted hundreds of massive experiments in the context of search engine product development, I am going to renew my efforts to be sure that I am not kidding myself and amplifying the noise.

This article is not a reason for panic, just a good reminder that proving things is hard and we should always be careful.

I just started a new job (working on social search awesomeness at Bing) and so I had to set up my “dev” environment with all of my usual tools (R, python, vim, etc.). One thing that made this a bit easier is my habit of keeping an R script around that installs all of my common packages for me in one shot.

This is also a nice way to share your list of favorite packages with your friends. Please feel free to share your list of cool packages.
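The script itself is basically just a vector of package names handed to install.packages; a minimal sketch (the package list here is made up from ones mentioned in this post, and the mirror is one common choice) might look like:

pkgs <- c("ggplot2", "plyr", "psych", "corrgram")
install.packages(pkgs, repos = "http://cran.r-project.org")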

R works in many ways and on many different OSes which is great, but it also means that if you share a piece of code the recipient may need to install packages to make it work.

One thing that I do (adapted from a trick my friend Paul Jin showed me) is use the following code block instead of just loading/requiring packages. This ensures that whenever I require a package, it will be downloaded from the appropriate CRAN mirror if it is not already installed.
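Roughly, the pattern is the following (the helper name and the choice of mirror are illustrative, not the original code):

use_package <- function(pkg, repos = "http://cran.r-project.org") {
  # Load the package if it is installed; otherwise install it from CRAN first.
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = repos)
    library(pkg, character.only = TRUE)
  }
}

use_package("ggplot2")
use_package("plyr")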