What would you say are the three factors most strongly associated with the salaries of major-league baseball players? According to a popular, well-established algorithm, the most influential factors are walks, intentional walks and runs batted in.

[Image caption: How much does he earn?]

But a paper recently published in Science reports a new data analysis tool that can find interesting relationships and trends in complex data sets, including relationships that are invisible to other types of statistical analyses.

This could be a big deal: Large data sets with thousands of variables are increasingly common in fields as diverse as genomics, physics, political science, economics and more, so there is an increasing need for data analysis tools to make sense of such complex data sets.

It all started when Yakir Reshef, who is now a visiting Fulbright Scholar at the Weizmann Institute, was an undergraduate at Harvard University. Together with his older brother, David Reshef, then a master’s student at MIT, he became interested in large data sets containing relationships whose type is unknown. In a collaboration that began in North America and crossed the Atlantic Ocean as Yakir moved to Israel, the two developed a new algorithm that could discover unexpected yet important relationships that would otherwise go unnoticed.

The tool the two developed – under the guidance of advisers Michael Mitzenmacher of the Harvard University School of Engineering and Applied Sciences and Pardis Sabeti of the Broad Institute – is named the maximal information coefficient, or MIC, and it scores pairs of variables based on how closely related they are. Researchers can calculate MIC on each pair of variables in their data set, rank the pairs by their scores (the higher the score, the more closely related the pair), and then examine the top-scoring pairs – that is, the pairs that are most strongly associated with each other.
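The score-every-pair-then-rank workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `mic_like_score` is a hypothetical, simplified stand-in that only tries equal-frequency grids, whereas the published method searches over grid placements much more thoroughly. The function names, the 0.6 exponent default and the synthetic data are all assumptions made for the example.

```python
import itertools
import numpy as np


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    ranks = np.argsort(np.argsort(values))  # ties are broken arbitrarily
    return (ranks * n_bins) // len(values)


def grid_mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information, in bits, of two binned variables."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1)  # joint histogram of bin indices
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))


def mic_like_score(x, y, exponent=0.6):
    """Simplified MIC-style score (assumption: equal-frequency grids only)."""
    max_cells = len(x) ** exponent  # cap on grid size, grows with sample size
    best = 0.0
    for nx in range(2, int(max_cells // 2) + 1):
        for ny in range(2, int(max_cells // nx) + 1):
            mi = grid_mutual_information(
                equal_frequency_bins(x, nx), equal_frequency_bins(y, ny), nx, ny
            )
            # Normalizing by log2(min(nx, ny)) keeps the score between 0 and 1.
            best = max(best, min(1.0, mi / np.log2(min(nx, ny))))
    return best


def rank_pairs(columns):
    """Score every pair of named variables and sort from most to least related."""
    scored = [
        ((a, b), mic_like_score(columns[a], columns[b]))
        for a, b in itertools.combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Synthetic data (made up for this sketch): a noisy parabola and pure noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 300)
columns = {
    "x": x,
    "parabola": x**2 + 0.02 * rng.normal(size=300),
    "noise": rng.uniform(size=300),
}
ranked = rank_pairs(columns)
pearson = float(np.corrcoef(columns["x"], columns["parabola"])[0, 1])
for pair, score in ranked:
    print(pair, round(score, 2))
print("Pearson correlation of x and parabola:", round(pearson, 2))
```

In this toy data set the parabola is a strong but non-linear relationship, so a linear correlation coefficient sits close to zero while a grid-based score ranks the pair first. Real data sets with thousands of variables would need a faster and more careful implementation than this sketch.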

[Image caption: Associations between bacterial species in the gut microbiota of “humanized” mice]

To test whether the algorithm actually works, Yakir and David worked with Ph.D. student Hilary Finucane, of the Weizmann Institute’s Mathematics Department (and while we are on the subject of relationships, Yakir’s fiancée). The three applied MIC to both known and novel data sets in global health, gene expression, the human gut microbiota, and – you guessed it – major-league baseball, and compared the results to those of current methods.

In one example, they examined data from the World Health Organization, covering 200 countries and containing 357 data variables per country. One interesting relationship they found was between female obesity and income: In the Pacific Islands, obesity increases monotonically with income, a finding that contrasts with results from other countries. Was this an anomaly they were seeing? On the contrary – obesity is considered a sign of status in the Pacific Islands. But while most methods would treat this separate trend as noise, MIC is able to identify relationships, such as this one, that include more than one trend.

The researchers explain that two attributes set MIC apart from other data analysis tools: It assigns high scores to a wide variety of relationship types hidden in large data sets, and it gives similar scores to relationships with comparable amounts of noise, whatever their type. In other words, they say, it can find “cool things going on” that are unexpected and therefore difficult to detect with other types of analyses.
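For readers who want the formula behind the score, the expression below summarizes how MIC is usually stated. It is a sketch rather than a full specification, and the symbols ($D$, $n$, $x$, $y$, $I^{*}$, $B(n)$) do not appear in this article; they are introduced here for illustration.

```latex
\mathrm{MIC}(D) = \max_{x y < B(n)} \frac{I^{*}(D, x, y)}{\log_2 \min(x, y)},
\qquad B(n) = n^{0.6}
```

Here $D$ is a set of $n$ paired observations, and $I^{*}(D, x, y)$ is the largest mutual information obtainable by laying a grid with $x$ columns and $y$ rows over the scatterplot of the pair. Dividing by $\log_2 \min(x, y)$ keeps every score between 0 and 1, so grids of different sizes, and relationships of different shapes, can be compared on a single scale. The cap $B(n) = n^{0.6}$ is the commonly cited default, not a fixed part of the definition.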

So what about baseball? MIC's results differ from those of the traditional method: Rather than walks, intentional walks and runs batted in, it ranks hits, total bases and how many runs a player generates for a team as the most influential factors. So, which of the two rankings is correct? The researchers have wisely opted to step aside, leaving it to baseball enthusiasts to decide which factors are – or should be – more strongly tied to salary!

Comments

Is the claim that the algorithm was able, without assistance, to classify certain nations as belonging to Polynesia? If so, you really ought to be writing about that. If not, if it just depends on human intervention to assign some relevance to a correlation that is positive in some cases and negative in others, then it’s just spotting relationships within arbitrary subsets of a total population while ignoring the rest – in other words, a conspiracy theory generator!

Actually, it found a correlation between obesity and economic status in Polynesia in a large data set on global health. In other words, it identified a true (and known) trend that would probably have been discarded as noise in such a large pile of information using other statistical methods.