The Delusions vs the Effectiveness of Big Data

Recently, I came across two articles. One argued that the value of big data is overhyped. The other showed how large amounts of data can be exploited to make remarkable discoveries. I found their contrasting viewpoints interesting. Below are some highlights from each.

Summary of “The Delusions of Big Data”

If you test many hypotheses against high-dimensional data without maintaining statistical rigor, some of your inferences will likely be false. A database may have thousands of people in it. Each person may have millions of features. The number of possible feature combinations grows exponentially, so if you start searching through them, some combination is bound to predict what you are looking for purely by chance.
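The effect is easy to reproduce with pure noise. The sketch below is my own illustration, not code from the article: the function name, the population size, and the feature counts are all assumptions chosen for the demo. It assigns each person a random binary outcome, generates random binary features, and reports the accuracy of the best "predictor" found.

```python
import random

def best_spurious_accuracy(n_people, n_features, seed=0):
    """Return the highest accuracy any purely random feature reaches on a random outcome."""
    rng = random.Random(seed)
    # The outcome is a coin flip per person, so no feature can truly predict it.
    outcome = [rng.randint(0, 1) for _ in range(n_people)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_people)]
        agreement = sum(f == y for f, y in zip(feature, outcome)) / n_people
        # A feature that mostly disagrees is just as "predictive" once inverted.
        best = max(best, agreement, 1.0 - agreement)
    return best

few = best_spurious_accuracy(n_people=50, n_features=10)
many = best_spurious_accuracy(n_people=50, n_features=10_000)
print(f"best of 10 random features:     {few:.2f}")
print(f"best of 10,000 random features: {many:.2f}")
```

Every feature here is noise, yet the more features you search, the better the best one looks. That is the trap: the winning feature was selected because it matched, so its accuracy on the data used to find it says little about how it will perform on new data.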

To avoid acting on bad predictions, we must understand the quality of our inferences. Errors must be quantified; error bars and error rates are necessary. We cannot simply explore the data and make decisions without considering the quality of the experiment (noisy data, sampling patterns, heterogeneity, etc.).
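One common way to put an error bar on an estimate is the percentile bootstrap. The article does not prescribe a method, so treat this as a minimal sketch of one option; the measurements in `sample` are made-up numbers, and `bootstrap_ci` is a hypothetical helper name.

```python
import random
import statistics

def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement many times and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Hypothetical measurements, for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.2, 4.9]
low, high = bootstrap_ci(sample)
print(f"mean = {statistics.fmean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the mean makes it clear how much the estimate could move under resampling. It does not correct for biased sampling or heterogeneity in how the data was collected, which still have to be examined separately.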

Data science involves hard engineering and mathematics. It will take time to get things right. There is too much hype about what can be gained from big data. We will make steady progress over decades but no major leap in understanding will happen quickly.

Summary of “The Unreasonable Effectiveness of Data”

Unstructured, unlabeled data is far more plentiful than structured, labeled data. Take the time to develop intelligent, unsupervised learning that exploits big data rather than relying on smaller, curated corpora.

Because data and its relationships are so complex, use nonparametric models to represent your data. Such a model preserves the fine detail of the data and can grow in proportion to the data’s complexity and size.
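To make "nonparametric" concrete, here is a small sketch using k-nearest-neighbors regression. This is my choice of example, not a method named in the article, and the training points are invented. The model has no fixed set of coefficients: it is the stored data itself, so it gains detail whenever data is added.

```python
def knn_predict(train, x, k=3):
    """Predict y at x as the mean y of the k training points nearest to x."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in neighbors) / len(neighbors)

# Invented (x, y) pairs, for illustration only.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0), (4.0, 16.0)]

before = knn_predict(train, 2.4, k=1)
print(before)  # 4.0 (nearest stored point is x = 2.0)

# Adding one observation refines the model locally, with no retraining step.
train.append((2.5, 6.25))
after = knn_predict(train, 2.4, k=1)
print(after)  # 6.25 (nearest stored point is now x = 2.5)
```

A parametric model, such as a straight-line fit, would summarize all of these points with two numbers no matter how many points arrived. The trade-off is that a nonparametric model must store and search its data, which is where the engineering effort goes at scale.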

When processing natural language, make use of the data’s context to find established concepts. There are already many existing relationships to help you understand how data should be labeled and categorized. Use what is there instead of inventing new methods or concepts.