Featured Blog Posts – October 2014 Archive (55)

The constant search for something bigger might be part of American culture. However, big data is often critical: without real-time credit card fraud detection, itself a big data application, no store would accept credit cards.

There have been a few people questioning the value of big data recently and predicting that big data is going to get smaller in the future. While most of these would-be oracles are traditional statisticians working on small data and worried about their…

Summary: We’ve scoured the literature to bring you a complete listing of possible definitions of Big Data, with the goal of being able to determine what’s a Big Data opportunity and what’s not. Our conclusion is that Volume, Variety, and Velocity still make the best definitions, but none of these stands on its own in distinguishing Big Data from not-so-big data. Understanding these characteristics will help you analyze whether an opportunity calls for a Big Data solution but…

Last weekend, I was waiting in New York’s Penn Station, when the public announcer gave the familiar “See Something Say Something” message. It took a minute to sink in, but I had to laugh. Midtown Manhattan IS suspicious and unusual activity.

Speaking of outliers

In practice, data is dirty and big data is filthy. Analysts munge, wrangle and clean their…

When we perform machine learning for classification, the target variable is a categorical (nominal) variable that has a set of unique values, or classes. It could be a simple two-class target variable like "approve application?" with classes (values) of "yes" or "no". Sometimes the classes indicate ranges, like "Excellent", "Good", etc., for a target variable like satisfaction score. We might also convert continuous variables like test scores (1 - 100) into classes like grades (A, B, C…
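
The conversion of a continuous score into classes described above can be sketched as follows. This is a minimal illustration, not the post's own method: the cutoffs (60, 70, 80, 90), the extra grades "D" and "F", and the function name `score_to_grade` are all assumptions, since the excerpt does not state them.

```python
from bisect import bisect_right

# Hypothetical grade boundaries; the excerpt does not specify any cutoffs.
CUTOFFS = [60, 70, 80, 90]
# One more label than cutoffs: scores below 60 map to "F", 90 and above to "A".
GRADES = ["F", "D", "C", "B", "A"]


def score_to_grade(score):
    """Map a test score in the range 1-100 to a letter-grade class."""
    if not 1 <= score <= 100:
        raise ValueError("score must be between 1 and 100")
    # bisect_right counts how many cutoffs are <= score,
    # which is exactly the index of the matching grade.
    return GRADES[bisect_right(CUTOFFS, score)]


if __name__ == "__main__":
    for s in (59, 60, 89, 95):
        print(s, score_to_grade(s))
```

The resulting grade can then serve as the categorical target variable in place of the raw score.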

As a long-term member of the Linked Data community, which has evolved from W3C's Semantic Web, I have found the latest developments around Data Science more and more attractive because of their complementary perspectives on similar challenges. Both disciplines work on questions like these:

In this article, I compare two approaches (with their advantages and drawbacks) to computing a simple metric: the number of unique visitors ("uniques") per year for a website. I use the words user and visitor interchangeably.
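
For readers unfamiliar with the metric, here is a minimal sketch of an exact yearly uniques count. The excerpt does not say which two approaches the article compares, so this is not necessarily either of them; the function name `uniques_per_year` and the `(year, user_id)` input format are assumptions made for illustration.

```python
from collections import defaultdict


def uniques_per_year(visits):
    """Count distinct users per year.

    `visits` is an iterable of (year, user_id) pairs, a hypothetical
    log format chosen for this sketch.
    """
    seen = defaultdict(set)
    for year, user_id in visits:
        # A set keeps each user only once per year, however often they return.
        seen[year].add(user_id)
    return {year: len(users) for year, users in seen.items()}


if __name__ == "__main__":
    log = [
        (2013, "a"), (2013, "b"), (2013, "a"),
        (2014, "a"), (2014, "c"), (2014, "c"),
    ]
    print(uniques_per_year(log))
```

Keeping one set of user IDs per year gives an exact count, at the cost of memory that grows with the number of distinct visitors.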

I’ve been thinking a lot about data, where it comes from, and what it looks like. I can’t help it. I’ve been a data geek for almost 15 years. And I find data beautiful. Not necessarily in its raw form, mind you. Then it’s just messy and more often than not a pain to deal with, especially when it gets really, really big. But when smart, creative people start to clean it up and use it in different ways to find the hidden stories that make sense, it can help us learn things in ways that we…

Given the nature of the community, presumably many visitors already have a strong understanding of the nature of quantitative data. Perhaps more mysterious is the idea of qualitative data, especially since it can sometimes be expressed in quantitative terms. For instance, "stress" as an internal response to an externality differs from person to person; yet it would be possible to canvass a large number of people and express stress levels as an aggregate based on a perceptual gradient: minimal,…

This happened tonight, shortly after Facebook made the same decision. Even Bit.ly itself is banned; see the picture below. This happens only with Chrome, not with other browsers such as IE or Firefox. The ban will probably be lifted in several hours.

For a very long time, businesses had their documents filed in folders and stored in huge metal cabinets. But thanks to advances in technology, they were eventually coded and stored digitally. As we advance through the Age of Information, the traditional digital storage devices like the floppy, compact, and flash…

Or, to put it differently, when your metrics lie to you: how do you find out, and what should you do?

The purpose of this article is to make Google aware of the problem so that it can fix its Google Analytics reports (by filtering out the fake traffic). This scheme also impacts many companies computing website rankings. Tons of websites now have their traffic…