Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.

Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.

OK, let’s pick up where we left off. In my last post, we examined the first step in any big data processing engine: searching and filtering. In essence, the goal is to separate the relevant data (the signal) from the irrelevant data (the noise). If you’ve analyzed any form of big data, you probably noticed that the signal-to-noise ratio is pretty low. Most of the data are noise, and only a tiny fraction is signal. The question then is, why do we need big data?

Depending on which camp you’re from, there are many answers to this excellent question. I will talk about three in this post, but keep in mind that there are many more. Just because you can track, store, and analyze big data doesn’t mean you should.

The Uncertain Future

One of the most common arguments favoring big data is that data is versatile and doesn’t really have a shelf life. Even though you don’t need it today, its relevance and utility may become apparent in the future. And since you never know what you might need in the future, you might as well store everything that you can now.

This argument is almost tautological, meaning it is nearly irrefutable no matter how you interpret it. Since the future is inaccessible (at least for now), and humans are risk averse, we will always want to hedge against the unknown future. The only question left is how cheaply we can track and store these data. If it is cheap, this approach makes sense!

Although data storage is relatively cheap these days, there are hidden costs in a big data initiative beyond the mere cost of hard drives. Since big data are so big that they can neither be stored in nor analyzed on conventional databases, you need a completely new stack of technology for their capture, storage, and analysis. This stack is known as the SMAQ stack (i.e. Storage, MapReduce, and Query). One of the most popular SMAQ stacks is based on Hadoop, an open-source implementation of Google's MapReduce framework and the Google File System (GFS). So the SMAQ stack itself isn’t expensive. The real cost is the new talent needed to use this stack effectively, so enterprises can derive insights from the big data.
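For readers who haven't seen the paradigm before, the MapReduce part of the stack can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not actual Hadoop, and the brand names are made up:

```python
from itertools import groupby

# Toy MapReduce: count brand mentions in a stream of social posts.

def map_phase(posts):
    # Map step: emit a (brand, 1) pair for every brand mention found.
    brands = {"acme", "globex"}
    for post in posts:
        for word in post.lower().split():
            if word in brands:
                yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort step, then reduce: group pairs by key and sum the counts.
    pairs = sorted(pairs)
    return {k: sum(c for _, c in g) for k, g in groupby(pairs, key=lambda p: p[0])}

posts = ["Acme rocks", "I prefer Globex over Acme", "lunch time"]
counts = reduce_phase(map_phase(posts))
# counts == {"acme": 2, "globex": 1}
```

In a real SMAQ stack the map and reduce steps run in parallel across many machines over data far too large for one disk; the logic, however, is exactly this simple.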

Despite the fact that big data technology is relatively cheap, the total cost of ownership (TCO) of any big data initiative may still be quite high when you factor in the cost of human resources. So, big data is definitely an investment that may not be right for everyone.

Your Signal is My Noise

Let’s look at a different argument for big data. Although any one party's relevant data is not big at all, the overlap between different parties' relevant data is also tiny. What is relevant to me may be completely useless to you, and vice versa; your signal is probably somebody else's noise. Since we usually don’t know who will be looking at these data, we must store everything we can in order to serve everyone.

The small overlap in relevance is most apparent with Data as a Service (DaaS) vendors, such as Social Media Monitoring (SMM, a.k.a. Listening) platforms. If you are a company or a brand using SMM, you are probably concerned with the conversation about you and your competitors. That is actually a very tiny fraction of the conversation on social media, because there are conversations about hundreds of thousands of different brands out there. Every brand will be interested in the conversation about itself, and every brand will have a different set of competitors. Since no one knows which brands will subscribe to which DaaS, DaaS vendors need to be prepared to serve all brands by storing all conversations on the social web.

Now, if you are not a DaaS provider (e.g. SMM or VRM), you might not need all these “big” data. For a brand, all you really need are the conversations about you and your competitors. There are several options for getting these data:

1) You can capture and store the data yourself.

2) You can buy the data (with a big check).

3) You can subscribe to a DaaS provider and get the data at a much lower cost.

Maybe You Don’t Need Big Data

Both arguments above hinge on the fact that the precise use of the big data is unknown. We don’t know what questions we may need to answer, and we don’t know what data can help answer them.

Sometimes, however, we do know the questions we need to answer. In fact, we often have some very specific business questions with regard to social media: What is the ROI? Which technology is most engaging? Who are your most valuable influencers? In these cases, you don't need “big” data. You just need the “right” data, the relevant data, the precise data that addresses your question! And that is usually a pretty small data set; sometimes it can even be loaded and analyzed on a beefy personal computer.

Conclusion

Alright, there are probably hundreds of reasons for and against big data. I’ve talked about three here; what are your arguments for or against big data?

Although there is little dispute about the utility of big data, collecting and storing these data yourself may not be the most economical way to get them. So when should you start thinking seriously about your own big data initiative?

1) If you have access to the talent and can do it cheaply; that includes the talent to extract and analyze the relevant data in order to derive insights and value from it.

2) If you are a DaaS provider and need the data to serve your customers.

3) If you have specific questions, then all you really need is the “right” data, which is usually not big at all!

Finally, a little preview of what’s coming. Without going too deep into the technical details, next time we will address the question of where big data comes from and how it got so big. See you next time!

I am with you on the reasons you mention for big data. One other point I will add is that, unlike a decade ago, firms are now forced to look at more and more unstructured data because of the social and mobile explosion (as opposed to a relational table of transactional information). This, along with the need to merge structured and unstructured data, is another possible reason to move into the realm of "big" data.

Having said that, I am also of the opinion that firms should not just collect data for the sake of collection. They should have a clear vision of what they want to do with that data and how it would help with their business objectives. While memory is cheap today and it is possible to store large amounts of data, processing large amounts of data is still a tricky issue and adds complexity to the analysis.

Great post!! Another relevant data point before you begin is to at least try to understand the scope of the return on the initiative. For example, if managing big data on a plant floor at a manufacturing company can drive significant operational improvements that lead to better results, the reason for such an initiative may be fairly clear. If you are uncertain what the benefit of big data is in your social media, maybe you can defer the decision until the *reason* you are doing it and expending all the resources is clearer...

This post was very helpful. There is so much hype about Big Data that this post helps better explain the cons of Big Data, which I have not seen discussed much. It seems like small companies have to do this as *aaS, with platform, infrastructure, analytics, and data. I might have missed something. Perhaps you can give us some idea.

You are absolutely right. In fact, the next article in this series will examine what characteristics cause data to grow rapidly and eventually become big data. Structured or unstructured is actually not that important. There are plenty of unstructured data sets that are very small. Of course they can be big, as in the case of social media or mobile data. But structured vs. unstructured is not the reason why big data is so big. And we'll talk about that next time.

I think most firms do want to make use of their big data if they have it, but they just don't know how. My feeling is that they just don't have the right talent. Data scientists are very different from one another: a support vector machine expert and a Markov random field analyst can solve very different problems, but unfortunately they are all labeled as data scientists in the industry, because the industry doesn't know how to tell them apart.

What you said is very true. However, in most realistic scenarios, it is very difficult for anyone to predict, or even make a reasonable guess at, the ROI of big data (or any data for that matter). In most situations, you won't know the ROI until you get the data, analyze it, use the insights to change or optimize parts of your business, and measure the outcome again.

So I see big data as an investment. You have to just try it. That is why in this article I say that if you can do it cheaply, then definitely do it. Otherwise, buy it, or subscribe to a DaaS to get the data you need; in most cases that will be cheaper.

Realistically, if you want to wait until the benefit is clear, then you can pretty much wait forever.

Thank you for commenting on my blog. I'm very glad to hear that you find this blog useful.

There is definitely hype, but I would say that big data (or just any data in general) probably has tremendous latent value. It just has to be extracted with the right searching and filtering technologies, and then analyzed with the right statistical techniques. This may not be so easy.

I would say that for small and midsize businesses (SMBs), it may be more economical to just subscribe to a DaaS to get the data they need. Large enterprises have economies of scale on their side, but the up-front cost of big data may still be too expensive for SMBs. If you only want the data, why pay for the whole data collection mechanism, storage infrastructure, analytics platform, etc.? DaaS is a more economical approach, because all the SMBs share the cost of the data infrastructure.

But as data products become more prevalent, this may change, and SMBs may one day be able to afford the cost of the data infrastructure themselves.

Thanks for the response. I agree with you that it is not purely a matter of structured vs. unstructured data that ends up as big data.

What I meant to say was that, compared to the environment of yesteryear, (a) there are more sources of unstructured data that are increasingly becoming critical to doing business (social and mobile being two of the most important ones), and (b) the sheer frequency and volume of these "unstructured" interactions, along with the need for a multi-channel outlook, corner organizations into looking at big data and finding ways to reduce it and extract insights from it.

We've already had the chance to duke it out on Twitter! I'll post more on this in another comment later, and then post on my blog within a few days, but I thought it would be a good idea to kick things off tonight.

First, let me reiterate how enjoyable our Twitter back-and-forth has been; I hope we can meet in person sometime!

I think we're at odds over what we mean by "noise" and "irrelevant". I admit I was thinking of noise in the way most data miners or statisticians in customer analytics think about it, namely, "unexplained variance". In this case, there is information in the data, but we can't explain it because we don't observe what is causing the data to appear as it does.

You have brought memories flooding back over the signal and image processing view of noise, which is clearly where you are drawing the "irrelevant" label from. In this case, the noise truly is uninteresting and just gets in the way of identifying the signal. In signal processing, then, we try to filter out the noise to get at the signal; if the noise is linear, simple LMS filters suffice. But if the noise is nonlinear, or not random (white), then it can take a lot of samples to differentiate signal from noise using nonlinear filters. Michael, you know all this far better than I do, but I am just reiterating for those who don't think about data this way. In these cases, we typically use time-series techniques to filter the noise, whether simple filters, Kalman filters, or neural networks.
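For those who haven't worked with these filters, here is a minimal sketch of the simplest one, a moving average: white noise tends to cancel out across neighboring samples, leaving the underlying signal. A Kalman filter or neural network is far more involved, but the basic idea of separating signal from noise is the same (the noisy data here are invented for illustration):

```python
def moving_average(signal, window=3):
    # Average each sample with its neighbors; high-frequency (white)
    # noise mostly cancels, while the slow-moving signal survives.
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed

# A clean ramp (1, 2, ..., 7) corrupted by small alternating noise:
noisy = [1.2, 1.8, 3.1, 3.9, 5.2, 5.8, 7.1]
smoothed = moving_average(noisy)  # values land closer to the true ramp
```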

OK, back to big data, noise, and irrelevant data. (I can't help but continue!) Let's take an example with lots of data, such as credit card transactions. Millions... no, billions of transactions. Big data. Lots of customers. My heartburn over the "irrelevant" label is that I was thinking about these kinds of applications. For credit card fraud, *all* of the data is important: both the normal transactions and the abnormal (suspicious or fraudulent) transactions. Even though we are focused on the fraud cases, we must have the normal, regular transactions so we can differentiate normal from abnormal. Even stratified sampling can cause problems, because it can distort the interactions between measured fields. So when I was arguing that no data is irrelevant, I was really arguing that all transactions are needed: some to represent normal, some to represent abnormal.

That stated, in the modeling projects I've worked on, there are always some examples we toss, because they:

1) are kinds of transactions we'll never see again,

2) have significant errors in the transaction data itself, or

3) represent transactions from cards that we don't want represented in the data (already under investigation, for example).

In these cases, these transactions are for my purposes irrelevant, because they are not representative of what my deployed models will be seeing, and therefore I don't want my predictive models to be biased by them or even care about them. However, I wouldn't really characterize these records as "noise," though in a sense they are like the "noise" or "clutter" in signal processing in that they are uninteresting.

I hope this helps to describe different views of "signal" and "noise" depending on the type of problem we are trying to solve.

I just wanted to follow up on our Twitter discussion from yesterday (thanks to you and Dean, BTW). I also commented via Twitter to Dean this AM about relevance, but wanted to elaborate on that as well.

I started my career with "Big Data" (well, in 1998 we thought terabytes were big) at a firm that had Discover Financial Data as a client, so I got to "cut my eye teeth" on transactional and credit risk data at its finest. That was many moons ago, but things haven't really changed that much in the analysis of data.

Let's go back to the basics of data and statistical analysis: you don't just pull relevancy out of a hat. It is decided with correlations, homoscedasticity, heteroscedasticity, goodness of fit, experience... etc. If you find a variable has no significance, then it becomes irrelevant to the task (though, as Michael says, you may need it for another task). I added experience to what would otherwise be a purely mathematical list because, even with all the tests we can run, there are things that will stand out, or jump off the page as I like to say, that mathematics wouldn't have picked up on.

For example, everyone talks about data, but no one talks about the quality of the data: "cleansing" data and ensuring the metadata is correct. I have run tests that gave me results that didn't make sense; my gut told me something was up, not mathematics. I go back and see a column has been shifted over in a percentage of the data, and that I have characters in my numerics or vice versa. Long story short: What is your hypothesis? What variables do you have to work with? Are there correlations or redundancies? Is the data clean (garbage in, garbage out)? Is there metadata? Missing data? The correct format of data (char vs. num)? Can I do a RANUNI (i.e., take a smaller sample to work with to expedite the assignment)? What systems am I on, and what load can they take? What tools do I have at my disposal? What is my budget? Structured vs. unstructured, parsed or not...
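To make the "characters in my numerics" check concrete, here is one tiny sketch of that kind of sanity test; the column name and values are invented for illustration:

```python
def audit_numeric_column(values):
    # Flag entries in a supposedly numeric column that fail to parse --
    # the symptom of a shifted column or char/num mix-up described above.
    bad = []
    for i, v in enumerate(values):
        try:
            float(v)
        except (TypeError, ValueError):
            bad.append((i, v))
    return bad

# A hypothetical "amount" column with two corrupted entries:
rows = ["19.99", "42", "N/A", "3.50", "approved"]
audit_numeric_column(rows)  # → [(2, 'N/A'), (4, 'approved')]
```

Running a handful of checks like this before any modeling is exactly the "garbage in, garbage out" insurance being described.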

There are many decisions to be made each time you sit down to run an analysis, and the data deemed important and relevant for this task may not be the same for the next. You never want to throw the baby out with the bath water, but yes, there may be some data that just isn't useful (noise) for "this" task.

Thank you for your post and the lively discussion that followed, have a great data day!

@Dean - Having been in the analytics field for quite a long while, I can see your point of view and I think it is valid. But I also see this as a semantic difference in how one defines noise for one's analysis.

One way of tackling 'noise' in general (imo) is by taking a funnel approach: look at the data and remove random fluctuations and errors first, then drill down to removing irrelevant (but "good") data, the latter being more in line with Michael's filter concept.

Since my company is a Big Data database provider and I'm not a consumer or analyzer of Big Data, I find both arguments of signal and noise equally compelling. My only contribution here is to agree that even though disk storage is much cheaper than it has ever been, and technologies such as Hadoop can leverage MPP to scale and analyze extreme data sets, data growth and the desire to keep more data will always outstrip capacity. The only way, in our opinion, to cope is to de-duplicate and compress the data, and thereby limit its physical growth while sustaining the ability to query and access it without penalty or loss of content.
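As a rough sanity check on that claim: highly repetitive data (and transaction logs are highly repetitive) shrinks by orders of magnitude even with a generic codec, and dedicated de-duplication does better still. The record format here is invented for illustration:

```python
import zlib

# 10,000 near-identical transaction records, as raw bytes.
records = ("2012-05-01,PURCHASE,19.99\n" * 10000).encode()

# Generic DEFLATE compression; redundancy collapses into backreferences.
compressed = zlib.compress(records)

ratio = len(records) / len(compressed)  # well over 100x for data this repetitive

# Lossless: the original is fully recoverable for querying.
assert zlib.decompress(compressed) == records
```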

Thank you all for the engaging and stimulating conversation. Since all of you are in this conversation, I will just write one long reply here. It’s very fun. I wish we had more of this type of discussion, and I wish I had more time for it.

;-)

First, let me try to pitch in with my perspective on noise. I actually thought about this when I was doing stats and machine learning way back. And I think I have a way to reconcile the different perspectives. At first I was going to write it up as a full-blown blog article by itself, but since you all responded here, I’ll just continue the discussion in this thread. The two different perspectives are:

(A) If we focus on the problem, then anything that does not help us address the problem is not relevant to the problem at hand. If the problem is fraud detection, then yes, you need data from both the normal as well as the fraudulent activities. Fraud detection is equivalent to the statistical problem of outlier detection, which is a subclass of problems under the umbrella of density estimation. And density estimation does require a lot of data.

But there are still lots of irrelevant data for this fraud detection problem. Any data that would contaminate the underlying distribution of normal behaviors (or the distribution of fraud behaviors) would be irrelevant. As Dean mentioned, these may be (a) transactions you will never see again, (b) known errors in the data, or (c) data that we don't want representing the underlying distributions of either normal or fraud behavior. If we take perspective (A), these would be considered noise (with respect to the problem).
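As a toy illustration of fraud detection as outlier detection: estimate where "normal" behavior sits, then flag transactions far out in the tails. A real system would estimate the full density of normal behavior over many fields; this sketch just uses a robust z-score over invented transaction amounts (median and MAD, which a large outlier cannot mask the way it inflates a mean and standard deviation):

```python
import statistics

def flag_outliers(amounts, threshold=3.5):
    # Estimate the center and spread of "normal" behavior robustly,
    # then flag anything far out in the tails as a potential outlier.
    med = statistics.median(amounts)
    mad = statistics.median([abs(x - med) for x in amounts])
    # 0.6745 rescales the MAD into standard-deviation units for Gaussians.
    return [x for x in amounts if mad and 0.6745 * abs(x - med) / mad > threshold]

amounts = [20, 35, 18, 42, 25, 30, 22, 5000]  # one suspicious charge
flag_outliers(amounts)  # → [5000]
```

Note that every "normal" transaction contributes to the density estimate, which is exactly why the normal data cannot be thrown away, yet corrupted records that contaminate that estimate are still irrelevant to this problem.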

The problem with perspective (A) is that for a different problem, these unwanted data (noise), may all of a sudden become totally relevant (signal). So what is noise (irrelevant data to the problem) changes as we switch from one problem to another as I described in this article.

(B) Now, if we are focused on the model, everything changes. Suppose we spend years building the model, training it, and so on, and we believe our model is correct: it's elegant and predictive and has everything we want in a good model. Then it is natural to take perspective (B). If the model is perfect, then any variation in the data that cannot be predicted by the model must be statistically random. And that is why statisticians call it noise.

The trouble with perspective (B) is that if you alter your model parameters, or use a completely different model, you will alter what the model can predict, as well as what it cannot predict. So the unexplained variance changes as we switch models. If we use a simple linear model, the noise may be huge; if we use a more complex model (e.g. neural networks, support vector machines, random forests, boosting, ensemble classifiers and regressors, etc.), the model may fit the data better. Consequently, the noise becomes much smaller. So what is noise (unexplained data variance with respect to the model) in one model may become signal in a different model.
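A quick numerical illustration of that point, with made-up data: fit a line to data that is actually quadratic and the "noise" is large; switch to the right model and the very same variance is fully explained:

```python
def residual_variance(y, y_hat):
    # "Noise" under perspective (B): variance the model leaves unexplained.
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]          # truly quadratic data

# Model 1: best-fit line (closed-form least squares).
n = len(xs)
xm, ym = sum(xs) / n, sum(ys) / n
slope = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / sum((x - xm) ** 2 for x in xs)
intercept = ym - slope * xm
linear_hat = [slope * x + intercept for x in xs]

# Model 2: a richer (quadratic) model that matches the data's true form.
quad_hat = [x * x for x in xs]

residual_variance(ys, linear_hat)  # 2.8 -- large "noise" under the linear model
residual_variance(ys, quad_hat)    # 0.0 -- the same variance, now explained
```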

Having been a mathematician/statistician and used machine learning for almost 10 years now, I rarely come across a model that I consider perfect. As Carla mentioned, situational constraints (e.g. time, funding, etc.) often force premature termination of the model refinement exercise. However, I believe that all models can be improved until we get into the overfitting regime. Even the random variations in the data that are often treated as noise can be modeled explicitly as a noise term in the model. That is why I often treat unexplained variance in data as a combination of two things: (a) failure of the model, and (b) poor assumptions about the noise-generation mechanism.

OK, I hope I’ve reconciled the two different perspectives on “noise.”

Now let’s talk about the less heavy stuff.

Dean, Carla, and Ned, I very much enjoy these types of academic discussions and debates. That is how we can all learn from each other and understand the different perspectives out there. Unfortunately, Twitter is just not a conducive environment for such in-depth discussions. Imagine repeating this discussion 140 characters at a time. That is why I really appreciate your willingness to carry the conversation here on my blog. It’s awesome! And I am usually very responsive on my blog (unless I’m traveling).

As Dean mentioned, I would love to meet all of you in person too. In fact, Ned actually met me a couple of years ago while visiting San Francisco (SF). And we all started with these academic debates on my blog here on the Lithosphere. So please ping me next time you visit SF; if I’m in town, I’d love to get together.

Carla, I totally agree that experience is huge. I was using machine learning to model how our brain processes visual information for my PhD. And exploratory data analysis is always the first step. Basically, it is playing with data. Our brain is able to pick up obscure trends and minute patterns better than anything. In fact, that is how I spot real practicing data scientists from talking heads full of theories. That is not to say that theory is not important. I do a fair amount of theoretical work myself too, but actual experience with different kinds of data is what makes the difference between a good data scientist and a great one.

Alright, this reply is getting very long. I hope you enjoyed reading it as much as I enjoyed writing it.

Any discussion is ALWAYS welcome here. So please chime in if you are intrigued.

First of all, I'm a scientist, not an industry analyst, so I don't really like to comment publicly about anyone's technology. I believe that whatever the technology is, the market will figure out what is good and what is important, and good technology will outcompete the marketing types. So I always go back to basic principles.

Compression is definitely not a new idea. In fact, many column-store technologies compress the data in each column, because values of a similar type compress better. It is not a bad idea. But I certainly don't think it is the ONLY way to deal with the data explosion. I can understand how you might believe that is the case, since that is your business.

But I do believe that as computing power increases, compression will just be a transparent layer between storage/transmission and computation/presentation. Basically, everything will be compressed unless there is a need to use the data in uncompressed format (e.g. for computation, presentation, etc.). It will just be something so standard that we won't even have to think about it.

Thanks for a detailed response. Like you, I too enjoy these exchanges (I guess once an academic, there is a part of you that is always 'research'-oriented) and learn a lot from them.

I found it interesting that you took a position on noise in terms of problems and models. I agree with your statements relative to these. However, I feel (unless I misunderstood) there is also a third perspective here.

Noise relative to both a 'problem' and a 'model' is context-specific, as you mention. Noise for one problem can be signal for another. Similarly, noise for one model can become signal if the model is tweaked or altered, or under a different model.

The third perspective I am talking about is noise that is always irrelevant and meaningless irrespective of the model used or the problem analyzed, and so has to be either removed or corrected. (Of course, one could argue that the problem might be estimating the noise itself, in which case the noise is the 'signal'.) One example I can think of is in text analytics, where errors, degraded inputs, and speech disfluencies can make the data/input noisy. These need to be addressed before any model building is undertaken.

I guess one could say that the problem is "analyzing the text" and thus classify it in your first category. The reason for me separating this out is that we could address multiple problems by looking at the same 'textbase', and there can be noise that underlies all of them.

Would you agree? Or would you classify this under one of your two categories?

However, personally I don't think there are data that are completely useless (i.e. data that are always noise) irrespective of the model or the problem.

Even for the cases you mentioned, where the data are captured or stored with errors, those data could still be very useful for the problem of understanding why those errors occur. And I'm not making these up; these problems are very real. Engineers often need to analyze error data in order to improve the instruments or devices for data capture/storage.

Mathematically, the set of all possible models and the set of all possible problems are both infinite. And they can pretty much cover everything. So, even though I have not gone through and written down a proof for this, I suspect that noise with respect to these two perspectives is pretty general.

So, as you already suspected, this would be subsumed in perspective (A) in my reply above.

Why not leverage the endless hours of work, progress, and improvement of big data output captured in the thousands of reports used daily? Some of those reports have been audited; others have been reviewed by multiple functional experts and are derived from complex systems, the web, PDFs, etc.

There is a place for all methods, but the report mining method could help direct more funds to complex big data projects while delivering immediate ROI from a report warehouse or ad hoc use.

For example, why capture all the detailed financial transactions, which will contain countless errors, when you can have them correctly classified, accounted for, and analyzed in reports that have been reviewed by experts or audited? I'm sure this could apply not just to operational functions but to other data projects as well.

Thanks for this blog, I enjoyed very much your explanation about noise, by the way.

I'm very interested in knowing more about this, because it also articulates succinctly the importance of context for the searching and filtering (if you do have a set of questions) and for the analysis and interpretation (if you do not, and you have to formulate them once you have the pool of data at hand).

I agree with you on the need to invest in the possibilities of data even before the interested parties know what to do with it, but in my work environment I've also seen a lot of mysticism with regard to Big Data (or any kind of data) among some audiences.

From what I have seen, I think it is easy for business people to get trapped in the promise and rhetoric of it, and to pursue initiatives for which they lack the specific questions AND the technical power to 'just explore'. Sometimes organizations do have the hard technical power, but there is no effective communication between that group of people and the people who could shed some light on the signals that would be right for a specific context.

I hear a lot (and I agree) that we have to foster data literacy across the whole organization, but until we get there, I'm afraid it will continue to be 'magic' for some people, with the inevitable crash when they see they are not getting 'tangible results' once the initial excitement is over. This disenchantment might undermine further data initiatives, and that would be a pity.

I will stay tuned to your writing and the discussion in the community.