Big Data - A Challenge or a Fad?

Published on 18 November 2013

Sherlock Holmes said to his friend Watson: "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?" (Sherlock Holmes in The Sign of the Four). Watson played second fiddle then, and today another Watson plays second fiddle to Citigroup. This Watson is an IBM computer. Its road to fame was winning the quiz show Jeopardy, defeating Brad Rutter (the show's all-time money winner) and Ken Jennings (holder of the longest championship run, 74 wins). Citigroup is using Watson to profile customers for creditworthiness and aims to use it for fraud detection in the future. The task seems simple, but what percentage of the millions or billions of transactions that take place every day across the offices of Citigroup, or any other financial institution, involve fraud or bankruptcy?

The answer is in single digits!
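To see why such a small percentage makes detection hard, consider a minimal sketch. The 1% fraud rate and the transaction count below are illustrative assumptions, not Citigroup figures, and `baseline_metrics` is a hypothetical helper. It scores a naive model that labels every transaction as legitimate:

```python
# Illustrative numbers only: the fraud rate is an assumption, not real bank data.
def baseline_metrics(total_transactions, fraud_rate):
    """Score a naive model that labels every transaction as legitimate."""
    fraud_cases = round(total_transactions * fraud_rate)
    legitimate_cases = total_transactions - fraud_cases
    frauds_caught = 0  # the naive model never flags anything
    accuracy = legitimate_cases / total_transactions
    recall = frauds_caught / fraud_cases if fraud_cases else 0.0
    return {
        "fraud_cases": fraud_cases,
        "accuracy": accuracy,  # share of transactions labelled correctly
        "recall": recall,      # share of frauds actually caught
    }


metrics = baseline_metrics(total_transactions=1_000_000, fraud_rate=0.01)
print(metrics)
```

Under these assumptions the naive model is right on almost every transaction and still catches no fraud at all, which is why accuracy alone says little when the event of interest is rare.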

Image: Renjith Krishnan, freedigitalphotos.net

The exponential growth of data, particularly in the last two decades and most of it unstructured, has given rise to a phenomenon known in layman's terms as analytics. The unstructured format makes it more difficult to profile customers, identify patterns, make recommendations and so on. The tsunami of data that hits us every day is mind-boggling in its variety, size and velocity. Consider this: YouTube receives 24 hours of video every minute, Facebook has 700 million users and counting, and Twitter carries 400 million tweets per day. This has allowed a wide range of companies to flourish, offering everything from simple analytics to complex value-added analytics for customers across different industries. That is reason enough to consider Big Data the fourth factor of production after labor, capital and land. After the wave of social media in the IT space, it is Big Data that is being touted as the next big thing. Big Data is all about volumes ranging from gigabytes to petabytes and soon to zettabytes. Once this structured, semi-structured or unstructured data has been rearranged and classified, it is relatively easy to process. The next step is to analyze it with the existing computing power.

A Challenge or a Fad?

The nature of Big Data is ever-changing, driven by revolutions in data storage technology and by the myriad sources that have grown over the years to make this world a global village. What could be stored on a hard disk last year can now be stored on a USB drive. The sources from which the data is generated can be categorized as:

1) Online transactions

2) Social networks

The data generated by these sources poses many challenges when it is processed and analyzed for insights. These challenges can best be summarized as below:

1) Scalability of resources:

Data is being generated far faster than computing resources are scaling. Clock frequencies are no longer increasing; instead, it is the number of cores built into a processor that keeps growing (now a common feature in laptops and mobile devices). The past was all about following Moore's law for increases in computing power, but that pace seems to have changed. Earlier it was relatively easy to add capacity by buying more storage drives, but the cost has not gone down. Also, storage devices such as hard disk drives (HDD) are being replaced with solid state drives (SSD), which affects how we process data in terms of appropriate algorithms and database design. The more data we store, the more the issue of data quality arises. This has given rise to master data management (MDM), but large data sets still have to be processed, analyzed and characterized.

2) Speed of analysis:

As the data volume grows, it takes more time to process and analyze. The speed of processing is inversely proportional to the speed at which data is generated. This matters even more if results are to be provided and maintained in real time. For example, to flag a fraudulent transaction in real time, the analytics design must be robust enough to identify such a deal. However, the design is based on past data, which may or may not be able to classify the current transaction. This leads to frequent changes to the analytics structure. As we change and tamper with the design, it becomes more complex, and the speed of turning raw information into valuable insight decreases. Therefore, designing models for analytics becomes more challenging as the data volume grows.

3) Unstructured format:

Big Data consists not only of text but also of images and videos. The challenge is to extract valuable information from data that is increasingly dissimilar. For example, in a bank's transaction database the variables for a customer might be age, gender, PIN code, amount, time, image of the product bought, loans, religion and insurance contribution. How should an analytics model be designed so that the image of the product and the amount paid can be correlated to predict the customer's spending behavior? Of course there are advanced algorithms, but the problem is compounded when it is extrapolated to millions of customers making millions of transactions every day of the year. Besides, we may not have values for every variable for every customer, which makes prediction a lot more challenging: the amount, time and image of the product bought may be available for some customers but not for all. The problem becomes even more interesting when customers switch banks and their data is no longer available for developing future prediction models.
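The missing-values problem can be made concrete with a small sketch. The four customer records, the field names and the helpers `complete_cases` and `field_coverage` are all hypothetical, chosen only to mirror the variables mentioned above; a real bank schema would look different.

```python
# Hypothetical records: None marks a value the bank does not have.
records = [
    {"customer": "c1", "age": 34, "amount": 1200.0, "time": "10:15", "product_image": "img_001.jpg"},
    {"customer": "c2", "age": 51, "amount": 80.5, "time": None, "product_image": None},
    {"customer": "c3", "age": None, "amount": 430.0, "time": "18:40", "product_image": "img_007.jpg"},
    {"customer": "c4", "age": 27, "amount": None, "time": "09:05", "product_image": None},
]


def complete_cases(rows, required_fields):
    """Keep only the rows that have a value for every required field."""
    return [
        row for row in rows
        if all(row.get(field) is not None for field in required_fields)
    ]


def field_coverage(rows, fields):
    """Fraction of rows that have a value for each field."""
    return {
        field: sum(row.get(field) is not None for row in rows) / len(rows)
        for field in fields
    }


usable = complete_cases(records, ["amount", "time", "product_image"])
print([row["customer"] for row in usable])
print(field_coverage(records, ["age", "amount", "time", "product_image"]))
```

Even in this toy data set, a model that needs the amount, the time and the product image can use only half of the customers. Each additional required variable shrinks the usable set further.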

4) Socio-Privacy issues:

Google recently collapsed its 60 individual privacy policies into one, and Facebook changes its privacy policy after every security breach. Both bring to the fore the problem of harvesting customers' data without their acknowledgement.

5) Skills:

Though the analytics industry started with the advent of the Internet, it has acquired significance only in the last 5-7 years. This industry demands specialized skills coupled with the expertise to interpret such voluminous data. The major skills required are:

• Statistics

• Programming (particularly machine learning techniques)

• Operations research (particularly optimization techniques)

Such skills are in short supply today and hard to find in one individual. This demand-supply mismatch needs to be addressed to enable companies to gain insights from the data they have stored at a cost.

A Fad:

We have a large volume of increasingly unstructured data, and we have to mine it to break it down into a few relationships between the variables. This is data mining, and Big Data is just another word to describe it. As with all technologies that spawn books and companies looking to drive their profits north, Big Data is nothing but a marketing gimmick. It brings together into one body of knowledge all the data mining techniques that previously existed separately. The only indisputable fact is that we have a larger amount of data than ever before, and computing power and technologies have not progressed at the same pace. With the evolution of better data storage technology and data processing techniques, Big Data will soon be small data. The global economic downturn is one of the main contributing factors to the hype surrounding Big Data. Organizations around the world are trying to gain insights from the treasure trove of data they have stored and to improve their business practices by extracting valuable information with complex analytics. The downturn has forced companies to look inwards to improve their business efficiency and optimize their productivity. As customers look to decrease their IT spending, IT firms are pushing Big Data services as the way forward. The following reasons explain why “Big Data” is nothing new and why we have been doing it since the advent of the internet:

1) It is about analyzing petabytes of data. Earlier it was about storing and analyzing gigabytes of data. With the advent of new technologies like Hadoop, Lucene and Cassandra (originally developed at Facebook), analyzing such voluminous data has become easy.

2) For the analysis to be competitively insightful, the analyst must have expertise in the domain. This helps the analyst understand the problem at hand and may also throw up solutions that the expert group did not think of.

3) As the amount of data grows, the predictions become more accurate. This is true, but the data analyzed should be related to the question or problem one is trying to solve. Adding millions or billions of unrelated data points misleads, and can even harm an organization's growth if the result is adopted as a solution.

4) Since every organization has access to its own data and to the same analytic tools, the solutions to the same problems within an industry group are also projected to be the same. Thus, over time, Big Data loses its competitive advantage and hence its value.
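Point 3 above, that unrelated data can mislead, can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the sample size, the number of variables, the seed and the helper `strongest_chance_correlation`. The code generates a target and a set of purely random, unrelated variables, then reports the strongest correlation that appears by chance alone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a random target and unrelated noise variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    strongest = 0.0
    for _ in range(n_features):
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]  # unrelated by construction
        strongest = max(strongest, abs(pearson(noise, target)))
    return strongest


print("10 unrelated variables:  ", strongest_chance_correlation(30, 10))
print("2000 unrelated variables:", strongest_chance_correlation(30, 2000))
```

None of the variables has any real link to the target, yet the more of them are added, the stronger the best apparent "relationship" becomes. An analyst who adopts such a pattern as a solution is acting on noise.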

The article has been authored by Mudit Gupta, Institute of Management, Nirma University