
Abstract

Big data is all of a sudden everywhere. It is too big to ignore! It has been six decades since the computer revolution, four decades since the development of the microchip, and two decades of the modern Internet. More than a decade after the .com fizz of the 90s, can Big Data be the next Big Bang? Big data reveals part of our daily lives. It has the potential to solve virtually any problem for a better, more urbanized globe. Big Data sources are also very interesting from an official statistics point of view. The purpose of this paper is to explore the conception of big data and the opportunities and challenges associated with using big data, especially in official statistics. A petabyte is the equivalent of 1,000 terabytes, or a quadrillion bytes. One terabyte is a thousand gigabytes. One gigabyte is made up of a thousand megabytes. There are a thousand thousand, i.e., a million, petabytes in a zettabyte (Shaw 2014). And this is to be continued.

Acknowledgments

I would like to express my gratitude to my supervisor Per-Gösta Andersson for his assistance and guidance throughout my thesis. I cannot thank him enough for his remarkable support and help. I would especially like to thank Ingegerd Jansson (Statistics Sweden) for guiding me throughout my thesis, directing me to new ideas and keeping me motivated and encouraged. I would like to thank my examiner Sune Karlsson for providing valuable suggestions and corrections. I would also like to thank Panagiotis Mantalos for his caring attitude and support. Furthermore, I would like to thank my partner, Kamal, for the love, kindness and support he has shown during my studies, which has carried me to the completion of this thesis. I would also like to thank all my friends for their endless support. Last but not least, I would like to dedicate this thesis to my beloved brother, Zhiaweh Danesh, who was a great economist-statistician and my life's greatest hero. May he rest in peace.

1. Introduction

Data is everywhere. As the world goes modern, more and more data are being generated. Data are produced from phones, credit cards, computers, sensors, trains, buses, planes, bridges, and factories! The list goes on. Marc Andreessen argued compellingly that "software is eating the world" in his 2011 essay. According to Andreessen (2011), in the next decade at least five billion people worldwide will own smartphones, giving every one of them direct access to the Internet at any time. Figure 1 shows the digital data created annually worldwide.

Figure 1: Digital Data Created Annually Worldwide. Source: Energy-Facts.org (2012).

The amount of data and the frequency at which they are produced have led to the introduction of the term Big Data. Everyone seems to be curious about it and willing to collect and analyze it (Jansson & Isaksson 2013). Big data is a data source with at least three features: extremely large volume of data, extremely high velocity of data and extremely wide variety of data. It is important because it allows for gathering, storing and managing enormous amounts of data in real time to gain a bigger understanding of the information (Hurwitz, Nugent, Halper & Kaufman 2013).

The data is here; its challenges and the ways to make it useful have been regarded as an IT problem rather than a statistical issue (Jansson & Isaksson 2013). Big data has been looked at from an IT perspective, where the focus is mainly on software and hardware issues (Daas et al. 2012). IT specialists have designed new methods for processing, evaluating, and presenting the data, which are called Big Data Analytics. The statistical offices are now also beginning to address the big data problem (Jansson & Isaksson 2013). But the question is whether the same statistical methods are applicable to big data sources and whether big data will meet the goals of official statistics. The aim of this paper is to investigate the term big data and the opportunities and challenges associated with using big data, especially in official statistics production. Three options for using big data in official statistics production have been proposed by Robert M. Groves 1 : ignoring big data as the first option, destroying all official statistical structures and replacing them with big data as the second option, or combining big data with traditional sources as the third. Groves came to the conclusion that the first two options are unacceptable and irrational. So the third option, using big data to improve or partly replace traditional data sources, is the most plausible case (Jansson & Isaksson 2013). The promise so far is that by combining the power of modern computing with the overflowing data of the digital era, big data can solve almost any problem (Cheung 2012). This paper provides an overview of the concept of big data in general in section 2. Section 3 presents some big data case studies, followed by a discussion of big data in the world of official statistics in section 4. Some methods for inference and the selectivity problem are explored in section 5. Section 6 explores the dark sides of big data: the problems with data, privacy and analyzing big data are discussed.
Section 7 discusses the paper along with some conclusions.

1 Robert M. Groves's speech at the opening session of NTTS in March.

2. Big Data

The world today is oversupplied with information. There are cellphones in almost every pocket, computers in every home and office, Wi-Fi everywhere. The scale of information is growing faster than ever before, and this quantitative shift has led to a qualitative one. The term Big Data was first coined in the 2000s by sciences like astronomy after experiencing the data explosion (Cukier & Mayer-Schonberger 2013).

Definition

There is no precise definition of big data. Every paper on big data defines the phenomenon differently. The various existing definitions usually include the three Vs: volume, velocity and variety. Volume refers to the data sets being large; much larger than usual. Velocity points to the short time lag between the occurrence of an event and its analysis. It can also refer to the rate at which data are generated. Variety indicates the wide mixture of data sources and formats: from financial transactions to text and video messages (Cukier & Mayer-Schonberger 2013). Figure 2 expands on the three Vs.

Figure 2: The three Vs at an increasing rate.

IBM has a fourth V, veracity, in its definition of big data, which takes into account the accuracy of the information and whether the data can be trusted enough to base important decisions on (IBM 2012). Gartner, Inc., the world's leading information technology research and advisory company, defines big data as: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." Cukier and Mayer-Schönberger choose the following definition of big data in their book: "Big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more." And statistical organizations regard big data as (Jansson & Isaksson 2013): "Data that is difficult to collect, store or process within the conventional systems of statistical organizations. Either their volume, velocity, structure or variety requires the adoption of new statistical software processing techniques and/or IT infrastructure to enable cost-effective insights to be made." An important factor which makes big data different from official statistics is that big data sources often contain information not necessarily directly related to statistical elements such as households, persons or enterprises. The information in big data is often a byproduct of some process not principally aimed at data collection, while survey sampling and registers clearly are. Therefore, analysis of big data is more data driven than hypothesis based (Buelens, Daas, Burger, Puts & Brakel 2014). Table 1 compares big data sources with traditional data sources such as sample surveys and administrative registers. Apart from the three Vs, three additional categories are listed. The records category looks at the scale at which data are observed and stored. The generating mechanism refers to how the data source is generated.

The last difference listed in Table 1, fraction of population, refers to the coverage of the data source in relation to the population of interest. The most important dissimilarity is between registers and big data; registers often have almost complete coverage of the population, while big data generally do not. In some cases of big data sources, it may even be unclear what the target population is (Buelens et al 2014).

Table 1: Comparing data sources for official statistics.

  Data source              Sample survey   Register           Big data
  Volume                   Small           Large              Big
  Velocity                 Slow            Slow               Fast
  Variety                  Narrow          Narrow             Wide
  Records                  Units           Units              Events or units
  Generating mechanism     Sample          Administration     Various
  Fraction of population   Small           Large, complete    Large, incomplete

Source: Buelens et al (2014).

There is one more category which is not present in Table 1: the error accounting for each of the three data sources. In survey sampling, all the sources of error, such as sampling variance, non-response bias, interviewer effects and measurement errors, are included in the concept of Total Survey Error (Buelens et al 2014). For big data, no complete approach to error budgeting or quality frameworks has been developed yet. The bias due to selectivity affects the error accounting of big data, but there are also other features to consider. For example, the measuring instruments for big data sources differ from those of survey sampling, where the survey design, capable interviewers and well-defined hypotheses are the key elements (Buelens et al 2014).

3. Previous studies

Almost all previous studies about big data show the great opportunities that come with big data. Big data brings the newfound ability to crunch a vast quantity of information, query it instantly, and even draw surprising conclusions from it. Big data is a developing approach; it can translate numerous phenomena, from the price of airline tickets to the text of millions of books, into a form that can be searched effectively, and with our growing computing techniques it discovers epiphanies that we never could have seen before. Big data is a revolution on the same level as the internet; it will change the way we think about many important matters such as business, health, politics, education, and innovation in the years to come (Cukier & Mayer-Schonberger 2013). Cukier and Mayer-Schönberger, two leading experts on big data, explain what big data is, how it will change our lives, and what we can do to protect ourselves from its hazards. Their book, Big Data: a revolution that will transform how we live, work, and think, is the first big book about the next big bang. Cukier and Mayer-Schönberger argue that the more data there is, the more useful it becomes. By analyzing facts about 100 million observations rather than just one, a dozen, or a small sample, diseases can be cured, elections can be won, billions of dollars can be earned and much more. The authors believe that by analyzing huge amounts of data, more patterns and relationships can be discovered; patterns that are mostly invisible when using smaller amounts of information. These insights will guide us to new solutions and opportunities we would never otherwise have suspected. Cukier and Mayer-Schönberger give many examples. One example involves the store Walmart and the famous breakfast snack Pop-Tarts. Walmart decided to record every purchase by every customer for future analysis.
After a while, the company's analysts observed that when the National Weather Service warned of a hurricane, sales of Pop-Tarts rose significantly in Walmart stores in the affected area. Therefore, store managers put Pop-Tarts near the entrance of the store during hurricane season, and sales soared. This is big data at its coolest. No one would have guessed the link. The power tracking company Efergy USA is a big seller of monitors and hardware that connect wirelessly to fuse boxes. The monitor shows the energy consumption up to 255 days in the past. It calculates hourly energy usage, consumption trends and the price!

According to Juan Gonzalez, president of Efergy USA, "It makes you realize when you're using too much electricity and see how you can reduce." Their system can be set to alert customers when they reach their target consumption. This way, it can be easier to save on electricity bills (Wakefield 2014). In Efergy's case, big data makes it possible to see what is happening on a larger scale and to find solutions. For example, a customer who wants to cut down the energy bill can see where the cost can be cut. The data collected also shows the client's peak hours. "When you put data in a larger context, which is big data, it allows them to help make more sense of that information and make it more actionable; the only way we can detect all these things in our home is looking at many homes and developing an algorithm to determine the connection," states Ali Kashani, co-founder and vice president of software development at Energy Aware, an energy monitoring business (Wakefield 2014). Cukier and Mayer-Schonberger discuss how cheap and easy it is to store gigantic amounts of information nowadays, which once was impossible. As a result, we can now record almost everything. The authors also explain that simply throwing more data at a problem can create remarkable results. Microsoft Corporation found that the spell checker in word processing software could be highly improved by having it process a database of one billion words. Google Inc. boosted its language translation service by drawing on the Internet for billions of pages of translated documents and analyzing them. Amazon.com used customers' individual shopping preferences to suggest new books to each customer by using computers to analyze millions of transactions, which was not only a cheap method but also gave excellent results (Cukier & Mayer-Schonberger 2013). Why? Who knows? Knowing what, not why, is good enough, the authors stress.
Big data analysis cares about correlation, not causality. It often uncovers surprising results. However, computers do not explain; statistical methods are required to unveil the hidden connections (Cukier & Mayer-Schonberger 2013). In 2008, Wired magazine's editor in chief, Chris Anderson, argued that big data makes the scientific method obsolete. In his article, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," he claimed that with enormous amounts of data the scientific method would be out of date. Anderson emphasized that observing, developing a model and formulating a hypothesis, testing the hypothesis by conducting experiments and

collecting data, and analyzing and interpreting the data are all going to be replaced by statistical analysis of correlations, without any theory. He argues that all the old models and theories are invalid, and that with more information the modelling step can be skipped; instead, statistical methods can be used to find patterns without making hypotheses first. He values correlation over causality (Anderson 2008). Anderson wrote: "Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves." Data analysis is inspiring, but not perfect. Cukier and Mayer-Schönberger open their book by writing about Google's Flu Trends service, which analyzes billions of internet searches to estimate the prevalence of flu in the United States. However, even this overhyped technique failed badly, when the estimate of flu cases was twice the actual number. In addition to finding trends, big data analysis is getting better and better at forecasting, the book points out. Police forces use the technology to place patrols at certain times of day on certain streets in some cities around the world and in some US states. It is also used to decide which prisoners are too dangerous to release and which can be released conditionally (Cukier & Mayer-Schonberger 2013). As with every great opportunity, there are some drawbacks too. Big data can be rather creepy. The authors discuss the issue of privacy. In chapter 8 of the book, they argue that big data is destroying privacy and threatening freedom. In the following chapter, Cukier and Mayer-Schönberger discuss how the benefits of big data can be enjoyed without losing privacy. Their sparkling book leaves no doubt that big data is the next big thing!

4. Big Data and official statistics

"Big Data is an increasing challenge. The official statistical community needs to better understand the issues, and develop new methods, tools and ideas to make effective use of Big Data sources" (UNECE 2 ).

Apart from creating new opportunities in the private sector, big data could also be a very interesting input for official statistics, either used on its own or combined with traditional data sources such as sample surveys and administrative registers. Otherwise, the private sector may benefit more from the big data era by producing more and more statistics that even beat official statistics. National statistics offices are unlikely to lose their official statistics mandate, but they risk losing their position and importance as time passes, even with all the precision, reliability and interpretability of the statistics produced in these national offices. However, selecting the information from big data and fitting it into a statistical production process is not easy (UNECE 2011).

Big Data at Statistics Sweden

Big data exists at Statistics Sweden (SCB). Statistics Sweden uses data from cash registers for calculating the Consumer Price Index (CPI); the data comes weekly from more than 300 suppliers. Jansson and Isaksson (2013) point out that the data is modified before arriving at Statistics Sweden. They emphasize that the big data that enters Statistics Sweden has been reduced in volume and comes in structured form, and although the data is produced rapidly, it arrives at fixed time intervals, such as once a week or once a month. According to Jansson & Isaksson (2013), big data at Statistics Sweden is used as a complement or as auxiliary data for traditional sources in order to get more exhaustive and/or cheaper data. The data is even used for modelling in some cases. Furthermore, they underline the fact that these kinds of data have not been used for direct analysis or rapid estimates, nor have they required a complete redesign or extra production systems so far. But they still differ from the traditional data sources.

2 United Nations Economic Commission for Europe.

The suppliers of big data to Statistics Sweden are either firms (stores that sell the goods of interest) in Sweden or companies providing sensor data and credit card information which are located overseas. Neither of these information providers can be expected to consider the needs of Statistics Sweden. They cannot even be expected to report changes in their datasets, which in the long run might have serious negative effects on the future ability to produce official statistics using time series data (Jansson & Isaksson 2013). Statistics Sweden is interested in the future use of big data and is therefore taking part in an association called The Swedish Big Data Analytics Network 3. The emphasis is mainly on supporting the possibilities of big data through research, enhanced infrastructure, capability building and other key elements for future progress (Jansson & Isaksson 2013). In their report, Jansson & Isaksson also note the idea of using electricity data as a complement to housing statistics, although there is a need for improvement before taking any action (Jansson & Isaksson 2013). The idea is similar to the use of electricity consumption data in Ireland, with an estimated 1.5 million to 2.2 billion records monthly. It contributes to the improvement of the household register, which in turn leads to a better estimation of electricity usage (Dunne 2013). The project included using time series of electricity usage between July 2009 and January 2011 for around 6,000 monitoring meters placed around Ireland. The goal of the project was to describe electricity consumption behaviour and predict electricity usage in Ireland (Silipo & Winters 2013).

Big Data at other agencies

Big data is a BIG issue for statistical agencies around the world, especially at Statistics Netherlands (CBS) (Jansson & Isaksson 2013). Statistics Netherlands investigated both the possibilities and the futility of big data.
They have analysed data 4 from traffic sensors and from social media (Daas et al 2013). Population distribution and movement can be studied by analysing mobile phone call activity data; however, the representativeness of the data should be considered (De Jong et al. 2012). After an earthquake in Christchurch, New Zealand, in 2011, mobile phone data were used to observe population activities following the earthquake. Those data made it possible to map the movement of the people in order to know where in the country help was most needed (Statistics New Zealand 2012). At the Nordic Chief Statisticians meeting in Bergen in August 2013, big data was discussed as one of the hottest subjects. It showed that the Nordic countries do not fully agree about the characteristic features of big data, and no policy for big data has been made yet, but it is on everyone's agenda. Administrative data sources have a long history in the Nordic countries, which counts as valuable experience demonstrating the usefulness of such data (Jansson & Isaksson 2013). There are also many big data discussions taking place at different levels in Eurostat and the UNECE (Jansson & Isaksson 2013). The Director Generals of the National Statistical Institutes within the EU "acknowledge that Big Data represent new opportunities and challenges for Official Statistics, and therefore encourage the European Statistical System and its partners to effectively examine the potential of Big Data sources" in that regard. Further, they "recognise that Big Data is a phenomenon which is impacting on many policy areas. It is therefore essential to develop an Official Statistics Big Data strategy and to examine the place and the interdependencies of this strategy within the wider context of an overall government strategy at national as well as at EU level" (DGINS 2013). There are other aspects of big data as well. The plan is to adopt an action plan and a road map. There is a project going on during 2014 within the UNECE (Jansson & Isaksson 2013).

3 The purposes are chiefly to: "highlight the recent and increasing importance of advanced analysis of very large data sets in society and business, and the excellent position of Sweden to potentially be at the forefront in this area by leveraging national areas of strength in research and business development; to address the limiting factors that hinder us from realising this potential; and to propose national efforts for remedying these factors and creating a fertile ground for future businesses, services, and societal applications based on Big Data Analytics." (The Swedish Big Data Analytics Network 2013, pp. 2).

4 At almost 13,000 locations, the number of vehicles per minute, their speed, and their length were measured. All the data from all the locations during one day were used for analysis, and it was concluded that, despite issues with missing data and noise, the data gave useful information about traffic flows and types of vehicles. Social media data were used to analyse the sentiment of the Dutch people, giving results that were highly correlated with official figures compiled by traditional methods. A separate study of Twitter messages showed that the data contained a lot of noise. A number of methodological problems were identified through these projects, but the data sources were still viewed as useful (Daas et al 2012).

5. Methods for inference

According to UNECE (2011), big data has the potential to produce more relevant and timely statistics than traditional sources of official statistics. Official statistics has long relied on survey data collections and administrative data 5, in contrast to big data, where most data are freely available or held by private companies. When the velocity of the data generating process increases 6, administrative data becomes Big. Including relevant big data sources in the official statistics process enables National Statistics Offices to achieve higher accuracy and to confirm the consistency of the output (UNECE 2011). As mentioned in previous parts of this paper, big data is mostly unstructured, meaning that there is no predefined model and/or it does not follow the usual database forms (UNECE 2011). Traditional indexes are predesigned with a limited search query, whereas big data comes in any form but a structured one. This huge amount of data of varying types and quality does not fit into neatly defined categories. The most common databases have for a long time been SQL (Structured Query Language) databases, but the data tsunami of recent years has led to something called NoSQL, which does not impose the same demands as SQL databases. It accepts data of all types and sizes and makes the data searchable (Cukier & Mayer-Schönberger 2013). However, picking data from big data sources and fitting it into a statistical production process is challenging (UNECE 2011).

Selectivity

In a finite population, a subset of the data is representative in terms of some variable if the variable of interest has the same distribution in the subset as in the population. All other subsets are known as selective samples. It is much easier to work with representative subsets, and they give unbiased inference about the whole population, but this is not the case with selective samples (Buelens et al. 2014).
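The difference between a representative and a selective subset can be illustrated with a minimal simulation. All numbers here are made up for illustration: a hypothetical population in which younger people are more likely to end up in the data source, and the study variable is correlated with age.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 100,000 persons: the study variable y is
# correlated with age.
N = 100_000
age = rng.uniform(18, 90, N)
y = 100 + 2.0 * age + rng.normal(0, 10, N)   # variable of interest

# Selective subset: the probability of appearing in the data source
# decreases with age, so young people are overrepresented.
in_source = rng.random(N) < np.clip(1.2 - age / 90, 0.05, 1.0)
selective_mean = y[in_source].mean()

# Representative subset: a simple random sample of the same size.
srs = rng.choice(N, size=in_source.sum(), replace=False)
random_mean = y[srs].mean()

print(y.mean(), random_mean, selective_mean)
```

The random subset gives an estimate close to the population mean, while the much larger selective subset is systematically off, no matter how big it is; size does not cure selectivity.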
5 Administrative data is one of the main data sources used by National Statistics Offices (NSOs) for statistical purposes. Administrative data is collected at regular intervals by statistical offices and is used to produce official statistics. Traditionally, it has been received, often from public administrations, processed, stored, managed and used by the NSOs in a very structured manner (UNECE 2011).

6 For instance, using administrative data where data is collected daily or weekly instead.

One of the concerns that arise with big data is whether it is representative. As discussed in part two of this study, big data often concerns an ill-defined, effectively infinite population, and the reference population is not clear. Questions arise as to what the population is, who generates the data, and whether we can draw a sample and recover population properties. In traditional methods and probability sampling, the focus is on getting a representative sample of the population of interest. This is done with the help of a survey design that is expected to give a representative sample. Estimation theory in sample surveys is built on this representativeness assumption (Buelens et al. 2014). The assumption is invalid when using big data: correlations in big data may reflect what is happening, but standard statistical inference cannot be applied (Cukier & Mayer-Schonberger 2013). Some methods have been developed for correcting errors of representativeness, for example errors caused by selective non-response. The Generalised Regression Estimator (GREG) 7 is currently used at Statistics Netherlands (Bethlehem, Cobben & Schouten 2011). Classical estimation methods are essentially grounded in survey design and are known as design-based methods. When a data set is collected in some way other than random sampling, it is uncertain that the data are representative, unless the data set covers the whole population of interest. Therefore, when using big data sources in official statistics, the issue of selectivity needs to be considered (Buelens et al. 2014).

Method

Big data can be part of the production of official statistics. As discussed in the previous part, the selectivity of big data could pose a problem, depending on how the data are used (Buelens et al 2014). In their discussion paper, Selectivity of Big Data, Buelens et al (2014) discuss four different cases where big data is used as an information resource in the production of official statistics.
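The GREG estimator mentioned above can be sketched numerically. This is a minimal illustration on simulated data with simple random sampling and equal design weights, not Statistics Netherlands' actual implementation; all variables and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: auxiliary variable x is known for everyone
# (e.g. from a register), the study variable y only for the sample.
N = 10_000
x = rng.gamma(shape=2.0, scale=50.0, size=N)
y = 5.0 + 0.8 * x + rng.normal(0, 10, size=N)

n = 200
sample = rng.choice(N, size=n, replace=False)
w = N / n                                   # equal design weights under SRS

# Horvitz-Thompson estimate of the total of y
t_y_ht = w * y[sample].sum()

# Regression coefficients fitted on the sample, then the GREG correction
X = np.column_stack([np.ones(n), x[sample]])
b = np.linalg.lstsq(X, y[sample], rcond=None)[0]
t_x = np.array([N, x.sum()])                # known population totals
t_x_ht = w * X.sum(axis=0)                  # their HT estimates
t_y_greg = t_y_ht + (t_x - t_x_ht) @ b
```

The correction term `(t_x - t_x_ht) @ b` pulls the Horvitz-Thompson estimate toward agreement with the known population totals of the auxiliary variable, which is how auxiliary information improves accuracy.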
The first case is where big data are the only source of data used for the production of some statistics. In this setting, careful assessment and selection of the data is crucial, and the

7 A model-assisted estimator designed to improve the accuracy of the estimates by means of auxiliary information.

more important thing is taking care of selection bias through choosing a suitable method of inference (Buelens et al 2014). Buelens et al (2012) argue for the importance and power of the right method of inference, which can overcome the problem of representativeness. Model-based and algorithmic methods are designed to predict parameter values for unobserved parts, and are usually encountered in data mining and machine learning contexts (Hastie et al. 2003). Selecting a proper method and validating its assumptions in specific situations is not a straightforward task (Baker et al. 2013), and there are also limits to what can be achieved when correcting for selectivity: the results will still be biased if particular subpopulations are fully missing in the big data set. According to Statistics Netherlands, none of the big data sources contains identifying variables, so it has so far been impossible to link big data sources to register databases, and therefore an assessment of and correction for the selectivity problem has not yet been achieved (Buelens et al. 2014). Buelens et al (2014) consider using big data as auxiliary data in a process largely based on sample survey data as the second case, where statistics based on big data are purely used as a covariate in model-based estimation methods applied to the traditional survey sample data. By doing so, the sample size can be reduced, which in turn leads to cost reduction and a reduction of the non-response error. This idea arose when data from GPS tracking devices were used to measure connectivity between geographical areas. The degree to which an area is connected to other areas was found to be a good predictor of the variable of interest (in their case, poverty). This means that big data in the form of GPS tracks can be used as a predictor for survey-based measurements.
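The second case, a big data variable used as a covariate for a survey variable, can be sketched as follows. This is a toy model with invented numbers, not the actual GPS-and-poverty study: a "connectivity" covariate is known for every area, the survey variable only for a sampled subset.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setting: a big data covariate (e.g. GPS-based connectivity)
# is known for all 500 areas; the survey variable (e.g. a poverty rate)
# is observed only for 50 sampled areas.
n_areas = 500
connectivity = rng.uniform(0, 1, n_areas)
poverty = 0.6 - 0.4 * connectivity + rng.normal(0, 0.03, n_areas)

surveyed = rng.choice(n_areas, size=50, replace=False)

# Fit a simple linear model on the surveyed areas only ...
b1, b0 = np.polyfit(connectivity[surveyed], poverty[surveyed], deg=1)

# ... and predict the survey variable for every area from the big data
# covariate, surveyed or not.
predicted = b0 + b1 * connectivity
rmse = np.sqrt(np.mean((predicted - poverty) ** 2))
```

Because the covariate is a good predictor, a small survey sample plus the freely available big data variable yields estimates for all areas, which is where the cost and non-response gains come from.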
A risk with this method is the instability of the big data source over time, or sudden changes in it due to technical upgrades or other unexpected circumstances. This is a classic problem for secondary data sources that has also been observed in administrative data (Buelens et al 2014). The next case concerns big data applications that can be used as a data collection strategy in sample surveys, for example geographic location data collected through GPS devices in smartphones to measure movement ranges, where only the parts of the data that have been selected by means of a probability sample are observed (Arends et al. 2013). Schutt and O'Neil (2013) claim that smartphones and their built-in tracking devices are replacing the traditional survey, but all elements of survey sampling and the connected estimation methods remain

valid. The data set collected in this way is not necessarily big, but it has a number of the properties of typical big data sets (Schutt and O'Neil 2013). Buelens et al. (2014) mention as the fourth case using big data regardless of its selectivity complications. They argue that any claim that the resulting statistics apply to the part of the population not covered by the big data source is false. Nevertheless, such statistics may be of interest in their own right and may enrich the publications of official statistics (Buelens et al. 2014). It is also important to keep in mind that the internet as a data source is not essentially a source for new statistics, but rather has the potential to improve existing statistics. There are some considerable problems, such as double counting, sorting, causality, estimation and, in particular, representativeness (Heerschap 2013). Furthermore, Buelens et al. (2014) argue that internet searches are selective: not everyone in the population of interest uses the internet, not all of them use Google as a search engine, and, most importantly, not everybody who looks for information does so through the internet or Google. As the cost of collecting and acquiring data decreases rapidly, the importance of big data will increase, and companies creating and implementing big data approaches gain an advantage at low cost. Big data methods need to find a place in official statistics, and the focus needs to extend beyond using big data to answer known questions, towards discovering patterns that could support decision-making and opportunities that could never have been imagined before (Parise, Iyer & Vesse 2012). Dunne (2013) suggests that organising big data into a large number of groups or pools could be a solution to dealing with big data streams. This way, the data become amenable to traditional processing methods.
The effective way to attain this is to know the volume and number of the groups available, the capacity at which the data are processed, and whether it is necessary to keep the original data once they have been processed (Dunne 2013).
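The pooling idea attributed to Dunne (2013) can be sketched as hash-partitioning an incoming stream into a fixed number of pools, so that each pool stays small enough for conventional processing. The pool count, the per-record summary, and the keep-originals switch below are assumptions made for illustration, not details from the source.

```python
# Minimal sketch: partition a record stream into a known number of pools.
from collections import defaultdict

N_POOLS = 8             # number of groups, fixed in advance
KEEP_ORIGINALS = False  # whether raw records are retained after processing

pools = defaultdict(list)

def summarise(payload):
    """Stand-in for per-record processing; here, just record its size."""
    return len(payload)

def route(record_id, payload):
    """Assign a record to a pool by hashing its identifier."""
    pool = hash(record_id) % N_POOLS
    pools[pool].append(payload if KEEP_ORIGINALS else summarise(payload))

# Simulate a stream of 1,000 records of varying size.
for i in range(1000):
    route(f"rec-{i}", "x" * (i % 50))

# Each pool now holds a manageable share of the stream.
print({p: len(v) for p, v in sorted(pools.items())})
```

The three questions in the text map directly onto the three constants and functions here: the number and size of pools, the processing applied per record, and whether the originals are kept.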

6. Challenges

There are some drawbacks attached to the promising assets of big data. Big data raises questions about analytical value as well as policy issues. There are concerns over whether the data are representative and reliable, together with the overarching privacy issues of using personal data (Cukier & Mayer-Schönberger 2013). Along with big data come computational challenges as well. Besides finding a way to generate manageable structured data from unstructured data, statistical analysis tools such as R and SAS must be integrated to be able to process big data. Furthermore, there is another reason to worry: the risk of too many correlations. If correlations between pairs of variables are tested 100 times, there is a risk of finding, unintendedly, about five false correlations that appear statistically significant⁸ even when there is no real connection between the variables. Without careful control, such errors can increase seriously (Cukier & Mayer-Schönberger 2013). Some of the dark sides of big data are explored in the following parts.

Data

Along with big data comes a very old problem: relying on the numbers when they are far more fallible than we think (Cukier & Mayer-Schönberger 2013). Managing and analyzing data have always brought great benefits, and great challenges, to organizations of all sizes and types. Capturing information about customers, products and services is valuable for businesses. Indeed, a lot of complexity comes along with data. Some data are structured and kept in a traditional database, while other data are unstructured. For instance, it would be much easier if all customers always bought the same products in the same way, but that is far from reality. Companies and sales markets have developed over time and are complicated. As ever more product lines were added, the data became big.
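The multiple-comparisons risk described above can be checked with a short simulation: draw 100 pairs of genuinely unrelated variables and count how many correlations come out "significant" at the 5% level. The sample size and seed are arbitrary choices for the example.

```python
# Simulate the risk of spurious correlations under repeated testing:
# with 100 tests at alpha = 0.05 and no real relationships, roughly
# five tests are expected to appear significant by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_obs, alpha = 100, 30, 0.05

false_hits = 0
for _ in range(n_tests):
    x = rng.normal(size=n_obs)
    y = rng.normal(size=n_obs)   # independent of x by construction
    r, p_value = stats.pearsonr(x, y)
    if p_value < alpha:
        false_hits += 1

print(false_hits)
```

This is exactly the Type I error accumulation the text warns about: each individual test behaves correctly, but the expected number of false positives grows linearly with the number of tests unless a correction (e.g. a stricter per-test threshold) is applied.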
Data difficulties are not limited to sales markets. Research and development (R&D) organizations are an

⁸ A Type I error in hypothesis testing.
