Ian Ayres of Yale University Law School talks about the ideas in his book, Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart. Ayres argues for the power of data and analysis over more traditional decision-making methods using judgment and intuition. He talks with EconTalk host Russ Roberts about predicting the quality of wine based on climate and rainfall, the increasing use of randomized data in the world of business, the use of evidence and information in medicine rather than the judgment of your doctor, and whether concealed handguns or car protection devices such as LoJack reduce the crime rate.

Hadoop was developed to enable applications to work with thousands of computationally independent computers and petabytes of data. Hadoop is a popular open source project that not only incorporates an implementation of the MapReduce programming model but also includes other subprojects supporting reliable and scalable distributed computing, such as HDFS (a distributed file system) and Pig (a high-level data flow language for parallel computing), along with others. See http://hadoop.apache.org.

The Hadoop stack includes more than a dozen components, or subprojects, that are complex to deploy and manage. Installation, configuration, and production deployment at scale are genuinely difficult.

• Transformations and enhancements, such as auto-tagging social media content, ETL processing, and data standardization.

MapReduce

MapReduce is a programming model introduced and described by researchers at Google for parallel computation involving large data sets that are distributed across clusters of many processors. In contrast to the explicitly parallel programming models typically used with imperative languages such as Java and C++, the MapReduce programming model is reminiscent of functional languages such as Lisp and APL in its reliance on two basic operational steps:

• Map, which describes the computation or analysis to be applied to a set of input key/value pairs to produce a set of intermediate key/value pairs, and

• Reduce, in which the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results.

Conceptually, the computations applied during the Map phase to each input key/value pair are inherently independent, which means that both the data and the computations can be distributed across multiple storage and processing units and automatically parallelized.

A Common Example

The ability to scale based on automatic parallelization can be demonstrated using a common MapReduce example that counts the number of occurrences of each word in a collection of many documents. Looking at the problem provides a hierarchical view:

• The total number of occurrences of each word in the entire collection is equal to the sum of the occurrences of each word in each document;

• The total number of occurrences of each word in each document can be computed as the sum of the occurrences of each word in each paragraph;

• The total number of occurrences of each word in each paragraph can be computed as the sum of the occurrences of each word in each sentence.

This apparent recursion provides the context for both our Map function, which instructs each processing node to map each word to its count, and the Reduce function, which collects the word count pairs and sums together the counts for each particular word. The runtime system is responsible for distributing the input to the processing nodes, initiating the Map phase, coordinating the communication of the intermediate results, initiating the Reduce phase, and then collecting the final results.
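To make the two steps concrete, here is a minimal word-count sketch in Python. The function names (map_fn, reduce_fn) and the tiny in-memory "runtime" are assumptions made for illustration only, not the API of Hadoop or any other framework.

```python
from collections import defaultdict

# Map: for each input document (key = doc id, value = text),
# emit an intermediate (word, 1) pair for every word found.
def map_fn(doc_id, text):
    for word in text.lower().split():
        yield (word, 1)

# Reduce: for each intermediate key (word), combine the counts
# emitted by the map phase into a single total.
def reduce_fn(word, counts):
    return (word, sum(counts))

# A toy "runtime": group intermediate pairs by key, then reduce.
def run_mapreduce(documents):
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for word, count in map_fn(doc_id, text):
            intermediate[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in intermediate.items())

docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
print(run_mapreduce(docs))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```

In a real framework, the grouping, distribution, and fault tolerance handled here by run_mapreduce are the responsibility of the runtime system.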

While we can speculate on the level of granularity for computation (document vs. paragraph vs. sentence), ultimately we can leave it up to the runtime system to determine the best distribution of data and allocation of computation to reduce the execution time. In fact, the value of a programming model such as MapReduce is that its simplicity essentially allows the programmer to describe the expected results of each computational phase while relying on the compiler and runtime systems for optimal parallelization and fault tolerance.
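Because each map computation is independent, the map phase can be spread across workers without changing the logic. The sketch below is only an illustration of that idea, using Python's standard multiprocessing module as a stand-in for a real MapReduce runtime; the documents and function names are made up.

```python
from multiprocessing import Pool
from collections import Counter

def count_words(text):
    # Map step for one document: independent of every other document.
    return Counter(text.lower().split())

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog and the fox"]
    with Pool() as pool:
        partial_counts = pool.map(count_words, docs)  # map phase, in parallel
    totals = sum(partial_counts, Counter())           # reduce phase: merge the counts
    print(totals)
```

Here the caller fixes the granularity at whole documents; as noted above, a real runtime would decide for itself how to split the data and allocate the computation.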

Organizations are sitting on huge volumes of data in their data centers. The next wave of competitive advantage will be driven by exploiting these huge volumes of collected data. Business Analytics provides the required technology and tools to exploit the data.

The talk will provide an overview of the various areas of Business Analytics, such as Reporting, Predictive Analytics, Optimization, and Simulation. Then we delve deeper into Predictive Analytics (or Data Mining) to discover how to get valuable insights from the data. The talk will provide an overview of the Predictive Analytics process and some popular techniques and algorithms, along with some industry use cases.

MapReduce is a programming model that was created (and is used) by Google in the early 2000s to process massive amounts of data. Its name comes from the map and reduce functions common in programming, although here they serve different purposes than their traditional definitions. The concept is important because MapReduce technologies are responsible for decentralizing data storage and processing to increase the speed and reliability of dealing with large data sets. A popular free implementation is Apache Hadoop.

2. NoSQL Databases--Document-oriented databases using a key/value interface rather than SQL (they do not use a relational database management system, or RDBMS) to classify and organize data; created to manage volumes of data that do not have a fixed schema.

NoSQL gained popularity as major companies adopted it to cope with an overload of data that traditional RDBMS solutions could not handle. NoSQL databases provide quick, efficient performance because captured data is stored under a single identifying key, which lets a large volume of transactions be stored quickly.
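A minimal sketch of the key/value idea, using a plain Python dictionary as a stand-in for a document store; the store, keys, and document layout are invented for illustration and are not the API of any particular NoSQL product.

```python
# A dictionary standing in for a document-oriented store: each record is
# stored and retrieved under a single identifying key, and documents do
# not need to share a fixed schema.
store = {}

store["user:1001"] = {"name": "Alice", "tags": ["analytics", "retail"]}
store["user:1002"] = {"name": "Bob", "signup": "2012-03-14"}  # different fields

# Lookup is a single access by key, with no joins and no fixed columns.
print(store["user:1001"]["name"])  # Alice
```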

3. Storage--The technologies that hold the distributed data.

Data can be stored in data centers, which may rely on any number of cloud technologies.

4. Servers--Technologies available for renting computing power on remote machines; a program to “serve” the requests of other programs.

In big data, servers offer support for data storage and management.

5. Processing--The action of extracting valuable information from large datasets.

Processing allows the user to sort through the massive amounts of data and produce information for analysis. During processing, data can be sorted and grouped by algorithms, but it is important to understand the limitations of algorithms applied without human evaluation.
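As a small illustration of sorting and grouping by algorithm, the sketch below groups made-up sales records by region; the records and field names are assumptions for the example.

```python
from itertools import groupby
from operator import itemgetter

records = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 80},
    {"region": "east", "sales": 95},
]

# Sort first, then group: groupby only merges adjacent keys.
records.sort(key=itemgetter("region"))
for region, group in groupby(records, key=itemgetter("region")):
    print(region, sum(r["sales"] for r in group))
```

The algorithm will group whatever it is given; deciding whether region is the right grouping, and what the totals actually mean, is the human evaluation the paragraph above refers to.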

6. Natural Language Processing--Extracting information from human-created text.

This type of processing involves data created by humans rather than by their “actions.” For instance, if you are analyzing Twitter data from the previous six months, you might be looking for keywords and sentiment, which would require Natural Language Processing.
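A very rough sketch of the kind of keyword and sentiment scan described above. The tweets and word lists are invented, and real Natural Language Processing would rely on far more sophisticated models than simple word matching.

```python
tweets = [
    "Loving the new phone, great battery life",
    "Terrible service today, very disappointed",
]

positive = {"loving", "great", "good", "happy"}
negative = {"terrible", "disappointed", "bad", "awful"}

for tweet in tweets:
    words = set(tweet.lower().replace(",", "").split())
    score = len(words & positive) - len(words & negative)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(label, "-", tweet)
```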

7. Visualization--Viewing graphically represented meaningful data.

As data is collected, stored, and then analyzed, it needs to be presented in a way that can be understood and digested. Programs that analyze big data can sometimes interpret the data and represent it in a visual display for easier consumption and/or to show results.
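As one small illustration of turning analyzed results into a visual display, the sketch below draws a bar chart with matplotlib; the categories and totals are invented, and matplotlib is only one of many charting options.

```python
import matplotlib.pyplot as plt

# Summarized results from an earlier analysis step (made-up numbers).
categories = ["east", "west", "north", "south"]
totals = [215, 80, 140, 60]

plt.bar(categories, totals)
plt.title("Sales by region")
plt.ylabel("Units sold")
plt.savefig("sales_by_region.png")  # or plt.show() for interactive viewing
```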

8. Acquisition--Techniques for cleaning up messy public data sources.

As data is collected, it is not always in its purest or most usable form. Various tools help take this raw data and turn it into something that can be processed.
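A minimal sketch of this kind of clean-up step; the raw rows and the rules applied (trimming whitespace, normalizing case, turning empty fields into None) are assumptions for illustration.

```python
raw_rows = [
    "  Alice , NEW YORK ,2017-01-03",
    "Bob,, 2017-01-04 ",
]

cleaned = []
for row in raw_rows:
    fields = [field.strip() for field in row.split(",")]  # trim stray whitespace
    name, city, date = fields
    cleaned.append({
        "name": name.title(),
        "city": city.title() or None,  # empty string becomes None
        "date": date,
    })

print(cleaned)
```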

9. Serialization--Converting a data structure or object state into a format that can be stored.

Serialization occurs after the data is collected and while it is being processed. As the data gets sorted and moved between systems, it may need to be stored; during these steps it must be serialized, and the serialization format depends on the languages and APIs involved.
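A minimal sketch of serialization using Python's standard json module; the record is made up, and other languages and APIs would use other formats (Avro, Protocol Buffers, pickle, and so on).

```python
import json

record = {"user": "alice", "clicks": 42, "tags": ["search", "mobile"]}

# Serialize: convert the in-memory structure into a storable, transferable string.
stored = json.dumps(record)

# Deserialize: rebuild the structure on the receiving system.
restored = json.loads(stored)
assert restored == record
```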

10. CPU--Central processing unit; the hardware within a computer that performs the basic operations of the system (comparable to the “brain”).

11. Hadoop--Open source software developed to enable applications to work with thousands of computationally independent computers and petabytes of data.

12. Petabytes, exabytes, zettabytes--Units of information used to measure data amounts.
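For a sense of the scale these units describe (using the decimal, base-10 prefixes), a quick back-of-the-envelope calculation:

```python
TB = 10**12                           # terabyte
PB, EB, ZB = 10**15, 10**18, 10**21   # petabyte, exabyte, zettabyte

print(PB // TB)  # 1 petabyte  = 1,000 terabytes
print(EB // PB)  # 1 exabyte   = 1,000 petabytes
print(ZB // EB)  # 1 zettabyte = 1,000 exabytes
```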

The Economist Intelligence Unit surveyed over 600 business leaders across the globe and across industry sectors about the use of Big Data in their organizations. The research confirms a growing appetite for data and data-driven decisions, and finds that those who harness them correctly stay ahead of the game.

The report provides insight into their use of Big Data today and in the future, and highlights the advantages seen and the specific challenges Big Data poses for business leaders' decision-making.

Key Findings:

• 75 percent of respondents believe their organizations to be data-driven.

• 9 out of 10 say decisions made in the past three years would have been better if they had had all the relevant information.

• 42 percent say that unstructured content is too difficult to interpret.

• 85 percent say the issue is not about volume but the ability to analyze and act on the data in real time.

• More than half (54 percent) of respondents cite access to talent as a key impediment to making the most of Big Data, followed by the barrier of organizational silos (51 percent).

• Other impediments to effective decision-making are lack of time to interpret data sets (46 percent) and difficulty managing unstructured data (39 percent).

• 71 percent say they struggle with data inaccuracies on a daily basis.

• 62 percent say there is an issue with data automation, and not all operational decisions have been automated yet.

• Half will increase their investments in Big Data analysis over the next three years.

The report reveals that nine out of ten business leaders believe data is now the fourth factor of production, as fundamental to business as land, labor, and capital. The study, which surveyed more than 600 C-level executives and senior management and IT leaders worldwide, indicates that the use of Big Data has improved businesses' performance, on average, by 26 percent and that the impact will grow to 41 percent over the next three years. The majority of companies (58 percent) claim they will make a bigger investment in Big Data over the next three years.

Approximately two-thirds of the 168 North American (NA) executives surveyed believe Big Data will be a significant issue over the next five years, and one that needs to be addressed so the organization can make informed decisions. They consider their companies 'data-driven,' reporting that the collection and analysis of data underpins their firm's business strategy and day-to-day decision-making.

Fifty-five percent are already making management decisions based on "hard analytic information." Additionally, 44 percent indicated that the increasing volume of data collected by their organization (from both internal and external sources) has slowed down decision-making, but the vast majority (84 percent) feel the larger issue is being able to analyze and act on the data in real time.

The exploitation of Big Data is fueling a major change in the quality of business decision-making, requiring organizations to adopt new and more effective methods to obtain the most meaningful, value-generating results from their data. Organizations that do so will be able to monitor customer behaviors and market conditions with greater certainty, and react with speed and effectiveness to differentiate themselves from their competition.