7 Big Data Market: further breakdown
[Figure: Big Data database revenue and market forecast, USD billions]
NoSQL DB categories: Distributed DB, Document-Oriented DB, Graph NoSQL DB, and In-Memory NoSQL DB.
It is not uncommon for an enterprise IT organization to support multiple NoSQL DBs alongside legacy RDBMSs. Indeed, single applications often deploy two or more NoSQL solutions, e.g., pairing a document-oriented DB with a graph DB for an analytics solution. [Dec 2013]

11 Definition and Characteristics of Big Data
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. -- Gartner
which was derived from:
While enterprises struggle to consolidate systems and collapse redundant databases to enable greater operational, analytical, and collaborative consistencies, changing economic conditions have made this job more difficult. E-commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity and variety. In 2001/02, IT organizations must compile a variety of approaches to have at their disposal for dealing with each. -- Doug Laney

21 Apache Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
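
The "simple programming model" behind Hadoop MapReduce can be illustrated with a word count. The sketch below is a single-process simulation in plain Python, not Hadoop API code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel across machines with the framework handling the grouping step.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data on clusters"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be distributed without shared state, which is the property the framework relies on.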

22 Hadoop-related Apache Projects
Ambari: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also provides a dashboard for viewing cluster health and the ability to view MapReduce, Pig and Hive applications visually.
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
ZooKeeper: A high-performance coordination service for distributed applications.

31 What is the fundamental challenge for RDB on Linked Data?
In a relational DB, relationships are distributed across tables. It takes a long time to JOIN them to retrieve a graph from the data.
A native graph DB stores nodes and relationships directly, which makes retrieval efficient.
Retrieving multi-step relationships is a 'graph traversal' problem.
Cited: Graph Databases, O'Reilly
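
To make the contrast concrete, here is a minimal sketch, assuming a toy edge list with made-up names. It is not how any particular graph database is implemented. The rows in edges stand in for a relational edge table, where each extra hop would need another self-join; build_adjacency stores each node's relationships directly, so within_k_steps only follows neighbor sets.

```python
from collections import defaultdict

# Hypothetical edge table: one row per relationship, as an RDB might store it.
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "erin"),
]


def build_adjacency(edge_rows):
    """Store each node's outgoing relationships directly, graph-DB style."""
    adjacency = defaultdict(set)
    for src, dst in edge_rows:
        adjacency[src].add(dst)
    return adjacency


def within_k_steps(adjacency, start, k):
    """Return the nodes reachable from start in at most k hops."""
    seen = {start}
    frontier = {start}
    for _ in range(k):
        # Expand one hop: follow stored neighbor sets, skipping visited nodes.
        frontier = {
            neighbor
            for node in frontier
            for neighbor in adjacency.get(node, ())
            if neighbor not in seen
        }
        seen |= frontier
    return seen - {start}


adjacency = build_adjacency(edges)
print(sorted(within_k_steps(adjacency, "alice", 2)))
```

Each hop costs work proportional to the edges actually followed, rather than a scan or index lookup over the whole edge table per join.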

34 How to Visualize Huge Static Graph
Tree of Life (species) by Dr. Yifan Hu
The information diffusion graph of the death of Osama bin Laden (14.8 million tweets) by Gilad Lotan
Facebook friendship graph (500 million users) by Paul Butler
Challenging task: Squeezing millions and even billions of records into million pixels (1600 X million pixels)

35 Visualization Key Challenges
Visual clutter: How can we encode the information intuitively?
Performance issues: How can we render the huge datasets in real time with rich interactions?
Cognition: How can users understand the visual representation when the information is overwhelming?

44 Finding and Ranking Expertise: Social Network Analysis
Decades of social science studies demonstrate that (social) network structure is the key indicator determining a person's influence, organizational operation efficiency, social capital to get help, potential to be successful, etc.
Who are the key bridges? Who has the most connections? How do these experts cluster?
Analogy: Google's founders utilized the concept of network analysis on webpages to create ranking.
Independent experts on healthcare
Influencers are the ones with high 'Betweenness' and 'Degree' values
A cluster of XYZ experts
UI to highlight experts based on my social proximity, the number of experts she connects, or the importance of social bridges
SmallBlue analyzes the underlying dynamic network structure in the enterprise
[Figure: network of 519,545 IBMers on May 9, 2012]
E6893 Big Data Analytics, Lecture 1: Overview
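
The 'Degree' and 'Betweenness' values mentioned above can be computed on a small example. The sketch below is an illustration only: the slide does not say how SmallBlue computes these scores, so Brandes' algorithm for unweighted, undirected graphs is swapped in here, and the graph (two triangles joined by one link) and node names are made up.

```python
from collections import deque


def degree(graph):
    """Degree: the number of direct connections each node has."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness(graph):
    """Betweenness via Brandes' algorithm (unweighted, undirected graph)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        stack = []                                 # nodes in BFS visit order
        preds = {node: [] for node in graph}       # shortest-path predecessors
        paths = {node: 0 for node in graph}        # shortest-path counts
        dist = {node: -1 for node in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:                    # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:         # v lies on a shortest path to w
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        while stack:                               # accumulate farthest-first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical network: two tight groups connected only through c and d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(degree(graph))
print(betweenness(graph))
```

In this toy graph c and d are the "key bridges": every shortest path between the two groups passes through them, so they get the highest betweenness, while the other nodes score zero.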

45 User Interface for finding knowledgeable and influential colleagues
Search for the most knowledgeable colleagues within the organization or my 3-degree network for who knows topic XYZ (or within a country, a division, a job role, or any group/community).
Based on IBM HR requirements, a 'sponsored search' was added for business department needs.
IBM HR gives a list of about 10,000 IBMers whose names should not be listed in the search results: mostly high-level managers, lawyers, people involved in acquisitions, etc.
A list of 2,000+ words that are inappropriate to search for in the enterprise.
My shortest path to Susan
As a user, you can only see their public information. Private info is used internally to rank expertise, but private data can never be exposed.
Click a name to see their profile (SmallBlue Reach)

46 Visualize social roles of individuals in company
Example: Healthcare experts in the world
Connections between different divisions
Example: Healthcare experts in the U.S.
Key social bridges

47 Shortest Paths between two people in enterprise
Example: Is Tom the right person for me?
His official job role, title, contact info
His public communities
His self-described expertise
The public interest groups he is in
His blogs, forums, postings...
My various paths to Tom. SmallBlue can show the paths to any colleague up to 6 degrees away.
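
Finding a shortest path between two colleagues within a degree limit can be sketched with breadth-first search. This is an assumption about the general technique, not SmallBlue's actual implementation; the function name shortest_path, the max_degrees parameter, and the small network are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal, max_degrees=6):
    """Return one shortest path from start to goal, or None if goal is
    farther than max_degrees hops away (or unreachable)."""
    if start == goal:
        return [start]
    parent = {start: None}            # also serves as the visited set
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_degrees:      # do not expand beyond the degree limit
            continue
        for neighbor in graph.get(node, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor == goal:
                # Walk parent links back to start, then reverse.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append((neighbor, depth + 1))
    return None


# Hypothetical colleague network (undirected, listed in both directions).
network = {
    "me": ["ann", "raj"],
    "ann": ["me", "tom"],
    "raj": ["me", "lee"],
    "lee": ["raj", "tom"],
    "tom": ["ann", "lee"],
}
print(shortest_path(network, "me", "tom"))
```

Breadth-first search visits colleagues in order of distance, so the first time the goal is reached the path found is a shortest one; the depth check keeps the search within the stated degree limit.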

48 Personal social network capital management
What is a friend's social capital to me?
Am I losing an 'important' friend?
It can also show the evolution of my social network.
How many people are in my personal network?
What types of unique colleagues can my friend Chris help me connect to?
Analyzing the existing social networks of every employee makes it possible to find the shortest path to any colleague.
Evolutionary personal social network

49 Network Value Analysis: First Large-Scale Economic Social Network Study
Productivity effect from network variables:
An additional person in network size ~ $986 revenue per year
Each person that can be reached in 3 steps ~ $0.163 in revenue per month
A link to manager ~ $1074 in revenue per month
1 standard deviation of network diversity (1 - constraint) ~ $758
1 standard deviation of btw (betweenness) ~ -$300K
1 strong link ~ -$7.9 per month
Structurally diverse networks with an abundance of structural holes are associated with higher performance. Having diverse friends helps.
Betweenness is negatively correlated with performance when measured across people but highly positively correlated when measured across projects. Being a bridge between a lot of people is a bottleneck. Being a bridge between a lot of projects is good.
Network reach is highly correlated with performance. The number of people reachable in 3 steps is positively correlated with higher performance.
Having too many strong links (the same set of people one communicates with frequently) is negatively correlated with performance. Perhaps frequent communication with the same person implies redundant information exchange.

58 Dynamics of Information Graphs in Social Media
Motivation: Info morph: new links keep emerging to give new meaning to existing phrases.
Approach: Compare characteristics of metapaths between nodes in heterogeneous networks.
Example (Weibo): "Peace West King from Chongqing fell from power, still need to sing red songs?"
Example: "Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs."
[Figure: Entity morph resolution accuracy (ACL 2013)]

62 Measuring Human Essential Traits in Social Media
Personality: Mapping personal/organizational social media postings to scores of the Big 5 personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism).
Needs: Mapping personal/organizational social media postings to scores of Harmony, Curiosity, Self-expression, Ideal, Excitement, and Closeness.
Values: Mapping personal/organizational social media postings to scores of Self-Enhance, Conservation, Open-to-Change, Hedonism, and Self-Transcend.
Trustingness and Trustworthiness: Derived from the interaction and propagation history between the user, his followers, and the people he follows.
Influence: Total attention received by the user as leader across all discovered flows.
[Figure: Precision-recall performance of predicting info propagation by different features (our proposed influence index: FLOWER)]

63 Flow Analytics - I
Topic cluster tree: shows how the contents of sequences are related to each other.
Timeline view: shows how users of different characteristics responded in each sequence.
MDS view: shows how anomalies are distributed.
Feature and State view: shows the features of a sequence, and how they transition from one state to another.
