6 Introduction & Motivation
Generic problem: the huge amounts of data available nowadays pose problems for analysis with regular hardware and/or software.
Solution: emerging technologies, such as modern models for parallel computing, multicore computers, or even clusters of computers, can be very useful for analyzing massive network data.

7 Tutorial Overview & Contributions
1. Aggregation of information:
   a. What tools to use for analyzing large social networks
   b. What algorithms are already implemented in these tools
   c. Several tools: advantages and disadvantages
2. Implementation examples of algorithms for large-scale social network analysis, with some results:
   a. Community detection algorithm implemented in the Green-Marl language
   b. Similarity ranking algorithm, also implemented in the Green-Marl language

16 Software Tools Advantages & Disadvantages

Pegasus
- Advantages: similar positive points to Hadoop MapReduce.
- Disadvantages: similar negative points to Hadoop MapReduce.

GraphLab
- Advantages: algorithms can be described in a node-centric way, where the same computation is repeatedly performed on every node; significant amounts of computation are performed on each node; can be used for any graph as long as it is sparse.
- Disadvantages: programmability, since the user must restructure the algorithm in a node-centric way; the runtime system adds overhead when the amount of computation performed at each node is small; for small-world graphs, GraphLab's locking scheme may suffer from frequent conflicts.

Giraph
- Advantages: several advantages over MapReduce, as mentioned by Martella (2012): computation is stateful; disk is hit only for checkpoints; no sorting is necessary; only messages hit the network.
- Disadvantages: still in a very immature phase of development; lacks a complete algorithm library.

SNAP
- Advantages: optimized for graph processing; written in C++, which is intrinsically considered a fast language.
- Disadvantages: not developed to take advantage of parallel or distributed processing of tasks; some algorithms can be time consuming even for relatively small graphs due to the number of graph characteristics covered (e.g., the centrality algorithm).
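The node-centric model mentioned above (as in GraphLab and Giraph) can be made concrete with a small sketch: in each superstep the same computation runs on every node, and only messages travel between nodes. The following PageRank loop is an illustrative Python sketch of that style, not code from any of the listed tools; the graph and names are made up.

```python
# A minimal sketch of the node-centric (Pregel/Giraph-style) model: the SAME
# computation runs on every node each superstep, and only messages move.

def pagerank_supersteps(out_edges, num_steps=20, d=0.85):
    """out_edges: dict node -> list of successor nodes (directed graph)."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}           # per-node state
    for _ in range(num_steps):                        # one BSP superstep per loop
        inbox = {v: 0.0 for v in out_edges}           # messages received this step
        for v, targets in out_edges.items():          # every node runs the same code
            if targets:
                share = rank[v] / len(targets)
                for t in targets:                     # send rank share as messages
                    inbox[t] += share
        # combine incoming messages into each node's new state
        rank = {v: (1 - d) / n + d * inbox[v] for v in out_edges}
    return rank

# tiny illustrative graph
g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_supersteps(g)
```

In a real Giraph job the inner per-node body would be the `compute()` method and the inbox would be the message queue; the sketch above only mimics that structure on a single machine.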

19 Software Tools Case Studies - Metrics and their practical use
Triangles: involved in the computation of one of the main statistical properties used to describe large graphs met in practice, the clustering coefficient of a node.
K-Core: the concept of a k-core was introduced to study the clustering structure of social networks and to describe the evolution of random graphs. It has also been applied in bioinformatics and network visualization.
Friends of Friends: this algorithm applies well to commercial data networks, where the results can serve as the basis for a recommender system.
Centrality Measures: centrality algorithms have wide application in several areas, including psychology, anthropology, business and communications, and ecology, among many others.
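The link between triangles and the clustering coefficient can be shown in a few lines: the local clustering coefficient of a node is the number of triangles through it divided by the number of possible neighbor pairs. The sketch below is illustrative only; the graph and names are made up.

```python
# Count the triangles through each node and derive its local clustering
# coefficient: tri / (k choose 2), where k is the node's degree.

from itertools import combinations

def local_clustering(adj):
    """adj: dict node -> set of neighbors (undirected graph)."""
    coeff = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeff[v] = 0.0
            continue
        # a triangle through v is a pair of v's neighbors that are themselves linked
        tri = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
        coeff[v] = 2.0 * tri / (k * (k - 1))
    return coeff

# a 4-clique member plus one pendant node
adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c", "e"},
    "e": {"d"},
}
cc = local_clustering(adj)
```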

25 Algorithm Developments Green-Marl Language
Green-Marl is a DSL in which a user can describe a graph analysis algorithm in an intuitive way. The DSL captures the high-level semantics of the algorithm as well as its inherent parallelism.
The Green-Marl compiler applies a set of optimizations and parallelizations, enabled by the high-level semantic information of the DSL, and produces an optimized parallel implementation targeted at commodity SMP machines.
It is an interdisciplinary DSL approach to solving computational problems that combines graph theory, compilers, parallel programming, and computer architecture.

26 Algorithm Developments Green-Marl Language - Available Algorithms

Algorithm         | Brief Description                                                                                                    | OpenMP C++ compatible | Giraph/GPS compatible
avg_teen_count    | Computes the average teen count of a node                                                                            | YES | YES
bc                | Computes the betweenness centrality value for the graph                                                              | YES | NO
bc_random         | Computes an estimation of the betweenness centrality value for the graph                                             | YES | YES
communities       | Computes the different communities in a graph                                                                        | YES | NO
kosaraju          | Finds strongly connected components using Kosaraju's algorithm                                                       | YES | NO
pagerank          | Computes the pagerank value for every node in the graph                                                              | YES | YES
potential-friends | Computes a set of potential friends for every node using triangle closing                                            | YES | NO
sssp              | Computes the distance of every node from one destination node according to the shortest path                         | YES | YES
sssp_path         | Computes the shortest paths from one destination node to every other node in the graph and returns the shortest path to a specific node | YES | NO
triangle_counting | Computes the number of closed triangles in the graph                                                                 | YES | NO
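For the unweighted case, the kind of single-source shortest-path computation listed in the table reduces to breadth-first search. The following is an illustrative Python sketch (not the Green-Marl implementation); the graph and names are made up.

```python
# Unweighted single-source shortest paths via BFS: the first time a node is
# reached is along a shortest route, so its distance can be fixed immediately.

from collections import deque

def sssp_unweighted(adj, source):
    """adj: dict node -> iterable of neighbors; returns dict node -> distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:              # first visit = shortest distance
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist                            # unreachable nodes are simply absent

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": []}
d = sssp_unweighted(adj, "a")
```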

28 Algorithm Developments Community Detection
Community detection is known to be an NP-complete problem. It can be related to graph partitioning, and there are good parallel algorithms for graph partitioning, but community detection usually relies on whatever parallelism can be extracted from sequential algorithms. Both the top-down (divisive) approach and the bottom-up (agglomerative) approach have an inherently sequential flow, with more parallelism available in the first stages than in the later ones. Because of the high computational overhead of community detection algorithms, one cannot usually apply them to networks of hundreds of millions of nodes or edges. Thus, an efficient, high-quality (modularity-based) algorithm for community detection is hard to achieve and remains a challenging problem, as mentioned by Soman and Narang (2011).
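One simple heuristic that sidesteps much of this sequential bottleneck is label propagation, where each node repeatedly adopts the most common label among its neighbors. This is not the tutorial's Green-Marl implementation, only an illustrative Python sketch of the node-centric style such algorithms can take; the graph, tie-breaking rule, and names are made up.

```python
# Label propagation community detection: every node starts in its own
# community and repeatedly adopts the label most common among its neighbors.

from collections import Counter

def label_propagation(adj, max_iters=20):
    """adj: dict node -> set of neighbors (undirected graph)."""
    labels = {v: v for v in adj}                     # every node starts alone
    for _ in range(max_iters):
        changed = False
        for v in sorted(adj):                        # fixed order keeps this deterministic
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            if counts.get(labels[v], 0) == top:
                continue                             # current label is already a winner
            labels[v] = max(l for l, c in counts.items() if c == top)  # arbitrary tie-break
            changed = True
        if not changed:                              # converged early
            break
    return labels

# two 4-cliques joined by a single bridge edge (4-5)
adj = {
    1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5},
    5: {4, 6, 7, 8}, 6: {5, 7, 8}, 7: {5, 6, 8}, 8: {5, 6, 7},
}
communities = label_propagation(adj)
```

Each node's update depends only on its neighbors' labels, which is why this family of heuristics parallelizes far more naturally than the divisive or agglomerative schemes discussed above, at the cost of weaker quality guarantees.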

29 Algorithm Developments Similarity Ranking Algorithm
SimRank, proposed by Jeh and Widom (2002), has become a standard measure for comparing the similarity between two nodes using network structure. Although SimRank is applicable to a wide range of areas, such as social networks, citation networks, and link prediction, it suffers from heavy computational complexity and space requirements. The basic recursive intuition behind the SimRank approach is that two objects are similar if they are referenced by similar objects. Being an algorithm with O(n²) time complexity, where n is the number of nodes in the graph, it is a good candidate for development in distributed computing environments.
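The recursive intuition above can be written down directly: the similarity of two distinct nodes is a decayed average of the similarities of all pairs of nodes that reference them. The naive all-pairs iteration below is an illustrative Python sketch of the definition (and of why it is expensive), not an optimized or distributed implementation; the graph and names are made up.

```python
# Naive SimRank iteration: s(a,a) = 1, and for a != b,
# s(a,b) = C / (|I(a)| * |I(b)|) * sum over (i,j) in I(a) x I(b) of s(i,j),
# where I(x) is the set of nodes that reference x and C is a decay factor.

def simrank(in_nbrs, C=0.8, iters=10):
    """in_nbrs: dict node -> list of in-neighbors (who references the node)."""
    nodes = list(in_nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    # average similarity over all pairs of referencing nodes
                    total = sum(sim[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                    new[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    new[(a, b)] = 0.0      # a node nobody references
        sim = new
    return sim

# "univ" references both professors; each professor references one student
in_nbrs = {
    "univ": [], "profA": ["univ"], "profB": ["univ"],
    "studA": ["profA"], "studB": ["profB"],
}
s = simrank(in_nbrs)
```

Storing all n² pairs and summing over all in-neighbor pairs each iteration is exactly the space and time burden the slide refers to, and also why the computation distributes well: each pair's update is independent within an iteration.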

42 Summary & Conclusions
One goal of this part of the tutorial was to show which tools to look for when dealing with big graph studies:
- We introduced the tools used nowadays for distributed graph analysis.
- We presented practical examples of algorithms that leverage the tools' potential for large-scale graph studies.
Another tutorial goal was to demonstrate the utility and diversity of the tools and algorithms available for graph studies:
- We learned that the increasing number of DSLs for big graph analysis narrows the choice of programming languages down to two generic languages, C++ and Java.
- The Green-Marl language proved a great addition to the set of available tools, and some implementation results are given in this tutorial.

43 Summary & Conclusions - Support Documents
Large Scale Social Networks Analysis thesis document available for download at: _Analysis_-_2013_-_Aftermath.pdf
Code available for download:
