INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING

Transcription

1 Returning to Java Grande: High Performance Architecture for Big Data
From Clouds and Big Data to Exascale and Beyond
Cetraro (Italy) July
Geoffrey Fox
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

2 Abstract
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce the runtime and architectures they need. We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks as the kernel big data applications. We suggest that one must unify HPC with the well known Apache software stack, which is widely used in modern cloud computing and is surely the most widely used data processing framework in the real world. We give some examples including clustering, deep learning and multidimensional scaling. This work suggests the value of a high performance Java (Grande) runtime that supports both simulations and big data.

9 Would like to capture the essence of these use cases
- As small kernels or mini apps
- Or classify applications into patterns
- Do it from an HPC background, not a database viewpoint, e.g. focus on cases with detailed analytics
- Section 5 of my class https://bigdatacoursespring2014.appspot.com/preview classifies 51 use cases with ogre facets

12 51 Use Cases: What is Parallelism Over?
- People: either the users (but see below) or subjects of application and often both
- Decision makers like researchers or doctors (users of application)
- Items such as Images, EMR, Sequences below; observations or contents of online store
  - Images or Electronic Information nuggets
  - EMR: Electronic Medical Records (often similar to people parallelism)
  - Protein or Gene Sequences
  - Material properties, Manufactured Object specifications, etc., in custom dataset
  - Modelled entities like vehicles and people
- Sensors (Internet of Things)
- Events such as detected anomalies in telescope or credit card data or atmosphere
- (Complex) Nodes in RDF Graph
- Simple nodes as in a learning network
- Tweets, Blogs, Documents, Web Pages, etc., and characters/words in them
- Files or data to be backed up, moved or assigned metadata
- Particles/cells/mesh points as in parallel simulations

13 51 Use Cases: Low Level (Run time) Computational Types
- PP(26): Pleasingly Parallel or Map Only
- MR(18 + 7 MRStat): Classic MapReduce
- MRStat(7): Simple version of MR where key computations are simple reductions, as found in statistical averages
- MRIter(23): Iterative MapReduce or MPI
- Graph(9): Complex graph data structure needed in analysis
- Fusion(11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
- Streaming(41): Some data comes in incrementally and is processed this way
(Count) out of 51
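The MRStat type can be illustrated with a minimal sketch, assuming the data is already split into partitions held as Python lists. The function names (map_partition, combine, mean_and_variance) are hypothetical and not taken from the talk or from any MapReduce framework. Each map call summarizes one partition, and the reduce is a simple component-wise sum, which is what makes the statistical average cheap to parallelize.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reduce step: partial summaries add component-wise, in any order."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Combine all partition summaries and derive mean and population variance."""
    n, s, ss = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

# Illustrative data only: three partitions of a small dataset.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

Because the reduction only ever adds three numbers per partition, the communication cost is independent of the partition size.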

25 SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Getting High Performance on Data Analytics
On the systems side, we have two principles:
- The Apache Big Data Stack (ABDS) with ~120 projects has important broad functionality with a vital large support organization
- HPC including MPI has striking success in delivering high performance, however with a fragile sustainability model
There are key systems abstractions, which are levels in the HPC-ABDS software stack, where the Apache approach needs careful integration with HPC:
- Resource management
- Storage
- Programming model: horizontal scaling parallelism
- Collective and point-to-point communication
- Support of iteration
- Data interface (not just key-value)
In application areas, we define application abstractions to support:
- Graphs/network
- Geospatial
- Genes
- Images, etc.

27 Useful Set of Analytics Architectures
- Pleasingly Parallel: including local machine learning, as in parallel over images and apply image processing to each image. Hadoop could be used, but so could many other HTC and many-task tools
- Search: including collaborative filtering and motif finding, implemented using classic MapReduce (Hadoop)
- Map Collective or Iterative MapReduce using collective communication (clustering): Hadoop with Harp, Spark, ...
- Map Communication or Iterative Giraph: (MapReduce) with point-to-point communication (most graph algorithms such as maximum clique, connected component, finding diameter, community detection). These vary in difficulty of finding a partitioning (classic parallel load balancing)
- Shared memory: thread-based (event-driven) graph algorithms (shortest path, betweenness centrality)
Ideas like workflow are orthogonal to this
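The Map Collective pattern named above for clustering can be sketched as a one-dimensional k-means loop. This is a single-process illustration under stated assumptions: the partitions are plain Python lists, and the allreduce function is a hypothetical stand-in for a collective operation, not the Harp, Spark or MPI API. Each iteration maps over the partitions to get per-center partial sums, then combines them with one collective sum before updating the centers.

```python
def assign_and_sum(points, centers):
    """Map step on one partition: per-center partial sums and counts."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in points:
        # Index of the nearest center for this point.
        k = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
        sums[k] += x
        counts[k] += 1
    return sums, counts

def allreduce(partials):
    """Collective step (stand-in): element-wise sum over all partitions."""
    k = len(partials[0][0])
    sums = [sum(p[0][j] for p in partials) for j in range(k)]
    counts = [sum(p[1][j] for p in partials) for j in range(k)]
    return sums, counts

def kmeans(partitions, centers, iterations=10):
    """Iterate map + collective; empty clusters keep their old center."""
    for _ in range(iterations):
        partials = [assign_and_sum(part, centers) for part in partitions]
        sums, counts = allreduce(partials)
        centers = [s / c if c else old
                   for s, c, old in zip(sums, counts, centers)]
    return centers

# Illustrative data only: two partitions, two well separated groups.
partitions = [[1.0, 1.2, 9.0], [0.8, 9.2, 8.8]]
centers = kmeans(partitions, centers=[0.0, 5.0])
print(centers)
```

Only the small per-center sums cross partition boundaries, which is why a collective (rather than a shuffle of all points) suits this class of algorithm.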

32 One Facet of Ogres has Computational Features
a) Flops per byte
b) Communication interconnect requirements
c) Is the application (graph) constant or dynamic?
d) Most applications consist of a set of interconnected entities; is this regular as a set of pixels or is it a complicated irregular graph?
e) Is communication BSP or asynchronous? In the latter case shared memory may be attractive
f) Are algorithms iterative or not?
g) Data abstraction: key-value, pixel, graph, vector. Are data points in metric or non-metric spaces?
h) Core libraries needed: matrix-matrix/vector algebra, conjugate gradient, reduction, broadcast
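As a rough illustration of the flops per byte feature, the sketch below compares two of the core library kernels for dense n by n matrices. It assumes 8-byte words and an idealized memory model in which every operand is moved exactly once; the function names are hypothetical, and real machines will deviate from these counts.

```python
def matvec_intensity(n, bytes_per_word=8):
    """Flops per byte for y = A x with a dense n x n matrix A."""
    flops = 2 * n * n                         # one multiply and one add per entry
    data = bytes_per_word * (n * n + 2 * n)   # read A and x, write y
    return flops / data

def matmul_intensity(n, bytes_per_word=8):
    """Flops per byte for C = A B with dense n x n matrices."""
    flops = 2 * n ** 3
    data = bytes_per_word * 3 * n * n         # read A and B, write C
    return flops / data

for n in (100, 1200):
    print(n, matvec_intensity(n), matmul_intensity(n))
```

Under these assumptions matrix-vector products stay just below 0.25 flops per byte for any n, while matrix-matrix products grow as n/12, which is one reason full matrix formulations map well onto compute-rich hardware.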

33 Data Source and Style Facet of Ogres
(i) SQL
(ii) NoSQL based
(iii) Other enterprise data systems (10 examples from Bob Marcus)
(iv) Set of files (as managed in iRODS)
(v) Internet of Things
(vi) Streaming
(vii) HPC simulations
(viii) Involve GIS (Geographical Information Systems)
Before data gets to the compute system, there is often an initial data gathering phase which is characterized by a block size and timing. Block size varies from month (Remote Sensing, Seismic) to day (genomic) to seconds or lower (Real time control, streaming).
There are storage/compute system styles: Shared, Dedicated, Permanent, Transient.
Other characteristics are needed for permanent auxiliary/comparison datasets and these could be interdisciplinary, implying nontrivial data movement/replication.

39 Cluster Count v. Temperature for LC-MS Data Analysis
- All start with one cluster at far left
- T=1 is special, as measurement errors are divided out
- DA2D counts clusters with 1 member as clusters; DAVS(2) does not

43 Comparison of Data Analytics with Simulation I
- Pleasingly parallel is often important in both
- Both are often SPMD and BSP
- Non-iterative MapReduce is a major big data paradigm, not a common simulation paradigm except where Reduce summarizes pleasingly parallel execution
- Big Data often has large collective communication; classic simulation has a lot of smallish point-to-point messages
- Simulation has dominantly sparse (nearest neighbor) data structures
- Bag of words (users, rankings, images, ...) algorithms are sparse, as is PageRank
- Important data analytics involves full matrix algorithms

44 Comparison of Data Analytics with Simulation II
- There are similarities between some graph problems and particle simulations with a strange cutoff force. Both are Map Communication
- Note many big data problems are long range force, as all points are linked
  - Easiest to parallelize. Often full matrix algorithms, e.g. in DNA sequence studies, distance (i, j) defined by BLAST, Smith Waterman, etc., between all sequences i, j
  - Opportunity for fast multipole ideas in big data
- In image-based deep learning, neural network weights are block sparse (corresponding to links to pixel blocks) but can be formulated as full matrix operations on GPUs and MPI in blocks
- In HPC benchmarking, Linpack is being challenged by a new sparse conjugate gradient benchmark, HPCG, while I am diligently using non-sparse conjugate gradient solvers in clustering and multidimensional scaling
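A non-sparse conjugate gradient solver of the kind mentioned above can be sketched in a few lines. This is a textbook version for a dense symmetric positive definite matrix stored as nested Python lists, not the solver used in the author's clustering or multidimensional scaling codes; the matrix, right-hand side, and function names are illustrative assumptions.

```python
def matvec(A, x):
    """Dense matrix-vector product: the full matrix kernel that dominates cost."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(u, v):
    """Inner product; in a parallel code this would be a reduction."""
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-12, max_iter=None):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    n = len(b)
    max_iter = max_iter or 10 * n
    x = [0.0] * n
    r = list(b)              # residual b - A x for the zero initial guess
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Illustrative system only: symmetric, diagonally dominant, so positive definite.
A = [[4.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 5.0]]
b = [6.0, 5.0, 7.0]
x = conjugate_gradient(A, b)
print(x)
```

Each iteration needs one dense matrix-vector product and a few inner products, so in a distributed setting the pattern is again map plus collective reduction.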

45 Java Grande

46 Java Grande
- We once tried to encourage use of Java in HPC with the Java Grande Forum, but Fortran, C and C++ remain the central HPC languages
  - Not helped by .com and Sun collapse in
- The pure Java CartaBlanca, a 2005 R&D100 award winning project, was an early successful example of HPC use of Java, in a simulation tool for non-linear physics on unstructured grids
- Of course Java is a major language in ABDS and, as data analysis and simulation are naturally linked, we should consider broader use of Java
- Using Habanero Java (from Rice University) for threads and mpijava or FastMPJ for MPI, we are gathering a collection of high performance parallel Java analytics
  - Converted from C#, and sequential Java is faster than sequential C#
- So will have either Hadoop+Harp or classic Threads/MPI versions in a Java Grande version of Mahout
