Transcription

1 Big Data Frameworks Course Prof. Sasu Tarkoma

2 Contents
- Course Overview
- Lectures
- Assignments/Exercises

3 Course Overview
This course examines current and emerging Big Data frameworks with a focus on Data Science applications. The course starts with an introduction to MapReduce-based systems and then focuses on Spark and the Berkeley Data Analytics Stack (BDAS) architecture. The course covers traditional MapReduce processing, streaming operation, machine learning, and SQL integration. The course consists of lectures and assignments, with the emphasis on the assignments/exercises: running distributed code!
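The MapReduce model the course opens with can be sketched with plain Scala collections, no cluster required. This is an illustrative sketch, not course material: the map phase emits (word, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group.

```scala
// Minimal sketch of the MapReduce model on plain Scala collections.
object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data frameworks", "big data")

    val mapped  = lines.flatMap(_.split(" ")).map(w => (w, 1))          // map phase
    val grouped = mapped.groupBy { case (w, _) => w }                   // "shuffle": group by key
    val reduced = grouped.map { case (w, ps) => (w, ps.map(_._2).sum) } // reduce phase

    reduced.foreach(println) // counts per word, e.g. (big,2), (data,2), (frameworks,1)
  }
}
```

In a real framework the same three phases run partitioned across machines; the shuffle is the step that moves data over the network.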

4 Data Science Education
Data Science study profile: an MSc-level programme that combines elements from different subfields of computer science and trains new generations of data scientists for industry, academia, and administration. Organized jointly by two sub-programmes of the Department of Computer Science:
- the Algorithms, Data Analytics and Machine Learning sub-programme
- the Networking and Services sub-programme
The language of education is English.

11 Assignments/Exercises
- Environment: Spark 1.2 with Scala 2.10; no support for Python!
- Scala IDE for Eclipse is recommended; check the Spark version (there are large differences between versions)
- Weekly exercise problem sheet, with detailed instructions provided in the sheet
- Completed questions contribute to the grade; total points determine 40% of the grade
- The last problem sheet is more involved and contributes more points
- Moodle is used to return the answers
- IRCnet channel #tkt-bdf
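The kind of distributed code the exercises involve can be sketched as a word count against the course environment (Spark 1.2, Scala 2.10). This is a hedged sketch, not an exercise solution: the input path and the local master setting are placeholders for illustration.

```scala
// Word count with the Spark 1.2 RDD API (Scala 2.10).
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[2]" runs with two local threads; on the course cluster the
    // master would be set by the submission environment instead.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("input.txt")   // placeholder input path
      .flatMap(_.split("\\s+"))             // map: line -> words
      .map(word => (word, 1))               // map: word -> (word, 1)
      .reduceByKey(_ + _)                   // distributed reduce per key

    counts.collect().foreach(println)
    sc.stop()
  }
}
```

Compared with the plain-collections sketch earlier, `reduceByKey` combines values per partition before the shuffle, which is what keeps the network cost manageable on large inputs.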

13 Grading
Course grading is based on the final exam and the assignments/exercises: the exam counts for 60% and the exercises for 40% of the grade. Exam: Friday at :00 in B123.

14 Learning goals by main theme (prerequisites; approaches / meets / deepens learning goals)

Theme: Big Data Frameworks: definitions and systems
  Prerequisites: Basics of data communications and distributed systems (Introduction to Data Communications, Distributed Systems)
  Approaches learning goals: Knowledge of how to define the concepts of MapReduce and its variants and state their central features; ability to describe at least one system in detail
  Meets learning goals: Ability to compare different Big Data frameworks in a qualitative manner; ability to assess the suitability of different systems to different use cases
  Deepens learning goals: Ability to give one's own definition of the central concepts and discuss the key design and deployment issues

Theme: Internal operation and implementation of a Big Data framework
  Prerequisites: Basics of data communications and distributed systems (Introduction to Data Communications, Distributed Systems); Big-O notation and basics of algorithmic complexity; basics of reliability in distributed systems
  Approaches learning goals: Knowledge of the design- and implementation-level concepts of Big Data frameworks, specifically Hadoop and Spark; knowledge of how distributed state is maintained and synchronized; understanding of the communication and computational costs in Big Data processing
  Meets learning goals: Ability to compare different Big Data frameworks based on their design and implementation; ability to design distributed Big Data systems building on existing frameworks for batch and streaming processing; knowledge of key performance issues and the ability to analyze these systems
  Deepens learning goals: Knowledge of designing a Big Data platform for a given problem; familiarity with the state of the art; ability to describe at least one algorithm in detail; knowledge of the most important factors pertaining to reliability

Theme: Distributed algorithms for Big Data frameworks
  Prerequisites: Basics of algorithm design and machine learning
  Approaches learning goals: Knowledge of the basic design of a distributed algorithm for MapReduce and Spark
  Meets learning goals: Ability to use graph processing and machine learning in a distributed cluster environment; ability to design and implement a solution that uses distributed algorithms for a large dataset; ability to create both batch and streaming solutions
  Deepens learning goals: Design and implementation of a new machine learning algorithm for Big Data; familiarity with the state of the art

Theme: Data Science applications
  Prerequisites: -
  Approaches learning goals: Knowledge of the basic Data Science use cases based on Big Data frameworks
  Meets learning goals: Knowledge of at least two Data Science use cases and how they use the Big Data framework; knowledge of Data Science pipelines
  Deepens learning goals: Familiarity with the state of the art; automation of Data Science pipelines
