3 Big data was big news in 2012 and probably in 2013 too. The Harvard Business Review calls it "The Management Revolution". The Wall Street Journal says "Meet the New Big Data" and "Big Data Is on the Rise, Bringing Big Questions".

7 Where Does Big Data Come From?
Big Data is not a specific application type, but rather a trend, or even a collection of trends, spanning multiple application types.
Data is growing in multiple ways:
More data (volume of data)
More types of data (variety of data)
Faster ingest of data (velocity of data)
More accessibility of data (internet, instruments, ...)
Data growth and availability exceed an organization's ability to make intelligent decisions based on it.
Addison Snell, CEO, Intersect360 Research

8 Data Is Big If It Is Measured in MW
A good sweet spot for a data center is 15 MW.
Facebook's leased data centers are typically between 2.5 MW and 6.0 MW.
Facebook's Prineville data center is 30 MW.
Google's computing infrastructure uses 260 MW.
Robert Grossman, Collin Bennett, University of Chicago, Open Data Group

9 Jim Gray's Vision
"We have to do better at producing tools to support the whole research cycle, from data capture and data curation to data analysis and data visualization. Today, the tools for capturing data, both at the mega-scale and at the milli-scale, are just dreadful. After you have captured the data, you need to curate it before you can start doing any kind of data analysis, and we lack good tools for both data curation and data analysis. Then comes the publication of the results of your research, and the published literature is just the tip of the data iceberg. By this I mean that people collect a lot of data and then reduce this down to some number of column inches in Science or Nature, or 10 pages if it is a computer science person writing. So what I mean by data iceberg is that there is a lot of data that is collected but not curated or published in any systematic way."
Based on the transcript of a talk given by Jim Gray to the NRC-CSTB in Mountain View, CA, on January 11, 2007

10 Advice From Jim Gray
1. Analysing Big Data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common 20 queries and make them fast.
4. Go from working to working.
Robert Grossman, Collin Bennett, University of Chicago, Open Data Group

12 How We Define Big Data
"Big" in Big Data refers to:
Big size: the primary definition.
Big complexity rather than big volume: it can be small, and not all large datasets are Big Data.
Size matters... but so does accessibility, interoperability and reusability.
Defining Big Data using 3 Vs, namely: volume, variety, velocity.

13 volume, variety, and velocity
Aggregation that used to be measured in petabytes (PB) is now referenced by a new term: zettabytes (ZB). A zettabyte is a trillion gigabytes (GB), or a billion terabytes (TB). In 2010 we crossed the 1 ZB marker, and at the end of 2011 that number was estimated to be 1.8 ZB.

16 volume, variety, and velocity
The variety characteristic of Big Data is really about trying to capture all of the data that pertains to our decision-making process: making sense out of unstructured data, such as opinions, or analysing images.

18 volume, variety, and velocity
Velocity is the rate at which data is generated, processed, and understood. In other terms: how long does it take you to do something about it, or even to know it has arrived?

19 volume, variety, and velocity
Today, it is possible using real-time analytics to optimize Like buttons across both websites and Facebook. Facebook uses anonymised data to show you the number of times people saw Like buttons, clicked Like buttons, saw Like stories on Facebook, and clicked Like stories to visit a given website.

20 volume, variety, velocity, and veracity
Veracity refers to the quality or trustworthiness of the data. A common complication is that the data is saturated with both useful signals and lots of noise (data that can't be trusted).

25 Data Analytics
Analytics characteristics are not new:
Value: produced when the analytics output is put into action.
Veracity: a measure of accuracy and timeliness.
Quality: well-formed data (missing values, cleanliness).
Latency: time between measurement and availability.
Data types have differing pre-analytics needs.

26 Twitter
Counting: How many requests per day? What's the average latency? How many signups, SMS messages, tweets?
Correlating: Desktop vs. mobile users? What devices fail at the same time? What features get users hooked?
Research: What features get retweeted? Duplicate detection. Sentiment analysis.
Copyright 2011 GigaSpaces Ltd. All Rights Reserved
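The "counting" workload above is the simplest streaming pattern: maintain running aggregates over an unbounded event stream in a single pass. The following is a minimal Python sketch of that idea; the event fields and per-day bucketing are hypothetical, not Twitter's actual pipeline:

from collections import defaultdict

class StreamingCounts:
    """Running request counts and average latency over an unbounded stream."""

    def __init__(self):
        self.requests_per_day = defaultdict(int)
        self.latency_sum = 0.0
        self.latency_n = 0

    def observe(self, day, latency_ms):
        # One pass, O(1) per event: the raw stream is never stored.
        self.requests_per_day[day] += 1
        self.latency_sum += latency_ms
        self.latency_n += 1

    @property
    def avg_latency(self):
        return self.latency_sum / self.latency_n if self.latency_n else 0.0

stats = StreamingCounts()
for day, ms in [("2011-06-01", 120), ("2011-06-01", 80), ("2011-06-02", 95)]:
    stats.observe(day, ms)
print(dict(stats.requests_per_day), f"avg latency {stats.avg_latency:.1f} ms")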

34 http://www.utdallas.edu/~chung/sa/2clien
Database Transactions
Transactions are a way to make ACID operations a general commodity [Transaction Processing: Concepts and Techniques, J. Gray and A. Reuter, 1993].
Atomicity: a transaction is an indivisible unit of work, an all-or-nothing proposition; all updates to a database, displays on the clients' screens, message queues; e.g., a salary increase for all 1 million employees, or none.
Consistency: integrity constraints are preserved; if the transaction aborts, the state is unchanged (S -> [T abort] -> S).
Isolation: a transaction's behavior is not affected by other transactions running concurrently (e.g., reserving a seat); enforced by serialization techniques.
Durability: persistence; a transaction's effects are permanent after it commits.
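As a minimal illustration of atomicity, the sketch below uses Python's standard-library sqlite3 module (the payroll database and employees table are hypothetical) to apply the slide's "salary increase for all 1 million employees or none": the update either commits for every row or rolls back entirely.

import sqlite3

conn = sqlite3.connect("payroll.db")  # hypothetical database file
try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE employees SET salary = salary * 1.05")
except sqlite3.Error:
    # Rollback has already happened: not a single employee row was changed.
    print("salary update aborted; database state unchanged")
finally:
    conn.close()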

50 Real-time data analytics
Apache Storm is a free and open source distributed real-time computation system. It makes it easy to reliably process unbounded streams of data. It is simple and can be used with any programming language. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. Storm integrates with the queueing and database technologies you already use.
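The core abstraction Storm popularised is a topology: spouts (sources) feeding bolts (transformations) over an unbounded tuple stream. The following is a framework-free Python analogue of that idea, not the Storm API; the sentences and stage names are hypothetical:

from collections import Counter
import itertools

def sentence_spout():
    """Unbounded source: in Storm this role is played by a spout reading
    from a queue; here we simply cycle a fixed list forever."""
    sentences = ["the cow jumped over the moon", "the man went to the store"]
    yield from itertools.cycle(sentences)

def split_bolt(stream):
    """Tokenising stage: one input tuple fans out to many word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream, report_every=10):
    """Stateful counting stage, emitting a running snapshot periodically."""
    counts = Counter()
    for i, word in enumerate(stream, 1):
        counts[word] += 1
        if i % report_every == 0:
            yield dict(counts)

# Wire the "topology" together and observe the first few emissions.
for snapshot in itertools.islice(count_bolt(split_bolt(sentence_spout())), 3):
    print(snapshot)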

53 The problem
TCP was never designed to move large datasets over wide-area high-performance networks. For loading a webpage, TCP is great. For sustained data transfer, it is far from ideal. Most of the time, even though the connection itself is good (say, 45 Mbps), transfers are much slower. There are two reasons for slow transfers over fast connections: latency and packet loss bring TCP-based file transfer to a crawl.
Robert Grossman, University of Chicago, Open Data Group, November 14, 2011
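Two standard rules of thumb make the effect concrete: without window scaling, a single TCP stream is capped at roughly window/RTT, and under loss the Mathis et al. (1997) approximation bounds throughput at about (MSS/RTT) * (1.22/sqrt(p)). A quick back-of-the-envelope in Python, with illustrative numbers:

# Rough upper bounds on single-stream TCP throughput (illustrative values).
window = 64 * 1024 * 8          # classic 64 KB receive window, in bits
rtt = 0.050                     # 50 ms round trip, e.g., coast to coast
print(f"window-limited: {window / rtt / 1e6:.1f} Mbps")   # ~10.5 Mbps

# Mathis et al. approximation: rate ~ (MSS/RTT) * (1.22 / sqrt(p))
mss = 1460 * 8                  # bits per segment
loss = 0.0001                   # 0.01% packet loss
print(f"loss-limited:   {mss / rtt * 1.22 / loss**0.5 / 1e6:.1f} Mbps")  # ~28.5 Mbps

Either bound keeps a single stream well under the 45 Mbps the link could carry, which is exactly the gap the next slide's techniques attack.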

55 The solutions
Use parallel TCP streams: GridFTP.
Use specialized network protocols: UDT, FAST, etc.
Use RAID to stripe data across disks to improve throughput when reading.
These techniques are well understood in HEP and astronomy, but not yet in biology.
Robert Grossman, University of Chicago, Open Data Group, November 14, 2011
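A hedged sketch of the parallel-streams idea (GridFTP implements this natively in its own protocol): split a file into byte ranges and fetch them over several concurrent TCP connections, so no single window- or loss-limited stream bounds the aggregate rate. The URL is hypothetical and the server is assumed to support HTTP Range requests; only the standard library is used:

import concurrent.futures
import urllib.request

URL = "http://example.org/big-dataset.bin"   # hypothetical file
STREAMS = 4

def fetch_range(start, end):
    """Fetch one byte range over its own TCP connection."""
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def parallel_download(size):
    chunk = size // STREAMS
    ranges = [(i * chunk, size - 1 if i == STREAMS - 1 else (i + 1) * chunk - 1)
              for i in range(STREAMS)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
        parts = pool.map(lambda r: fetch_range(*r), ranges)
    # Reassemble the chunks in offset order.
    return b"".join(data for _, data in sorted(parts, key=lambda p: p[0]))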

57 Case study: CGI 60 genomes
Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT: approximately 18 TB at about 0.5 Mbs on a 1G link.
Robert Grossman, University of Chicago, Open Data Group, November 14, 2011

58 How FedEx Has More Bandwidth Than the Internet and When That'll Change
If you're looking to transfer hundreds of gigabytes of data, it's still weirdly faster to ship hard drives via FedEx than it is to transfer the files over the internet. Cisco estimates that total internet traffic currently averages 167 terabits per second. FedEx has a fleet of 654 aircraft with a lift capacity of 26.5 million pounds daily. A solid-state laptop drive weighs about 78 grams and can hold up to a terabyte. That means FedEx is capable of transferring 150 exabytes of data per day, or 14 petabits per second: almost a hundred times the current throughput of the internet.
http://gizmodo.com/ /how-fedex-has-more-bandwidth-than-the-internetand-when-thatll-change
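The arithmetic behind that claim is easy to reproduce; the sketch below simply redoes the article's own estimate from its own figures:

# Reproducing the article's back-of-the-envelope estimate.
lift_lb_per_day = 26.5e6            # FedEx daily lift capacity, pounds
drive_grams = 78                    # one solid-state laptop drive
drive_bytes = 1e12                  # 1 TB per drive

drives_per_day = lift_lb_per_day * 453.6 / drive_grams   # grams per pound
bytes_per_day = drives_per_day * drive_bytes
print(f"{bytes_per_day / 1e18:.0f} EB/day")              # ~154 EB/day

bits_per_second = bytes_per_day * 8 / 86400
print(f"{bits_per_second / 1e15:.0f} Pb/s")              # ~14 Pb/s
print(f"ratio vs internet: {bits_per_second / 167e12:.0f}x")  # ~85x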

60 When to Consider a Big Data Solution
User point of view:
You're limited by your current platform or environment because you can't process the amount of data that you want to process.
You want to involve new sources of data in the analytics, but you can't, because the data doesn't fit into schema-defined rows and columns without sacrificing fidelity or the richness of the data.
You need to ingest data as quickly as possible and need to work with a schema-on-demand.

61 When to Consider a Big Data Solution
You're forced into a schema-on-write approach (the schema must be created before data is loaded), but you need to ingest data quickly, or perhaps in a discovery process, and want the cost benefits of a schema-on-read approach (data is simply copied to the file store, and no special transformation is needed) until you know that you've got something that's ready for analysis; see the sketch below.
Data is arriving at your organization's doorstep too fast for the current analytics platform to handle.
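The schema-on-read idea can be made concrete with a small Python sketch; the landing file and event fields are hypothetical. Raw JSON events are appended untouched at ingest time, and a structure is imposed only when the data is queried:

import json

# Schema-on-write would demand a fixed schema before any record is loaded,
# rejecting or coercing records that don't fit.
# Schema-on-read: just append raw events as they arrive...
def ingest(event_json, path="events.jsonl"):     # hypothetical landing file
    with open(path, "a") as f:
        f.write(event_json + "\n")

# ...and project only the fields you care about at query time.
def query(path="events.jsonl"):
    for line in open(path):
        event = json.loads(line)
        # Impose structure now; unknown fields stay intact in the raw file.
        yield event.get("user"), event.get("ts"), event.get("action")

ingest('{"user": "u1", "ts": 1, "action": "click", "extra": {"x": 1}}')
for row in query():
    print(row)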

62 When to Consider a Big Data Solution
You want to analyse not just raw structured data, but also semi-structured and unstructured data from a wide variety of sources.
You're not satisfied with the effectiveness of your algorithms or models when all, or most, of the data needs to be analysed, or when a sample of the data isn't going to be nearly as effective.

63 When to Consider a Big Data Solution
You aren't completely sure where the investigation will take you, and you want elasticity of compute, storage, and the types of analytics that will be pursued; all of these became useful as we added more sources and new methods.
If your answer to any of these questions is yes, you need to consider a Big Data solution.

65 Scientific e-infrastructure: some challenges to overcome
Collection: How can we make sure that data are collected together with the information necessary to re-use them?
Trust: How can we make informed judgements about whether certain data are authentic and can be trusted? How can we judge which repositories we can trust? How can appropriate access and use of resources be granted or controlled?
Riding the Wave: How Europe Can Gain from the Rising Tide of Scientific Data

66 Scientific e-infrastructure: some challenges to overcome
Usability: How can we move to a situation where non-specialists can overcome the barriers and start sensible work on unfamiliar data?
Interoperability: How can we implement interoperability within disciplines and move to an overarching multi-disciplinary way of understanding and using data? How can we find unfamiliar but relevant data resources, going beyond simple keyword searches to a deeper probing into the data? How can automated tools find the information needed to tackle the data?
Riding the Wave: How Europe Can Gain from the Rising Tide of Scientific Data

67 Scientific e-infrastructure: some challenges to overcome
Diversity: How do we overcome the problems of diversity: heterogeneity of data, but also of backgrounds and data-sharing cultures in the scientific community? How do we deal with the diversity of data repositories and access rules within or between disciplines, and within or across national borders?
Security: How can we guarantee data integrity? How can we avoid data poisoning by individuals or groups intending to bias the data in their own interest?
Riding the Wave: How Europe Can Gain from the Rising Tide of Scientific Data

69 Scientific e-infrastructure: a wish list
Open deposit, allowing user-community centres to store data easily.
Bit-stream preservation, ensuring that data authenticity will be guaranteed for a specified number of years.
Format and content migration, executing CPU-intensive transformations on large data sets at the command of the communities.
Riding the Wave: How Europe Can Gain from the Rising Tide of Scientific Data

70 Scientific e-infrastructure: a wish list
Persistent identification, allowing data centres to register a huge number of markers to track the origins and characteristics of the information.
Metadata support, to allow effective management, use and understanding.
Maintaining proper access rights as the basis of all trust.
A variety of access and curation services that will vary between scientific disciplines and over time.
Riding the Wave: How Europe Can Gain from the Rising Tide of Scientific Data

71 Scientific e-infrastructure: a wish list
Execution services that allow a large group of researchers to operate on the stored data.
High reliability, so researchers can count on its availability.
Regular quality assessment to ensure adherence to all agreements.
Distributed and collaborative authentication, authorisation and accounting.
A high degree of interoperability at the format and semantic level.
Riding the Wave: How Europe Can Gain from the Rising Tide of Scientific Data

High Performance Computing and Big Data, High Performance Computing Curriculum, UvA-SARA, http://www.hpc.uva.nl/