6 Finance, Banking, Insurance: Biggest Investor in IT
- 6.3% of revenue spent on IT (#2), vs. a 3.3% industry average: among the top industries investing in IT
- IT is considered strategic for the business
- Speed and efficiency in trading and risk management, combined with cost control, are key differentiators
- A smaller trend to externalize development

10 High Availability: «Even the most reliable hardware fails»
- Traditional approach: hardware redundancy
  - Expensive hardware that must be replicated
  - Risk during failure; long recovery time
- Big Data approach: data redundancy (data replication, 0-60s?)
  - Built into the core of the architecture
  - Memory and network are faster than disk
  - Use a collocated server as a fast backup system
  - Up to 30% of the hardware can be lost => best service continuity
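As one concrete illustration of data redundancy: in HDFS every block is replicated (three copies by default) across the cluster, so losing a node costs capacity, not data. A minimal sketch of adjusting the replication factor from Java; the namenode address and the /data/trades path are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster address
            FileSystem fs = FileSystem.get(conf);
            // Every file carries its own replication factor; 3 is the usual default.
            // Raising it trades disk space for more failure tolerance and read parallelism.
            fs.setReplication(new Path("/data/trades"), (short) 3);
            fs.close();
        }
    }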

11 What (with Hadoop Wording)?
- Calculation: MapReduce
- Database: HBase
- Query: Pig / Hive
- ETL: Sqoop
- Data mining: Mahout
- Distributed file system: HDFS
- Execution: YARN
- Consensus: ZooKeeper
- All running on the JVM / OS
Quite independent components:
- You can use ZooKeeper alone
- You can use HBase on top of a different file system
- You can do Hive queries with or without HBase
- And so on
Hive gives SQL access, with JDBC. Hadoop is the Linux of Big Data.
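Since the slide notes that Hive exposes SQL over JDBC, here is a minimal sketch of querying Hive from Java through HiveServer2; the host, credentials and the trades table are placeholders, not part of the original deck:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcDemo {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; hive-jdbc must be on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", ""); // hypothetical host
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT product_line, COUNT(*) FROM trades GROUP BY product_line")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }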

16 User API
{rowkey => {family => {qualifier => {version => value}}}}
- Think: nested TreeMap (Java), OrderedDictionary (C#), OrderedDict (Python)
- Basic data operations: GET, PUT, DELETE
- SCAN over a range of key-values: the benefit of the sorted rowkey business; this is how you implement any kind of "complex query" (*)
- GET and SCAN support Filters: push application logic to the RegionServers
- INCREMENT, APPEND, CheckAnd{Put,Delete}: server-side, atomic data operations; can be contentious!
(*) This is also a foundational component of what we refer to as schema design in this schemaless database.
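A minimal sketch of these operations with the HBase Java client; the trades table, the cf family, the qualifiers and the row keys are hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BasicOpsDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("trades"))) {
                byte[] cf = Bytes.toBytes("cf");
                // PUT: write one cell.
                Put put = new Put(Bytes.toBytes("EQ|2014-06-30|T-000042"));
                put.addColumn(cf, Bytes.toBytes("notional"), Bytes.toBytes("1000000"));
                table.put(put);
                // GET: read it back.
                Result r = table.get(new Get(Bytes.toBytes("EQ|2014-06-30|T-000042")));
                System.out.println(Bytes.toString(r.getValue(cf, Bytes.toBytes("notional"))));
                // SCAN a sorted rowkey range: the "complex query" building block.
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes("EQ|2014-06-30"));
                scan.setStopRow(Bytes.toBytes("EQ|2014-07-01"));
                try (ResultScanner rs = table.getScanner(scan)) {
                    for (Result row : rs) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
                // Server-side atomic operation: INCREMENT a counter cell.
                table.incrementColumnValue(Bytes.toBytes("EQ|2014-06-30|T-000042"),
                        cf, Bytes.toBytes("version_count"), 1L);
            }
        }
    }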

17 Big Data Pitfalls and Challenges in Capital Markets
Big Data is about Volume, Velocity and Variety at a cheap price, but:
- What about Verifiability and Veracity? (transactions, i.e. ACID)
- Capital Markets mostly have structured data
- We are doing real-time data, not only batch
- What about legacy systems and education? (our staff speaks mainly SQL)
- We are already doing big data (Data Grid, Compute Grid, Exadata, ...)
- What about the Disaster Recovery Plan, infrastructure, online backup (regulatory)?
- It will be a huge, multi-year project, and we want to deliver new business value quickly!
Let's see some applications of Big Data in Capital Markets.

19 Consolidation Challenge / Too Many Product Models
- As many trade schemas as there are systems, product lines (Equity, Fixed Income, Foreign Exchange, ...) and service layers (FO, BO, ...)
- A real challenge to find a common dictionary, if it is possible at all
- A huge problem for consolidation systems (360° customer view, Risk, Compliance)
- As many column groups per row as there are systems to integrate: common fields plus system-specific fields (*)
- This starts to be Big Data: billions of lines, terabytes of data, many processes in parallel, a one-hour SLA
While this massive amount of data needs common enrichment (data normalization, valuation), aggregation and deep analysis, transactions and schemas are the fundamental issues! A schema sketch follows.
(*) In reality: hundreds of columns
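One hedged way to model "one column group per source system" in HBase (a sketch under my own naming assumptions, not the presenter's actual schema): a table keyed by a global trade id, a common family for normalized fields, and one family per source system for its specific fields:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class ConsolidationSchemaDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                HTableDescriptor table =
                        new HTableDescriptor(TableName.valueOf("consolidated_trades"));
                table.addFamily(new HColumnDescriptor("common"));    // normalized, shared fields
                table.addFamily(new HColumnDescriptor("equity_fo")); // one family per source system
                table.addFamily(new HColumnDescriptor("fx_bo"));
                admin.createTable(table);
            }
        }
    }

Each integrated system then writes only into its own family; consolidation jobs read the common family plus whichever system-specific families they need.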

20 Versioning and Audit of Trade Transactions
In an RDBMS:
- Keep all versions in the same table and flag the last version, or use a history table and keep only the last version of deals
- Difficult to identify the modified fields
- Must be managed by the application (the RDBMS doesn't help here)
- What if the data must be accessed by different systems? (last-version flag, previous fields, modified fields)
Cell-level access:
- Now consider storage not as a static array of data but as a set of cells
- A cell can be accessed and modified independently
- Cells are versioned (timestamped); each revision can be retrieved easily by the API, which makes the modified fields directly visible
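A minimal sketch of reading the revision history of a single cell with the Java client; the trades table, cf family and price qualifier are hypothetical, and the family must have been configured to retain more than one version:

    import java.util.List;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AuditTrailDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("trades"))) {
                Get get = new Get(Bytes.toBytes("T-000042"));
                get.setMaxVersions(10); // ask for up to 10 revisions instead of just the latest
                Result result = table.get(get);
                // Each Cell carries its own timestamp: a built-in audit trail per field.
                List<Cell> revisions =
                        result.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("price"));
                for (Cell cell : revisions) {
                    System.out.println(cell.getTimestamp() + " -> "
                            + Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }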

21 Trade Columnar Representation
- Column versioning: by default, only the last version is returned
- Row atomicity: all KVs for a row are co-located in a region; locks are per row
- Stored in-memory at the region server
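Because locks are per row, a single Put touching several columns of the same row is applied atomically, and check-and-put gives a compare-and-swap on top of that. A sketch with hypothetical table, family, qualifier and status values:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowAtomicityDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("trades"))) {
                byte[] row = Bytes.toBytes("T-000042");
                byte[] cf = Bytes.toBytes("cf");
                // One Put, several columns: readers see all of these cells or none of them.
                Put put = new Put(row);
                put.addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("CONFIRMED"));
                put.addColumn(cf, Bytes.toBytes("confirmed_by"), Bytes.toBytes("backoffice"));
                table.put(put);
                // Compare-and-swap: only cancel the trade if it is still CONFIRMED.
                Put cancel = new Put(row);
                cancel.addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("CANCELLED"));
                boolean applied = table.checkAndPut(row, cf, Bytes.toBytes("status"),
                        Bytes.toBytes("CONFIRMED"), cancel);
                System.out.println("cancelled: " + applied);
            }
        }
    }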

22 Schemaless Model
- No constraint on column qualifiers or values per row
- Data is ordered by row key
- Sharding is managed by HBase
- Proper row key design is critical for system performance (latency/throughput of the read/write usage)
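A sketch of one composite row key design, assuming a lookup pattern of product line, then trade date, then trade id; the layout is illustrative, not the presenter's design:

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyDesignDemo {
        // Composite key: <productLine>|<yyyy-MM-dd>|<tradeId>
        // Scans by product line and date range become simple rowkey range scans,
        // but monotonically increasing dates can hotspot one region: salt if needed.
        static byte[] tradeRowKey(String productLine, String isoDate, String tradeId) {
            return Bytes.toBytes(productLine + "|" + isoDate + "|" + tradeId);
        }

        public static void main(String[] args) {
            System.out.println(Bytes.toString(tradeRowKey("EQ", "2014-06-30", "T-000042")));
        }
    }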

23 How to manage relationships between entities? (1/3)
- HBase has no foreign keys, no joins, and no cross-table transactions. This can make representing relationships between entities... tricky.
- HBase columns can be defined at runtime. A set of dynamically named columns can represent another entity! If you put data into the column name, and expect many such columns in the same row, then logically you have created a nested entity.
- Classic use case in finance: Trade, Trade Events and Cash Flows
- Most of the time you have a 1-n relationship, so nested entities work just fine. But you no longer have aggregate functions (avg, sum, ...). A sketch of the pattern follows.
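A hedged sketch of this nested-entity pattern: cash flows stored as dynamically named columns on their parent trade's row. The qualifier encoding (payment date plus sequence number) and all names are illustrative assumptions:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NestedEntityDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("trades"))) {
                byte[] cashflows = Bytes.toBytes("cashflows"); // hypothetical column family
                Put put = new Put(Bytes.toBytes("T-000042"));
                // Column qualifiers are created at write time: one per cash flow,
                // with the child entity's identity encoded in the column name.
                put.addColumn(cashflows, Bytes.toBytes("2014-09-30#1"), Bytes.toBytes("12500.00"));
                put.addColumn(cashflows, Bytes.toBytes("2014-12-31#2"), Bytes.toBytes("12750.00"));
                table.put(put);
                // One GET returns the trade with all of its nested cash flows;
                // any aggregation (sum, avg, ...) must now be done client-side.
                Result trade = table.get(new Get(Bytes.toBytes("T-000042")));
                trade.getFamilyMap(cashflows).forEach((q, v) ->
                        System.out.println(Bytes.toString(q) + " -> " + Bytes.toString(v)));
            }
        }
    }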
