Introduction to BIG data analytics with Hadoop



Part 1: Understanding Big Data Analytics

"Every day, we create 2.5 quintillion bytes of data (roughly 2.2 exabytes), so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data." - IBM

The 4 dimensions of big data:

1. Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
2. Velocity: For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
3. Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
4. Veracity: 1 in 3 business leaders don't trust the information they use to make decisions. How can you act upon information if you don't trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grow.

Conclusion? Big data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach.

Points to remember:
- Everyone is interested in collecting data.
- Real-time analytics of this large data is a defining characteristic of big data.
- Handling structured, semi-structured and unstructured data.
- Effective visualization of the nuggets found.

How is this different from large data analysis, with which it is often confused? In large data analysis, no one cares about real-time results: the analysis is usually done by experts at their own convenience. This is unproductive, because the rate of inflow of data is huge and the results are delivered very late; in some cases the analysis is available only after the closure of the event.

Opportunity: Organizations in every sector are either collecting or planning to collect data, but they have no automated software solution to help them analyze it and view real-time results. Otherwise, they would simply have to hire data scientists.

Present solutions: the Hortonworks Data Platform
- MapReduce & HDFS: distributed components for processing and analysis [BI/data mining]
- Pig & Hive: database-style queries
- Integration services: port data from external sources via APIs
- HBase: non-SQL (NoSQL) storage
- Oozie, Ambari: data management

"Impressive" Scope There are multiple uses for big data in every industry – from analyzing large volumes of data than was previously possible to drive more precise answers, to analyzing data in motion to capture opportunities that were previously lost. A big data platform will enable your organization to tackle complex problems that previously could not be solved. Part2 :Quick Technology to get you started ? Distributed Computing : MapReduce & HDFS Lets see an example : BigData in Energy & Utility industry What is Distributed computing ? Mapreduce How is it a saviour in BigData approach ? Distributed computing is a field of computer science that studies distributed systems.

A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers.

In parallel computing, all processors may have access to a shared memory in which to exchange information. In distributed computing, each processor has its own private memory (distributed memory); information is exchanged by passing messages between the processors.
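To make the message-passing model concrete, here is a minimal sketch in Java (our own illustration, not part of Hadoop): two threads stand in for two nodes with private memory, and they exchange information over a local TCP socket instead of reading a shared variable. The class name, message, and use of localhost are arbitrary choices for the demo.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Two "nodes" with private memory exchange information by passing
// messages over TCP, rather than by sharing memory.
public class MessagePassingDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0); // bind any free local port
        int port = server.getLocalPort();

        // Node A: waits for a message from its peer.
        Thread nodeA = new Thread(() -> {
            try (Socket peer = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(peer.getInputStream()))) {
                System.out.println("Node A received: " + in.readLine());
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        nodeA.start();

        // Node B: connects and sends its partial result as a message.
        try (Socket socket = new Socket("localhost", port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            out.println("partial result: 42");
        }
        nodeA.join();
        server.close();
    }
}

In a real distributed system the two endpoints would run on separate machines; frameworks such as Hadoop hide this message passing behind higher-level abstractions.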

Hadoop Distributed File System (HDFS)

HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode; a cluster of datanodes forms the HDFS cluster.

This arrangement is typical, but a datanode is not required on every node. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS.

The filesystem uses the TCP/IP layer for communication; clients use remote procedure calls (RPC) to communicate with the namenode.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

Hardware failure: An HDFS instance may consist of hundreds or thousands of commodity machines, so some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming data access: Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use by users.

Large data sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to thousands of nodes in a single cluster. It should support tens of millions of files in a single instance.
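As a small illustration of how an application talks to these large files, here is a minimal sketch using Hadoop's Java FileSystem API; the namenode address hdfs://namenode:9000 and the file path are hypothetical placeholders for your cluster's values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; substitute your cluster's.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS transparently splits it into blocks.
        Path path = new Path("/data/readings.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("sensor-42,2014-01-01,17.3\n");
        }

        // Ask the namenode where the file's blocks physically live.
        FileStatus status = fs.getFileStatus(path);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}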

An ideal file size is a multiple of 64 MB, the default HDFS block size: a file m is broken into n blocks of 64 MB each. Once the split is made, the metadata for each block (unique per block) is recorded by the namenode, while the blocks themselves are stored on datanodes. Each block is then replicated and saved on various nodes across the server farm.
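To make the storage arithmetic concrete (a worked example, assuming the 64 MB default block size and the default replication factor of 3): a 200 MB file is split into ceil(200 / 64) = 4 blocks, three of 64 MB and one of 8 MB. With 3-way replication, 4 × 3 = 12 block replicas end up spread across the datanodes, while the namenode stores only the metadata mapping the file to its blocks.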

Example: find the repeated words "Green" and "orange" in a pool of data. Suppose we divide the data into 3 blocks (m = 3), each of which is further subdivided into various n splits. Every one of the m blocks is replicated on 3 datanodes. We are not getting into detail about the n-node results at the moment.
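A minimal MapReduce sketch of this word-count example in Java follows; it uses the standard Hadoop MapReduce API, but the class name ColorCount, the whitespace tokenization, and the restriction to the two target words are this illustration's own choices.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorCount {

    // Map phase: runs on each block, near where its replicas are stored.
    public static class ColorMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final Set<String> TARGETS =
                new HashSet<>(Arrays.asList("Green", "orange"));
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (TARGETS.contains(token)) { // only the two words we seek
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sums the per-block counts for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color count");
        job.setJarByClass(ColorCount.class);
        job.setMapperClass(ColorMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It could be packaged and launched with something like hadoop jar colorcount.jar ColorCount /input /output (jar name and paths hypothetical). Each mapper processes one block in parallel near where its replicas live; the framework then shuffles the per-word counts to the reducers, which produce the final totals.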