Big Data



Introduction to Big Data

Big Data refers to large corpuses of data coming together: sensors collecting information through web logs, security flows, and monitoring information.

Need for Big Data

Terabytes and petabytes of data are being generated, and the volume of data from sensors, web logs, imagery and streaming devices keeps growing.

Dimensions of Big Data

Big Data spans six dimensions:
- Volume: data at rest
- Velocity: data in motion
- Variety: data in many forms
- Veracity: data in doubt
- Variability: data in many ways
- Complexity: intermingling of data

Classification of Big Data

On the basis of its nature, Big Data can be classified as:
- Unstructured data
- Semi-structured data
- Structured data

Importance of Big Data

Big Data plays an important role in various domains, such as decision making, business intelligence and social development. Studies have shown that businesses which adopted Big Data strategies early and enabled data-driven decision making were able to achieve 5% to 6% greater productivity. This has helped many sectors, such as telecom, retail and manufacturing.

Unstructured Data

- No identifiable structure; consists mainly of loosely structured data.
- This kind of data is inconsistent and always unique.
- Examples: media files, Word files, PDF files, PowerPoint presentations.

Business Intelligence

In business intelligence, data is analyzed for many purposes, such as system log analytics and social media analytics for risk assessment, customer relations, brand management, etc.
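As a toy illustration of the system-log analytics mentioned above (the log lines and format are invented for the example), one might tally HTTP status codes from web-server logs:

```python
from collections import Counter

# Invented sample web-server log lines (method, path, status code).
logs = [
    "GET /home 200",
    "GET /missing 404",
    "POST /login 200",
]

# Count occurrences of each status code across the log.
status_counts = Counter(line.split()[-1] for line in logs)
# status_counts["200"] == 2, status_counts["404"] == 1
```

Real log analytics runs the same kind of aggregation over far larger, messier inputs, which is where distributed processing becomes necessary.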

Big Data analytics supports real-time analysis of data, thereby aiding business intelligence tools.

Semi-Structured Data

- No fixed schema; also known as 'self-describing' data.
- Contains information about the schema of the data it holds.

Structured Data

This type of data is grouped into a relational schema and can be analyzed with simple queries.

Social Development

Industries like healthcare, surveys by government agencies like the NSSO, and NGOs collect data from a cross-section of people. This information is helpful in studying the general social and economic condition of people, and thus helps in planning developmental projects.

Challenges Faced During Big Data Analysis

The analysis of Big Data involves multiple distinct phases, each of which introduces challenges. The top challenges cited are the rate of data growth and the cost and effort required to contain and store it. The various challenges are:

Heterogeneity and Incompleteness

Unlike humans, computers cannot make sense of heterogeneous data directly; machine analysis algorithms expect homogeneous data. Data therefore needs to be structured prior to analysis. To cope with this challenge, Google applies MapReduce to the complex, heterogeneous data it gathers from the internet.

Scale

This refers to the sheer volume of data being accumulated (terabytes and beyond). Working with such volumes requires distributing parts of the problem across multiple machines to be handled in parallel.

Timeliness

This is the acquisition and processing rate required for a given volume of data. Consider a fraudulent credit card transaction: ideally it should be stopped before the transaction takes place at all, which demands the right result at the right time. Hence, velocity is a major challenge.

Privacy

Managing the privacy of data is a major concern for all kinds of organizations; protecting people's privacy is an important part of data analysis.
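The structuring step described under Heterogeneity and Incompleteness can be illustrated with a toy sketch (the field names and records are hypothetical): records arriving from different sources describe the same kind of event with different keys, so each is normalized into one homogeneous schema before analysis.

```python
# Hypothetical heterogeneous records: two sources, two naming schemes.
raw = [
    {"user": "amit", "bytes": "512"},      # source A: strings, "bytes"
    {"username": "neha", "size": 1024},    # source B: ints, "size"
]

def normalize(record):
    # Map either source's fields onto one homogeneous schema.
    return {
        "user": record.get("user") or record.get("username", "unknown"),
        "bytes": int(record.get("bytes") or record.get("size", 0)),
    }

structured = [normalize(r) for r in raw]
# Every record now has the same keys and types, ready for analysis.
```

Systems like MapReduce apply this idea at scale: a structuring pass runs over the raw inputs before any aggregate analysis.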

Security

Lacking security may allow attackers to steal sensitive information assets – intellectual property, credit card numbers, customer databases – commit fraud, or otherwise damage the enterprise. Many Big Data security solutions have emerged in the market, such as Gazzang and Ncrypt™.

Access and Sharing

This refers to the need for proper access rights. A large amount of data is closely held by corporations and is not publicly accessible, because a culture of secrecy exists.

Human Collaboration

Human input is highly valuable at all stages of the analysis pipeline. Despite great advancements in computational analysis of Big Data, the importance of human collaboration is not reduced.

Technology Outlook

This section elaborates on the technologies selected for storing, managing and analyzing Big Data.

Hadoop

Hadoop is an open-source framework that allows distributed storage and processing of large data sets over clusters of computers. It has become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data. The Hadoop libraries themselves are designed to detect and handle failures at the application layer, enabling them to deliver a high-availability service.

Hadoop Distributed File System (HDFS)

HDFS is the file system used by Hadoop and is ideal for storing data on the scale of terabytes and petabytes. HDFS connects the nodes within clusters over which data files are distributed, so that the files can be accessed and stored as one seamless file system. Data in a Hadoop cluster is broken down into smaller pieces called blocks and distributed throughout the cluster; the default block size is 64 MB.

MapReduce

MapReduce is a framework for distributed data processing using the Map-Reduce programming paradigm. Its important innovation is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes.
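The Map-Reduce paradigm can be sketched in a single process (this is an illustration of the idea, not Hadoop's actual API): the map phase emits key-value pairs, and the reduce phase groups by key and aggregates.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key, then sum counts per word.
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

documents = ["big data is big", "data in motion"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(pairs)
# counts["big"] == 2, counts["data"] == 2
```

In real Hadoop, each mapper runs on a different node against a local block of the input, and the framework handles the shuffle between map and reduce; the logic per record is the same.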
Distributing the computation solves the problem of data too large to fit on a single machine.

JobTracker

The master of the system, which manages the jobs and resources in the cluster. If a job fails, it reallocates the job to another node.

TaskTracker

The slaves deployed on each machine. A TaskTracker constantly sends messages to the JobTracker, which help the JobTracker decide whether to delegate a new task to that particular node.

Job History Server

A process that serves up information about historically completed applications, so that the JobTracker does not need to track them.

Hadoop-Based Search Engine: Q-Search

Description of Project

The first major issue in this project was the type of data to be searched. The data we chose is highly unstructured and large in volume: the files contain many pictures and various tags and formatting elements, which need to be removed for effective searching. The second major issue was ensuring that searching completes in optimal time and gives reliable output; for this we used methods such as caching. We also sort the displayed results by number of hits, and to improve the user experience we paginate the displayed results.

Use Case Diagram

Working of Project

User Enters a Search String

Caching Module Checks the Cache

This module is the first to be invoked: it searches the cache for the query entered by the user. On success it redirects the flow to the Display Result module; otherwise it invokes the Create File List module.

Create File List

This module is invoked when the required results are not present in the cache, leading to the generation of the file list.
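The cache-then-search dispatch just described can be sketched as follows (function and variable names are hypothetical, not the project's actual code): on a cache hit the expensive search is skipped entirely; on a miss the search runs and its results are cached for next time.

```python
# In-memory stand-in for the project's cache database.
cache = {}

def handle_query(query, search_fn):
    # Cache hit: go straight to displaying the stored results.
    if query in cache:
        return cache[query]
    # Cache miss: run the full search (file list + HDFS search),
    # then update the cache so the next identical query is fast.
    results = search_fn(query)
    cache[query] = results
    return results
```

For example, calling `handle_query("hadoop", search_fn)` twice invokes `search_fn` only once; the second call is served from the cache.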
The file list is obtained by browsing through the file system.

HDFS Search

This module searches for the key in all files named in the file list obtained from the Create File List module. For each file type, the required driver is identified and run so as to call the corresponding Mappers and Reducers, which generate results and store them in temporary files.

Update Cache

The generated temporary files are read and their results stored in the cache database, for faster retrieval the next time the same word is searched.

Delete Temporary Files

Once the output results are stored in the cache database, this module deletes the generated temporary files, releasing the memory they occupied.

Manage Cache

For every new result added to the cache, this module checks the cache size: when the number of distinct search keys stored in the cache exceeds a fixed limit (5000), it deletes the least frequently used result from the cache, and it updates the hit counter of the currently searched key.

Display Content

This module can also be invoked directly by the Caching Module when the required results are found in the cache. It displays the results in decreasing order of the number of hits.
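The Manage Cache module's least-frequently-used eviction could look roughly like this (a simplified in-memory sketch; the real module works against a cache database, and the 5000-key limit from the text is made a parameter here so smaller limits can be demonstrated):

```python
class LFUCache:
    """Toy least-frequently-used cache with a fixed key limit."""

    def __init__(self, limit=5000):
        self.limit = limit
        self.store = {}   # search key -> cached results
        self.hits = {}    # search key -> access count

    def put(self, key, results):
        # If the cache is full and the key is new, evict the
        # least frequently used key before inserting.
        if key not in self.store and len(self.store) >= self.limit:
            victim = min(self.hits, key=self.hits.get)
            del self.store[victim]
            del self.hits[victim]
        self.store[key] = results
        # Update the counter of the currently searched key.
        self.hits[key] = self.hits.get(key, 0) + 1
```

With `limit=2`, storing keys "a" (twice), "b", then "c" evicts "b", since "a" has been used more often.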
The module also grants access to the corresponding file through a separate link provided for each file.

Various Phases in Big Data Processing

Data Collection and Storage

This phase must support acquisition of data with low, predictable latency, both in capturing data and in executing short, simple queries. Large corpuses of data are kept in their original location (on disk) rather than being moved from one disk to another; Hadoop helps organize and integrate data on the original storage cluster. The infrastructure should deliver fast performance, scale to extreme data volumes, and enhance response times and decision making.

Data Filtering, Aggregation and Representation

Big Data is generally noisy and not entirely trustworthy, yet still valuable to extract.

Data Modeling and Analysis

New data needs to be analyzed in the context of the old, to provide new perspectives and solutions to old problems.

Query Processing and Information Extraction

This phase extracts useful information from 'raw' data through predictive modelling.

Interpretation and Visualization of Big Data

One of the best ways to present data is graphically, extracting the important information so as to communicate the meaning of the data. This is required to spot values that are generally not apparent when observing raw values, and it makes the data more interpretable by users.

Best Practices for Harnessing Big Data

- Use sandboxing.
- Dimensionalize the data.
- Embed analytics into the operational workflow/routine.
- Gather business requirements before gathering data.

Opportunities in Big Data

- Quantum computing
- Machine learning
- Healthcare

THANK YOU !!!