Interested in increasing your knowledge of the Big Data landscape? This course is for those new to data science and interested in understanding why the Big Data Era has come to be. It is for those who want to become conversant with the terminology and the core concepts behind big data problems, applications, and systems. It is for those who want to start thinking about how Big Data might be useful in their business or career. It provides an introduction to one of the most common frameworks, Hadoop, that has made big data analysis easier and more accessible -- increasing the potential for data to transform our world!
At the end of this course, you will be able to:
* Describe the Big Data landscape, including examples of real-world big data problems and the three key sources of Big Data: people, organizations, and sensors.
* Explain the V’s of Big Data (volume, velocity, variety, veracity, valence, and value) and why each impacts data collection, monitoring, storage, analysis and reporting.
* Get value out of Big Data by using a 5-step process to structure your analysis.
* Identify what is and what is not a big data problem, and be able to recast big data problems as data science questions.
* Provide an explanation of the architectural components and programming models used for scalable big data analysis.
* Summarize the features and value of core Hadoop stack components including the YARN resource and job management system, the HDFS file system and the MapReduce programming model.
* Install and run a program using Hadoop!
This course is for those new to data science. No prior programming experience is needed, although the ability to install applications and utilize a virtual machine is necessary to complete the hands-on assignments.
Hardware Requirements:
(A) Quad-core processor (VT-x or AMD-V support recommended), 64-bit; (B) 8 GB RAM; (C) 20 GB free disk space. How to find your hardware information: (Windows): open System by clicking the Start button, right-clicking Computer, and then clicking Properties; (Mac): open Overview by clicking the Apple menu and clicking "About This Mac." Most computers with 8 GB RAM purchased in the last 3 years will meet the minimum requirements. You will need a high-speed internet connection because you will be downloading files up to 4 GB in size.
Software Requirements:
This course relies on several open-source software tools, including Apache Hadoop. All required software can be downloaded and installed free of charge. Software requirements include: Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+, or CentOS 6+; VirtualBox 5+.

CB

Very interesting course. Explained concepts I'd heard of but didn't really know about. It is a foundation course for a specialization, it's not enough by itself but very good as a foundation course.

From the lesson

Data Science: Getting Value out of Big Data

We love science and we love computing, don't get us wrong. But the reality is we care about Big Data because it can bring value to our companies, our lives, and the world. In this module we'll introduce a 5 step process for approaching data science problems.

Instructors

Ilkay Altintas

Chief Data Science Officer

Amarnath Gupta

Director, Advanced Query Processing Lab

Transcript

Step 3: Analyzing Data. Now that you have your data nicely prepared, the next step is to analyze the data. Data analysis involves building a model from your data, which is called input data. The input data is used by the analysis technique to build a model, and what your model generates is the output data. There are different types of problems, and so there are different types of analysis techniques. The main categories of analysis techniques are classification, regression, clustering, association analysis, and graph analysis. We will describe each one.

In classification, the goal is to predict the category of the input data. An example of this is predicting the weather as being sunny, rainy, windy, or cloudy. Another example is to classify a tumor as either benign or malignant. In this case, the classification is referred to as binary classification, since there are only two categories. But you can have many categories as well, as in the weather prediction problem shown here, which has four categories. Another example is to identify handwritten digits as being in one of the ten categories from zero to nine.

When your model has to predict a numeric value instead of a category, the task becomes a regression problem. An example of regression is to predict the price of a stock. The stock price is a numeric value, not a category, so this is a regression task instead of a classification task. Other examples of regression are estimating the weekly sales of a new product and predicting the score on a test.

In clustering, the goal is to organize similar items into groups. An example is grouping a company's customer base into distinct segments for more effective targeted marketing, like seniors, adults, and teenagers, as we see here. Another such example is identifying areas of similar topography, like mountains, deserts, and plains, for land use applications. Yet another example is determining different groups of weather patterns, like rainy, cold, or snowy.
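To make the classification idea concrete, here is a minimal sketch (plain Python, with invented toy weather data) of a nearest-centroid classifier: it averages the labeled samples in each category and assigns a new sample to the category whose average it is closest to. The feature values and labels are assumptions for illustration, not data from the course.

```python
# Toy nearest-centroid classifier: predict a category for new input data
# by comparing it to the average ("centroid") of each labeled group.

def centroid(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def classify(sample, labeled_data):
    """Return the label whose centroid is closest to the sample."""
    best_label, best_dist = None, float("inf")
    for label, points in labeled_data.items():
        cx, cy = centroid(points)
        dist = (sample[0] - cx) ** 2 + (sample[1] - cy) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical (temperature, humidity) samples per weather category
training = {
    "sunny": [(30, 20), (32, 25), (28, 30)],
    "rainy": [(18, 85), (16, 90), (20, 80)],
}

print(classify((17, 88), training))  # a cool, humid day -> "rainy"
```

The same shape of code, with a numeric output instead of a label, would be a (very crude) regression; real analyses would use a library such as scikit-learn rather than hand-rolled distance code.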
The goal in association analysis is to come up with a set of rules to capture associations within items or events. The rules are used to determine when items or events occur together. A common application of association analysis is known as market basket analysis, which is used to understand customer purchasing behavior. For example, association analysis can reveal that banking customers who have certificate of deposit accounts, or CDs, also tend to be interested in other investment vehicles, such as money market accounts. This information can be used for cross-selling: if you advertise money market accounts to your customers with CDs, they're likely to open such an account. According to data mining folklore, a supermarket chain used association analysis to discover a connection between two seemingly unrelated products. They discovered that many customers who go to the supermarket late on Sunday night to buy diapers, likely fathers, also tend to buy beer. This information was then used to place beer and diapers close together, and the chain saw a jump in sales of both items. This is the famous diaper-beer connection.

When your data can be transformed into a graph representation with nodes and links, you want to use graph analytics to analyze your data. This kind of data comes about when you have a lot of entities and connections between those entities, as in social networks. Some examples where graph analytics can be useful are exploring the spread of a disease or epidemic by analyzing hospitals' and doctors' records, identifying security threats by monitoring social media, email, and text data, and optimizing mobile telecommunications network traffic to ensure call quality and reduce dropped calls.

Modeling starts with selecting one of the techniques we listed as the appropriate analysis technique, depending on the type of problem you have.
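The core of the market basket idea above is counting how often items co-occur in the same transaction. Here is a minimal sketch in plain Python; the transactions are invented for illustration, and real association analysis (e.g. the Apriori algorithm) would also compute confidence and mine rules, which this sketch omits.

```python
# Toy market-basket co-occurrence count: the "support" of an item pair is
# the fraction of transactions in which both items appear together.
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

support = {pair: c / len(transactions) for pair, c in pair_counts.items()}
print(support[("beer", "diapers")])  # 0.75: the pair appears in 3 of 4 baskets
```

A high-support pair like this is what would suggest a rule such as "customers who buy diapers also buy beer."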
Then you construct the model using the data you've prepared. To validate the model, you apply it to new data samples. This is to evaluate how well the model does on data that was not used to construct it. The common practice is to divide the prepared data into a set for constructing the model, reserving some of the data for evaluating the model after it has been constructed. You can also use new data prepared the same way as the data that was used to construct the model. Evaluating the model depends on the type of analysis technique you used. Let's briefly look at how to evaluate each technique. For classification and regression, you will have the correct output for each sample in your input data. Comparing the correct output with the output predicted by the model provides a way to evaluate the model. For clustering, the groups resulting from clustering should be examined to see if they make sense for your application. For example, do the customer segments reflect your customer base? Are they helpful for use in your targeted marketing campaigns? For association analysis and graph analysis, some investigation will be needed to see if the results are correct. For example, network traffic delays need to be investigated to see whether what your model predicts is actually happening, and whether the sources of the delays are where they are predicted to be in the real system. After you have evaluated your model to get a sense of its performance on your data, you will be able to determine the next steps. Some questions to consider are: should the analysis be performed with more data in order to get better model performance? Would using different data types help? For example, in your clustering results, is it difficult to distinguish customers from distinct regions? Would adding zip code to your input data help to generate finer-grained customer segments? Do the analysis results suggest a more detailed look at some aspect of the problem?
For example, predicting sunny weather gives very good results, but rainy weather predictions are just so-so. This means that you should take a closer look at your examples for rainy weather. Perhaps you just need more samples of rainy weather, or perhaps there are some anomalies in those samples. Or maybe there is some missing data that needs to be included in order to completely capture rainy weather. The ideal situation would be that your model performs very well with respect to the success criteria that were determined when you defined the problem at the beginning of the project. In that case, you're ready to move on to communicating and acting on the results that you obtained from your analysis. As a summary, data analysis involves selecting the appropriate technique for your problem, building the model, then evaluating the results. As there are different types of problems, there are also different types of analysis techniques.
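The hold-out evaluation described above, reserving part of the prepared data, building the model on the rest, and comparing predictions with the correct outputs, can be sketched as follows. The data and the "model" (a simple learned threshold) are toy stand-ins invented for illustration.

```python
# Hold-out evaluation sketch: split prepared data into a construction
# (training) set and a reserved evaluation (test) set, then measure
# accuracy on the reserved data the model never saw.
import random

# Hypothetical prepared data: numbers labeled by a simple rule
data = [(x, "big" if x > 50 else "small") for x in range(100)]
random.seed(0)
random.shuffle(data)

train, test = data[:80], data[80:]  # 80% to build, 20% to evaluate

# Toy "model": learn a threshold as the midpoint of the class means
big_vals = [x for x, y in train if y == "big"]
small_vals = [x for x, y in train if y == "small"]
threshold = (sum(big_vals) / len(big_vals) + sum(small_vals) / len(small_vals)) / 2

predict = lambda x: "big" if x > threshold else "small"

# Accuracy: fraction of reserved samples where prediction matches the
# correct output -- the comparison the transcript describes
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(round(accuracy, 2))
```

If the accuracy fell short of the success criteria defined at the start of the project, this is the point where you would revisit the data or the technique, as discussed above.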