Welcome to the Cloud Computing Applications course, the second part of a two-course series designed to give you a comprehensive view of the world of Cloud Computing and Big Data!
In this second course we continue Cloud Computing Applications by exploring how the Cloud opens up data analytics of huge volumes of data that are static or streamed at high velocity and represent an enormous variety of information. Cloud applications and data analytics represent a disruptive change in the ways that society is informed by, and uses, information.

We start the first week by introducing some major systems for data analysis, including Spark, and the major frameworks and distributions of analytics applications, including Hortonworks, Cloudera, and MapR. By the middle of week one we introduce HDFS, the distributed and robust file system used in many applications such as Hadoop, and finish week one by exploring the powerful MapReduce programming model and how distributed operating systems like YARN and Mesos support a flexible and scalable environment for Big Data analytics.

In week two, our course introduces large-scale data storage and the difficulties and problems of consensus in enormous stores that use vast numbers of processors, memories, and disks. We discuss eventual consistency, ACID, and BASE, and the consensus algorithms used in data centers, including Paxos and ZooKeeper. Our course presents distributed key-value stores and in-memory databases like Redis used in data centers for performance. Next we present NoSQL databases. We visit HBase, the scalable, low-latency database that supports database operations in applications that use Hadoop. We then show how Spark SQL can program SQL queries on huge data. We finish up week two with a presentation on distributed publish/subscribe systems using Kafka, a distributed log messaging system that is finding wide use in connecting Big Data and streaming applications together to form complex systems.

Week three moves to fast data and real-time streaming, and introduces Storm, a technology used widely at companies such as Yahoo.
We continue with Spark Streaming, the Lambda and Kappa architectures, and a presentation of the streaming ecosystem. Week four focuses on graph processing, machine learning, and deep learning. We introduce the ideas of graph processing and present Pregel, Giraph, and Spark GraphX. Then we move to machine learning, with examples from Mahout and Spark; K-Means, Naive Bayes, and frequent pattern mining (FPM) are given as examples. Spark ML and MLlib continue the theme of programmability and application construction. The last topic we cover in week four introduces deep learning technologies, including Theano, TensorFlow, CNTK, MXNet, and Caffe on Spark.

MS

Very good introduction of application concepts of cloud data computing. Thank You!

UU

Oct 31, 2016

5 stars

good things to learn about real world big problems

From the lesson

Module 4: Graph Processing and Machine Learning

In this module, we discuss the applications of Big Data. In particular, we focus on two topics: graph processing, where massive graphs (such as the web graph) are processed for information, and machine learning, where massive amounts of data are used to train models such as clustering algorithms and frequent pattern mining. We also introduce you to deep learning, where large data sets are used to train neural networks with effective results.

Instructors

Reza Farivar

Roy H. Campbell

Professor of Computer Science

Video transcript

The application we're going to use for exploring machine learning is called Mahout. It runs on top of Hortonworks again, and we'll use Hadoop and various other tools, HDFS and so on, in the Cloud system, so you'll be building your knowledge up. Mahout is a sort of standard tool: a bunch of packages that allow you to do all of the different machine learning algorithms that we discussed earlier on. I'm going to give you several examples of how that operates, but first let's just look at what Mahout is. It's again an Apache Foundation project, and it's a library project in this case: a set of libraries that you can call from an interface to do all of this machine learning. Now, why Mahout? It combines lots of different open source communities together to build a community, to build examples, and to provide scalability. Mahout, if you like, was a community response to the fact that people want to do machine learning, they want to do it on distributed machines, and they asked: can you provide libraries to do that? That's exactly what we've got here. If you look at the structure of Mahout, you will find that we have a set of libraries, an application, and an interface. The applications work through the interface, just a standard sort of API again, and they pick out one of these different components, the libraries, to do whatever is required. Just to go through those so that you know what they are: we've got genetic algorithms, frequent pattern mining, classification, and clustering. Recommenders are for when you have people recommending, say, restaurants, and you want to rank all of those restaurants. You have utilities like Lucene and vectorization; you have mathematical components so that you can work with vectors, matrices, and SVDs; and you have collections. Apache Hadoop is one of the components underneath that Mahout uses. So how scalable is it?
Well, it's built with the same principles as the rest of Hadoop. It's as fast and efficient as it's possible to be using the current technology, and it uses distributed algorithms so that it can scale. Again, the emphasis is not necessarily on providing the best, most optimized system on a particular machine; it's on providing a distributed environment that does what you want, so that you can scale it to any level of computation you need and get the results you want. Most Mahout implementations use MapReduce, and work continues to improve Mahout, with people continually adding different libraries. You can visit the Apache site to find out exactly how that's all happening. So let's look at collaborative filtering as an example of what Mahout can provide. We'll go through the different types of machine learning opportunities that Mahout provides, and after that we'll look at specific examples. This is collaborative filtering. It can provide recommendations; it's user based, referring to items: users make recommendations about items. It has online and offline support. In the diagram here, a particular user, Mr. A, comes in with his own set of preferences about what he would like to do, eat, see, buy, or read. These preferences are compared with the preferences of other users in the system, which come out of a database. Their preferences are matched against his, and then he's provided with a set of recommendations. Clustering is yet another application of Mahout. Suppose you've got lots of different measurements, (x, y) measurements; perhaps they're ants in the garden, or stars in the sky. What you want to do is say they form groups, and you would like to classify them as groups. So this might be one sort of galaxy and this might be another, or alternatively this is one set of ants and that's another, or whatever you care to do. There are different ways to classify them.
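To make the user-based collaborative filtering idea concrete, here is a minimal sketch in plain Python. It is not Mahout's API; the users, items, and ratings are invented for illustration. Mr. A's preferences are compared against other users' preferences, and items he hasn't rated are scored by similarity-weighted ratings:

```python
from math import sqrt

# Hypothetical user -> item ratings, standing in for the preference database.
ratings = {
    "A":   {"pizza": 5, "sushi": 1, "tacos": 4},
    "Bob": {"pizza": 4, "sushi": 2, "tacos": 5, "salad": 4},
    "Eve": {"pizza": 1, "sushi": 5, "salad": 5},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    """Rank items the user hasn't rated by similarity-weighted scores."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        s = similarity(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + s * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("A"))  # -> ["salad"]
```

Production systems (including Mahout's recommenders) differ in the similarity measure, neighborhood selection, and scale, but the matching-then-recommending flow is the same as in the diagram.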
The most popular one is K-Means. There is a variant of K-Means based on fuzzy probabilities called Fuzzy K-Means, there is density-based clustering, and there are other approaches. But one of the more popular ones is just to take the distance between the points in these diagrams as a Euclidean measure. So if you're going to go from one point to another, what you end up with is the square root of the sum of the squared differences in the two dimensions, the Euclidean distance. You use that to indicate that, if points are close to each other, they belong in one cluster; if the distance is very large, then they are in different clusters. What we can then do is partition the points into separate groups. Each of those separate groups would represent some particular property; perhaps there's some reason why they're all clustered together, and that gives you more information about the data points you have. Another technique is classification, which puts new items into predefined categories. For example, you might have the categories sports, politics, and entertainment, and ask how you would classify a particular item. It could be a news item: perhaps you look at the keywords in that item and say, okay, that's close to sports, so I'll put it in the sports category, or this one's close to entertainment, so I'll put it in the entertainment category. It could also build recommenders: if you say you want to see all the sports news, then it will be able to recommend what to look at from the news point of view. There are lots of different implementations you can use: Naive Bayes, Complementary Naive Bayes, Decision Forests, Linear Regression. So Mahout has classification techniques for all of these. And last, frequent pattern mining. Here we've got a shopping cart: you go around and buy things, and what we can do is identify what people typically buy. So they might go in and buy milk, bread, and cheese, and then they might have some extra items around those.
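The K-Means idea above, assign each point to its nearest centroid by Euclidean distance and then recompute the centroids, can be sketched in a few lines of Python. This is a toy version, not Mahout's distributed implementation, and the (x, y) points are invented to mimic two groups of "ants in the garden":

```python
import random
from math import dist  # Euclidean distance between two points (Python 3.8+)

def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means: assign points to the nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[i].append(p)
        # New centroid = mean of each cluster (keep the old one if empty).
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two visibly separate groups of (x, y) measurements.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)  # each group of 3 ends up in one cluster
```

Mahout's version distributes the assign/recompute steps over MapReduce, but the per-iteration logic is the same as this loop.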
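The keyword-based news classification described above is essentially what Naive Bayes does. Here is a minimal sketch, with a tiny invented training corpus (it is not Mahout's Naive Bayes, just the same idea): count word frequencies per category, then score a new item by combining the category prior with per-word likelihoods.

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical labeled headlines standing in for a news corpus.
train = [
    ("sports", "team wins the final match"),
    ("sports", "player scores a goal"),
    ("politics", "senate passes the new bill"),
    ("politics", "president signs the bill"),
]

word_counts = defaultdict(Counter)
doc_counts = Counter()
for label, text in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the label maximizing log P(label) + sum of log P(word | label),
    with add-one (Laplace) smoothing for unseen words."""
    best, best_score = None, float("-inf")
    total_docs = sum(doc_counts.values())
    for label in doc_counts:
        n = sum(word_counts[label].values())
        score = log(doc_counts[label] / total_docs)
        for w in text.split():
            score += log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("player wins the match"))  # -> "sports"
```

A new item lands in whichever category its keywords are "closest" to, exactly the sports-versus-entertainment decision described in the lecture.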
We're trying to identify the common items. So if somebody buys bread, then you could offer them milk and cheese, because most people buy those together; that could be very helpful to the person as a reminder of what to buy. It can also speed things up, and it can drive a great deal of optimization in where you place the milk, bread, and cheese in the supermarket. So there are a lot of different benefits. Here we've got several different techniques, all provided by Mahout. Now what we're going to do is look at an example.
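The shopping-cart example can be sketched as simple itemset counting, the core step of frequent pattern mining. The baskets here are invented, and this counts every candidate itemset directly rather than using Mahout's FP-growth, but it shows how "people who buy bread also buy milk and cheese" falls out of support counts:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread", "cheese"},
    {"milk", "bread"},
    {"bread", "cheese", "eggs"},
    {"milk", "bread", "cheese", "apples"},
]

def frequent_itemsets(baskets, size, min_support=2):
    """Count every itemset of the given size across all baskets and
    keep those appearing at least min_support times."""
    counts = Counter()
    for basket in baskets:
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

pairs = frequent_itemsets(baskets, size=2)
# ("bread", "milk") and ("bread", "cheese") each appear in 3 of 4 baskets,
# so a shopper buying bread can be offered milk and cheese.
```

Real miners like FP-growth avoid enumerating all combinations, which matters once baskets and catalogs get large, but the output, frequent itemsets above a support threshold, is the same kind of result.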