Combiner

Have you ever heard of technologies such as HDFS, MapReduce, or Spark? Have you always wanted to learn these tools but lacked concise starting material? Then don't miss this course!
In this 6-week course you will:
- learn some basic technologies of the modern Big Data landscape, namely: HDFS, MapReduce and Spark;
- be guided both through systems internals and their applications;
- learn about distributed file systems, why they exist and what function they serve;
- grasp the MapReduce framework, a workhorse for many modern Big Data applications;
- apply the framework to process texts and solve sample business cases;
- learn about Spark, the next-generation computational framework;
- build a strong understanding of Spark basic concepts;
- develop skills to apply these tools to creating solutions in finance, social networks, telecommunications and many other fields.
Your learning experience will be as close to real life as possible, with the chance to evaluate your practical assignments on a real cluster. No mocking: just a friendly, considerate atmosphere to make the process of your learning smooth and enjoyable.
Get ready to work with real datasets alongside real masters!
Special thanks to:
- Prof. Mikhail Roytberg, APT dept., MIPT, who was the initial reviewer of the project and the supervisor and mentor of half of the BigData team. He was the one who helped get this show on the road.
- Oleg Sukhoroslov (PhD, Senior Researcher at IITP RAS), who has been teaching MapReduce, Hadoop and friends since 2008. Now he is leading the infrastructure team.
- Oleg Ivchenko (PhD student at the APT dept., MIPT), Pavel Akhtyamov (MSc. student at the APT dept., MIPT) and Vladimir Kuznetsov (Assistant at P.G. Demidov Yaroslavl State University), superbrains who have developed and now maintain the infrastructure used for practical assignments in this course.
- Asya Roitberg, Eugene Baulin, Marina Sudarikova. These people never sleep, babysitting this course day and night to make your learning experience productive, smooth and exciting.

Instructors

Ivan Puzyrevskiy

Emeli Dral

Evgeniy Riabenko

Alexey A. Dral

Pavel Mezentsev

Transcript

Hello. The goal of this lesson is to teach you how to tune a MapReduce application. The MapReduce word count application is a really good example: it has quite a number of parameters that you may never have thought of before. Let us consider the following line of input: a word, the same word again, then the word "a", and so on. If you use the simplest implementation, the mapper will emit a separate (word, 1) pair for every occurrence. All this data will be serialized to the local disk before and after the transmission over the network during the shuffle and sort phase. As you can see, there can be a lot of repetition, so you had better squash the repeated items. For instance, instead of the pairs (word, 1), (word, 1) you can emit (word, 2). This approach can dramatically reduce disk I/O and network bandwidth usage. You can do this easily with a small piece of code (a sketch of the idea is given below). I hope you are familiar with the standard Python collections module; otherwise, you are likely to learn something new and handy. If you use this mapper instead of the old one, you will see an improvement in the calculations.

But you can be even more aggressive: you can combine the output of several map function calls. This functionality even has a special name in the Hadoop MapReduce framework: it is called a combiner. On this slide, you can see how several items are squashed into one. To be more precise, a combiner has the following interface: it expects input in the form of the reducer input, and it has the same output signature as a mapper. So a combiner can be applied an arbitrary number of times between the map and reduce phases. In the word count application, there is no difference between the combiner and the reducer, so you can simply pass the reducer as the combiner when launching the job (an example invocation is sketched below). When the job finishes, you will be able to see in the counters how many records were processed by the combiner.

In our word count example, we used the same reducer in place of the combiner. You are able to do this because the types of the key-value pairs produced by the reducer and the types of the intermediate key-value pairs are the same. But this is not a mandatory requirement; sometimes you need to write your own combiner with a different signature. I am going to show you an example of how to speed up the computation of mean values with the help of a combiner. Imagine you have the same Wikipedia sample, and now you are going to count how many times, on average, you see a word in an article. For simplicity, you are going to average over the number of articles containing this or that word. There are no changes in our mapper.py: you just count the words in an article and print them out. From the reducer's point of view, you have to keep track of not only the number of occurrences but also the number of articles, so that you will be able to average over them. When you try to use a combiner, you face a dilemma: you cannot just average over a partial output. If you do, you lose the information about how many articles have been processed, and therefore the outcome of the reducer will not be correct. Let me change the mapper value type to a pair containing the number of articles processed and the cumulative number of words. In this case, you can easily derive the mean value by dividing the cumulative number of words by the number of articles. Here are the corresponding changes in reducer.py: the code sums up each component of the pair (see the sketch below). This helps to speed up the calculations for the whole MapReduce job, as you will use fewer I/O resources.
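The exact code shown in the video is not reproduced in this transcript. As a rough illustration, here is a minimal sketch of a Hadoop Streaming mapper that squashes repeated words with the standard collections module; the tokenization rule and the tab-separated output format are assumptions, not the course's exact choices.

```python
#!/usr/bin/env python
# mapper.py -- word count mapper that squashes repeated words per line (sketch)
import re
import sys
from collections import Counter

for line in sys.stdin:
    # Tokenization rule is an assumption; the course may split words differently
    words = re.findall(r"\w+", line.lower())
    # Count repetitions locally instead of emitting (word, 1) for every occurrence
    for word, count in Counter(words).items():
        # Hadoop Streaming expects tab-separated key/value pairs on stdout
        print("%s\t%d" % (word, count))
```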
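The launch arguments are also not reproduced here. A typical Hadoop Streaming invocation that reuses the word count reducer as the combiner looks roughly like the following; the jar path, script names and HDFS directories are assumptions.

```bash
yarn jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -combiner "python reducer.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/articles \
    -output wordcount_result
```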
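For the mean-value example, the course's own reducer.py is likewise not shown in the transcript. The sketch below only illustrates the idea of carrying a (number of articles, cumulative word count) pair through the combiner so that the reducer can still compute a correct mean; the tab-separated record format and the assumption that the mapper emits "word TAB 1 TAB count-in-article" are illustrative, not the course's exact code.

```python
#!/usr/bin/env python
# combiner.py -- sums partial (articles, words) pairs for each word (sketch)
# Assumes the mapper emits lines of the form: word <TAB> 1 <TAB> count_in_article
import sys

current_word = None
articles = 0
words = 0

def emit(word, articles, words):
    # Keep the pair intact so the reducer can still compute a correct mean later
    print("%s\t%d\t%d" % (word, articles, words))

for line in sys.stdin:
    word, art, cnt = line.strip().split("\t")
    if current_word is not None and word != current_word:
        emit(current_word, articles, words)
        articles, words = 0, 0
    current_word = word
    articles += int(art)
    words += int(cnt)

if current_word is not None:
    emit(current_word, articles, words)

# reducer.py would differ only in its final output: instead of the pair,
# it would print the mean, e.g. "%s\t%f" % (word, words / float(articles)).
```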
Another example is the median. To calculate the median value precisely, you have to have the whole dataset in one place, so a combiner is of no help in this case. Summing up, in this video you have learned how to construct a combiner's signature given a mapper and a reducer. You have also learned how to call a MapReduce application with a combiner from the CLI. You have seen an example of a Python combiner implementation, so you will be able to write your own. Finally, you have learned that it is not always possible to speed up calculations with a combiner.