Compression

Have you ever heard of technologies such as HDFS, MapReduce and Spark? Have you always wanted to learn these tools but lacked concise starting material? Then don't miss this course!
In this 6-week course you will:
- learn some basic technologies of the modern Big Data landscape, namely: HDFS, MapReduce and Spark;
- be guided through both system internals and their applications;
- learn about distributed file systems, why they exist and what function they serve;
- grasp the MapReduce framework, a workhorse for many modern Big Data applications;
- apply the framework to process texts and solve sample business cases;
- learn about Spark, the next-generation computational framework;
- build a strong understanding of Spark basic concepts;
- develop the skills to apply these tools to create solutions in finance, social networks, telecommunications and many other fields.
Your learning experience will be as close to real life as possible, with the chance to have your practical assignments evaluated on a real cluster. No mock-ups, just a friendly, considerate atmosphere to make your learning smooth and enjoyable.
Get ready to work with real datasets alongside real masters!
Special thanks to:
- Prof. Mikhail Roytberg, APT dept., MIPT, who was the initial reviewer of the project and the supervisor and mentor of half of the BigData team. He was the one who helped get this show on the road.
- Oleg Sukhoroslov (PhD, Senior Researcher at IITP RAS), who has been teaching MapReduce, Hadoop and friends since 2008. Now he is leading the infrastructure team.
- Oleg Ivchenko (PhD student, APT dept., MIPT), Pavel Akhtyamov (MSc student, APT dept., MIPT) and Vladimir Kuznetsov (Assistant at P.G. Demidov Yaroslavl State University), superbrains who have developed and now maintain the infrastructure used for practical assignments in this course.
- Asya Roitberg, Eugene Baulin, Marina Sudarikova. These people never sleep, babysitting this course day and night to make your learning experience productive, smooth and exciting.

Taught by

Ivan Puzyrevskiy

Emeli Dral

Evgeniy Riabenko

Alexey A. Dral

Pavel Mezentsev

Transcript

Hello! In this video, I will give you a brief description of one more tuning parameter which can dramatically increase the performance of your MapReduce application. That is compression, of course, and the guy in the fifth row gets the prize for his right answer. The prize is a ticket in the first row. Congrats, young man!

You can balance processing speed and storage capacity with data compression. In the first week, my colleague Ivan provided you with a framework to analyze data compression algorithms. I am going to remind you of the basic concepts and add something you may not have taken into consideration: data transfer in Hadoop MapReduce.

Data compression is essentially a trade-off between the disk I/O required to read and write data, the network bandwidth required to send data across the network, and the in-memory calculation capacity, where the in-memory calculation capacity is a composite of the speed and usage of CPU and RAM. The correct balance of these factors depends on the characteristics of your cluster, your data, your applications, your usage patterns and, of course, the weather forecast.

Data located in HDFS can be compressed; nobody expected that. What is more, there is a shuffle and sort phase between map and reduce where you can compress the intermediate data. This is exactly the place where your optimization skills can unfold. You only need to keep the following comparison table for reference and reserve enough time for experiments. The "splittable" column means that you can cut a file at any place and find the location of the next or the previous valid record; this is useful, for instance, to parallelize the work across mappers.

Of course, all the compression formats have their own pros and cons. The DEFLATE compression/decompression algorithm is used in both DEFLATE and gzip files; a gzip file is a DEFLATE file with extra headers and a footer. bzip2 is more aggressive about space requirements, but consequently it is slower during compression. The major benefit of bzip2 files compared to gzip ones is that they are splittable. LZO files can be used in a common Hadoop scenario where you read data far more frequently than you write it, and you can provide index files for LZO files to make them splittable. There is an even faster decompression algorithm called Snappy, but as you can see it has its own price: you will not be able to split such files at record boundaries. As a side note, the native libraries that provide implementations of compression and decompression functionality usually also support an option to choose a trade-off between speed and space optimization.

A Hadoop codec is an implementation of a compression/decompression algorithm, and there are a number of built-in codecs for the aforementioned compression formats. You can specify the compression parameters for intermediate data, for output, or for both; the CLI arguments to tune a MapReduce application are shown on the slide. Running tests is essential to see which options are the most suitable for your data processing patterns.

Here are several rules of thumb. gzip or bzip2 are a good choice for cold data, which is accessed infrequently; bzip2 produces better compression than gzip for some kinds of files, at the cost of some speed when compressing and decompressing. Snappy or LZO are a better choice for hot data, which is accessed frequently, and Snappy often performs better than LZO. For MapReduce, you can use bzip2 and LZO if you would like your data to be splittable; Snappy and gzip are not splittable with file-level compression.
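Since the slide with the exact CLI arguments is not reproduced here, the following is a minimal driver sketch of how those compression parameters are typically set, assuming the standard Hadoop 2 property names (`mapreduce.map.output.compress` and friends); the class name and the choice of codecs are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobDriver { // hypothetical driver class
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate (map output) data for the shuffle phase;
        // Snappy favours speed over compression ratio, which suits hot data.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed job");
        job.setJarByClass(CompressedJobDriver.class);
        // ... set mapper, reducer and key/value classes here ...

        // Compress the final output; gzip is a reasonable choice for
        // cold data that is read infrequently.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If your driver uses ToolRunner, the same properties can instead be passed on the command line via generic options such as `-D mapreduce.map.output.compress=true`, which is presumably the form shown on the slide.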
For the formats that are not splittable at the file level, you can instead use block-level compression inside splittable container formats such as Avro or SequenceFile; in that case MapReduce can process the blocks in parallel. Now you know how to tune your MapReduce application and what compression options are available.
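To make the container-format approach concrete, here is a minimal sketch of writing a block-compressed SequenceFile. The output path and key/value types are hypothetical, and Snappy requires the native libraries to be available on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class BlockCompressedWriter { // hypothetical example class
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("wordcounts.seq"); // hypothetical output path

        // Instantiate the codec via ReflectionUtils so it is configured.
        CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

        // BLOCK compression compresses batches of records together, so the
        // container stays splittable even though raw Snappy files are not.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, codec))) {
            writer.append(new Text("hadoop"), new IntWritable(42));
        }
    }
}
```

A MapReduce job reading such a file with SequenceFileInputFormat can then split the input across mappers at block boundaries.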