Katsov's article starts with a fairly straightforward use of MapReduce as a general-purpose parallel execution framework, which applies to many implementations that need large clusters for compute- and data-intensive calculations, including physical and engineering simulations, numerical analysis, and performance testing. The next group of algorithms, commonly used in log analysis, ETL, and data querying, includes counting and summing, data collating (based on specific functions), filtering, parsing, validation, and sorting.
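To make the counting and summing pattern concrete, here is a minimal sketch in Python rather than the article's own pseudocode. It simulates the map, shuffle, and reduce phases in memory; the `run_mapreduce` helper, the log line format, and the field names are assumptions made for illustration and are not taken from the article or from Hadoop's API.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)  # Shuffle: collect values per key.
    # Reduce phase: one call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}

def status_mapper(log_line):
    # Hypothetical log format: "<path> <status> <bytes>".
    _path, status, size = log_line.split()
    yield status, int(size)

def sum_reducer(_key, values):
    # Summing: total bytes served per status code.
    return sum(values)

logs = ["/a 200 10", "/b 404 5", "/c 200 7"]
bytes_by_status = run_mapreduce(logs, status_mapper, sum_reducer)
print(bytes_by_status)
```

Emitting `1` instead of the byte count turns the same job into a plain counter, which is why counting and summing are usually treated as one pattern.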

The second large group of MapReduce patterns discussed by Katsov comprises relational patterns, often used by data warehousing applications. These patterns are widely leveraged by Hive and Pig implementations and include predicate- or function-based data selection, data projection, data union, difference, and intersection, as well as groupBy aggregations. A separate discussion is dedicated to implementing data joins and includes such algorithms as repartition joins and replicated joins.
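As an illustration of the repartition (reduce-side) join mentioned above, the following Python sketch groups records from both inputs by the join key, tags them by source, and combines them per key. It is an in-memory simulation under assumed record shapes; the function name, the `"L"`/`"R"` tags, and the sample tables are hypothetical and do not come from the article.

```python
from collections import defaultdict
from itertools import product

def repartition_join(left, right, left_key, right_key):
    """Inner join of two lists of dicts, simulated as map -> shuffle -> reduce."""
    # Map + shuffle: every record is routed to its join key and tagged
    # with the input it came from.
    shuffled = defaultdict(lambda: {"L": [], "R": []})
    for rec in left:
        shuffled[rec[left_key]]["L"].append(rec)
    for rec in right:
        shuffled[rec[right_key]]["R"].append(rec)

    # Reduce: for each key, pair every left record with every right record.
    joined = []
    for key in sorted(shuffled):
        bucket = shuffled[key]
        for l_rec, r_rec in product(bucket["L"], bucket["R"]):
            joined.append({**l_rec, **r_rec})
    return joined

users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "cup"},
]
result = repartition_join(users, orders, "user_id", "user_id")
print(result)
```

A replicated join would instead load the smaller input into memory on every mapper and skip the shuffle, trading memory for network traffic.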

Moving further up the chain of complexity, the article discusses more advanced MapReduce processing algorithms, including graph processing, search algorithms (breadth-first search), PageRank, and data aggregation algorithms that can be leveraged in graph analysis, web indexing, and general search applications. It also covers common text analysis and market analysis use cases that require cross-correlation calculation. This part covers both the "pairs" and "stripes" design patterns and their comparative merits.
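The difference between "pairs" and "stripes" can be sketched for a co-occurrence count over shopping baskets. This Python example is a simplified in-memory illustration, not the article's pseudocode; the function names and the sample baskets are assumptions. With pairs, the mapper emits one count per ordered item pair; with stripes, it emits one map of neighbours per item, and the reducer merges those maps element-wise.

```python
from collections import Counter, defaultdict
from itertools import permutations

def cooccurrence_pairs(baskets):
    """Pairs: the key is the (item, neighbour) tuple, the value is a count."""
    counts = Counter()
    for basket in baskets:
        for a, b in permutations(sorted(set(basket)), 2):
            counts[(a, b)] += 1  # Mapper emits ((a, b), 1); reducer sums.
    return counts

def cooccurrence_stripes(baskets):
    """Stripes: the key is the item, the value is a map of neighbour counts."""
    stripes = defaultdict(Counter)
    for basket in baskets:
        items = set(basket)
        for a in items:
            # Mapper emits (a, stripe) once per item in the basket.
            stripe = Counter(b for b in items if b != a)
            # Reducer merges stripes for the same item element-wise.
            stripes[a].update(stripe)
    return stripes

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["eggs", "milk"]]
pairs = cooccurrence_pairs(baskets)
stripes = cooccurrence_stripes(baskets)
print(pairs[("milk", "bread")], dict(stripes["milk"]))
```

Both functions produce the same counts. Pairs generates many small intermediate records, while stripes generates fewer but larger ones that must fit in memory, which is the usual trade-off between the two.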

Finally, Katsov provides a good bibliography of more complex MapReduce implementations in the field of machine learning.

Most of the algorithms described in the article are accompanied by pseudocode, basic information on their applicability, advantages, and disadvantages, and some real-world use cases.

Many people today are still struggling with the applicability of Hadoop and MapReduce to their business problems. Some still consider it a "technical approach in search of a business problem". The article is an important step in filling an existing void in the field of MapReduce algorithms, use cases, and design patterns. It shows MapReduce’s power far beyond the infamous "word count" example and the ways it can be leveraged to solve a wide range of practical problems.