Compute Mean Value with MapReduce

This example shows how to compute the mean of a single variable in a data set using mapreduce. It demonstrates a simple use of mapreduce with one key, minimal computation, and an intermediate state (accumulating intermediate sum and count).

Prepare Data

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, select ArrDelay (flight arrival delay) as the variable of interest.

The datastore treats 'NA' values as missing, and replaces the missing values with NaN values by default. Additionally, the SelectedVariableNames property allows you to work with only the selected variable of interest, which you can verify using preview.

preview(ds)

ans =
8x1 table
ArrDelay
________
8
8
21
13
4
59
3
11

Run MapReduce

The mapreduce function requires a map function and a reduce function as inputs. The mapper receives chunks of data and outputs intermediate results. The reducer reads the intermediate results and produces a final result.

In this example, the mapper finds the count and sum of the arrival delays in each chunk of data. The mapper then stores these values as the intermediate values associated with the key 'PartialCountSumDelay'.

The reducer accepts the count and sum for each chunk stored by the mapper. It sums up the values to obtain the total count and total sum. The overall mean arrival delay is a simple division of the values. mapreduce only calls this reducer once, since the mapper only adds a single unique key. The reducer uses add to add a single key-value pair to the output.