Parallel Computing Exercises: Foreach and DoParallel (Part-2)

In general, foreach is a statement for iterating over items in a collection without using any explicit counter. In R, it is also a way to run code in parallel, which may be more convenient and readable than the sfLapply function (considered in the previous set of exercises of this series) or other apply-like functions.
Apart from being able to run code in parallel, R’s foreach differs from the standard for loop in several other ways. Specifically, the foreach statement:

allows iterating over several variables simultaneously,

returns a value (a list, a vector, a matrix, or another object),

can skip some iterations based on a condition (the last two properties make it similar to list comprehensions, which are present in Python and some other languages),

has a special syntax that includes operators %do% (see an example in Exercise 1), %dopar%, and %:%.
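As a rough illustration of the first two properties, the sketch below contrasts a for loop with a foreach statement. The variable names (a, b, products) and the input vectors are chosen only for this example and are not part of the exercises.

```r
library(foreach)

# A for loop returns nothing useful, so results have to be stored by hand
loop_products <- numeric(3)
for (k in 1:3) {
  loop_products[k] <- k * (k + 3)
}

# foreach iterates over two variables at once (a and b advance together)
# and returns the value of the expression for every iteration as a list
products <- foreach(a = 1:3, b = 4:6) %do% (a * b)

print(unlist(products))  # 4 10 18
print(loop_products)     # 4 10 18
```

Here the pairs (1, 4), (2, 5) and (3, 6) are processed in turn, so no explicit index is needed.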

The first six exercises in this set give practice in basic operations with the foreach statement, and the last four show how to run it in parallel using multiple CPU cores on one machine. The task will be to parallelize identical operations on a set of files (the zipped data files can be downloaded here). It is assumed that your computer has two or more CPU cores.
The exercises require the packages foreach, doParallel, and parallel. The first two packages have to be installed, and the last one comes with the standard R distribution. The packages doParallel and parallel are necessary to run foreach in parallel.
For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.

Exercise 1
The foreach function (from the package of the same name) is typically used as a part of a special statement. In its simple form, the statement looks like this:

result <- foreach(i = 1:3) %do% sqrt(i)

The statement above consists of three parts:

foreach(i = 1:3) – a call to the foreach function, with an argument that includes an iteration variable (i) and a sequence to be iterated over (1:3),

%do% – a special operator,

sqrt(i) – an R expression, which represents an operation to be performed over the iteration variable (this part of the statement is equivalent to the body of the loop).

The code iterates over the sequence, applies an operation defined in the expression to each element of the sequence, and stores the output in the result variable.
Note that if the expression extends over several lines, it has to be enclosed in curly braces. The use of the iteration variable is not mandatory: if you just want to repeat the expression n times without passing anything to it, you can pass only a sequence of length n as input to foreach.
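Both points from the note above can be sketched as follows. The names shifted and draws, as well as the seed value, are arbitrary choices made for this illustration.

```r
library(foreach)

set.seed(42)  # arbitrary seed, used only to make the draws repeatable

# A multi-line expression goes in curly braces; the value of its last
# line is what foreach collects for each iteration
shifted <- foreach(i = 1:3) %do% {
  squared <- i^2
  squared + 1
}
print(unlist(shifted))  # 2 5 10

# No iteration variable: the sequence 1:4 only sets the number of repetitions
draws <- foreach(1:4) %do% runif(1)
print(length(draws))  # 4
```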
In this exercise:

Run the code above, print the result object, and find to which class it belongs.

Use the foreach function to reverse the operation, i.e. write a line of code that receives the result object as input and outputs the original sequence. Print the sequence.

Exercise 2
The foreach function allows for the use of several iteration variables simultaneously. They are passed to the function as arguments, and are separated by commas.
Run the foreach function with two iteration variables to get a sequence of their sums. The variables have to iterate over a vector of integers from 1 to 3, and a vector of 5 integers of value 10. Print the result.
(Tip: if you want to use an arithmetic operator to calculate the sum then the expression must be placed in parentheses or curly braces).
What is the length of the resulting object? How does the function deal with vectors of different lengths?

Exercise 3
The package iterators provides several functions that can be used to create sequences for the foreach function. For example, the irnorm function creates an object that iterates over vectors of normally distributed random numbers. It is useful when you need to use random variables drawn from one distribution in an expression that is run in parallel.
In this exercise, use the foreach and irnorm functions to iterate over 3 vectors, each containing 5 random variables. Find the largest value in each vector, and print those largest values.
Before running the foreach function set the seed to 1234.
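To show how iterator objects plug into foreach without repeating the exercise itself, here is a small sketch that uses two other functions from the iterators package, irunif and icount. The variable names and the seed are assumptions made for this example.

```r
library(foreach)
library(iterators)

set.seed(2024)  # arbitrary seed for this illustration

# irunif(2, count = 3) yields three vectors, each with two uniform draws
spreads <- foreach(v = irunif(2, count = 3)) %do% (max(v) - min(v))
print(unlist(spreads))

# icount(3) simply counts 1, 2, 3
labels <- foreach(k = icount(3)) %do% paste0("draw_", k)
print(unlist(labels))  # "draw_1" "draw_2" "draw_3"
```

The irnorm function is used in the same way, with the arguments of rnorm in place of those of runif.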

Exercise 4
By default, the foreach function returns a list. But it can also return objects of other types. This requires changing the value of the .combine parameter of the function. This exercise gives practice in using this parameter.
As in the previous exercise, use the foreach and irnorm functions to iterate over 3 vectors, each containing 5 random variables. But now use an expression that returns all variables generated by irnorm. Pass the .combine parameter to the foreach function with value 'c'. Print the result, and find its class and length.
Then run the code again with the 'cbind' value assigned to the .combine parameter. Print the result, find its class and size.
Note that 'c' and 'cbind' are R functions from the base package. Other functions (including user-written ones) can be used as well to combine the outputs of the expression.
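The sketch below illustrates the .combine parameter on small deterministic inputs, including a user-written combining function. The function keep_larger and the input vectors are hypothetical and serve only this example.

```r
library(foreach)

# 'c' concatenates the per-iteration values into a vector
squares <- foreach(i = 1:3, .combine = "c") %do% i^2
print(squares)  # 1 4 9

# 'rbind' stacks the per-iteration vectors as the rows of a matrix
tab <- foreach(i = 1:3, .combine = "rbind") %do% c(i, i^2)
print(dim(tab))  # 3 2

# A user-written combiner (hypothetical): keep the element-wise maximum
keep_larger <- function(a, b) pmax(a, b)
inputs <- list(c(1, 8, 3), c(4, 2, 9), c(7, 5, 6))
largest <- foreach(v = inputs, .combine = keep_larger) %do% v
print(largest)  # 7 8 9
```

A combining function written this way receives the accumulated result and the next value, and returns the new accumulated result.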

Exercise 5
The results of the expression placed after the %do% operator can be combined in different ways. Look at the documentation for the foreach function to find what value has to be assigned to the .combine parameter to sum the values produced by the expression in each iteration.
Run the code used in the previous exercise with that value assigned to the .combine parameter, and print the result.
Before running the code set the seed to 1234.

Exercise 6
The sequence passed to the foreach function can be filtered so that the expression after %do% is applied only to a part of the sequence. This is done using a syntax like this:

result <- foreach(i = some_sequence) %:% when(i > 0) %do% sqrt(i)

Note that the %:% operator and the when function, which contains a Boolean expression involving the iteration variable, are added to a standard foreach statement.
Modify the example above to get a vector of logs of all even integers in the range from 1 to 10. Print the result.
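For reference, here is the filtering syntax from the example above applied to a concrete sequence. The input vector is an arbitrary choice for this sketch.

```r
library(foreach)

values <- c(-4, 0, 9, 16)  # arbitrary input for this illustration

# Only elements that satisfy the condition in when() reach the expression
roots <- foreach(i = values) %:% when(i > 0) %do% sqrt(i)
print(unlist(roots))  # 3 4
```

The negative value and the zero are skipped, so the result has two elements rather than four.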

Exercise 7
Now let’s parallelize the execution of the foreach function. We’ll use it to read similarly named files, and perform identical calculations on data from each file.
As a first step, write a function to be run in parallel. The function takes an integer as input, and performs the following actions:

Create a string (character vector) with a file name by concatenating constant parts of the name (test_data_, .csv) with the integer (example of possible result when 1 is used as integer: test_data_1.csv).

Read the file with the obtained name from the current working directory into a data frame.

Calculate mean values for each column in the data frame.

Return a vector of those values.
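The building blocks for these steps might look as follows. This is only a sketch of the individual calls, not the function itself: it writes a throwaway file with a hypothetical name (demo_data_1.csv) to a temporary directory, whereas the exercise reads the real test files from the working directory.

```r
# Assumption: a small temporary file stands in for the real test data
file_id <- 1
file_name <- paste0("demo_data_", file_id, ".csv")  # hypothetical file name
file_path <- file.path(tempdir(), file_name)
write.csv(data.frame(x = c(1, 2, 3), y = c(10, 20, 30)),
          file_path, row.names = FALSE)

# Read the file into a data frame and compute the mean of each column
demo_df <- read.csv(file_path)
col_means <- colMeans(demo_df)
print(col_means)  # x = 2, y = 20

unlink(file_path)  # remove the temporary file
```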

Exercise 8
The second step is to create a backend for parallel execution:

Make a cluster for parallel execution using the makeCluster function from the parallel package; pass the size of the cluster (i.e. the number of CPU cores that you want to be used in computations) as an argument to this function.

Register the cluster with the registerDoParallel function from the doParallel package.

Note that by default the makeCluster function creates a PSOCK cluster, which is an enhanced version of the SOCK cluster implemented in the snow package. Accordingly, the PSOCK cluster is a pool of worker processes that exchange data with the master process via sockets. The makeCluster function can also create other types of clusters.
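A minimal sketch of the two steps, assuming a cluster of two workers (the variable names cl and n_workers are arbitrary). The getDoParWorkers function from foreach is used here only to check that the registration took effect, and the cluster is stopped at the end so that the sketch leaves nothing running.

```r
library(foreach)
library(parallel)
library(doParallel)

n_workers <- 2                 # assumption: at least two CPU cores are available
cl <- makeCluster(n_workers)   # PSOCK cluster by default
registerDoParallel(cl)         # make the cluster the backend for %dopar%

n_registered <- getDoParWorkers()
print(n_registered)  # 2

# Clean up for this self-contained sketch
stopCluster(cl)
registerDoSEQ()      # fall back to the sequential backend
```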

Exercise 9
The last step is to run the foreach function to read and analyze 10 test files (contained in this archive) using the function created in Exercise 7. Combine the outputs of that function using rbind.
Perform this task twice:

with %do% operator, which evaluates the expression sequentially, and

with %dopar% operator, which evaluates the expression in parallel.

In both cases, measure the execution time using the system.time function. Print the result of the last run.
IMPORTANT: after completing parallel computations stop the cluster (created in Exercise 8) using the stopCluster function from the parallel package.
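The comparison can be tried on a toy task first. In the sketch below, a short Sys.sleep call stands in for reading and processing a file; the task, the cluster size of two, and all variable names are assumptions made for this illustration.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)    # assumption: two workers
registerDoParallel(cl)

# Sequential run: the four iterations are evaluated one after another
seq_time <- system.time(
  seq_res <- foreach(i = 1:4, .combine = rbind) %do% {
    Sys.sleep(0.2)      # stand-in for slow work such as reading a file
    c(i, i^2)
  }
)

# Parallel run: the iterations are distributed over the workers
par_time <- system.time(
  par_res <- foreach(i = 1:4, .combine = rbind) %dopar% {
    Sys.sleep(0.2)
    c(i, i^2)
  }
)

print(seq_time["elapsed"])
print(par_time["elapsed"])
print(par_res)

stopCluster(cl)         # always stop the cluster when done
registerDoSEQ()
```

Both runs return the same matrix; only the elapsed time should differ, and for very short tasks the overhead of communicating with the workers can outweigh the gain.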

Exercise 10
Modify the code written in Exercise 7 and Exercise 9 to calculate the mean and the variance of the values contained in the first column of each file. The resulting object must be a two-column matrix with the first column containing the means and the second column containing the variances (the number of rows must be equal to the number of files).
Repeat the actions listed in Exercise 8 to prepare a cluster for parallel execution, then run the modified code in parallel.
Print the result.
Stop the cluster.