Scenario 1: MapReduce

The application we will look at is a document management application that allows users to share and tag documents. Each document is therefore associated with a set of users and tags. One possible job would be to calculate the total number of tags attached to documents visible to each user. This information could, for example, be used to show each user a list of search suggestions consisting of the tags from all of their documents.

This is a typical use case for MapReduce. A MapReduce job first runs through the input data (all documents) and extracts specific data (the users and tags). It then reduces this data to a result (a set of tags for each user).

The ThreadPoolExecutor uses a fixed-size thread pool with numThreads worker threads. The set of documents is split into numThreads equal-sized parts, each of which is scheduled as a task for processing. Each task performs part of the MapReduce job for its documents. In a final step, once the ThreadPoolExecutor has completed all tasks, the partial results are combined.
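The steps above can be sketched as follows. The Document record, the mapReduce method, and the per-user tag sets are illustrative assumptions, not the application's actual types; the sketch only shows the split/submit/combine pattern:

```java
import java.util.*;
import java.util.concurrent.*;

public class ThreadPoolMapReduce {
    // Hypothetical document: the users who can see it and its tags.
    record Document(Set<String> users, Set<String> tags) {}

    static Map<String, Set<String>> mapReduce(List<Document> docs, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        // Split the documents into numThreads parts of (roughly) equal size.
        int chunk = (docs.size() + numThreads - 1) / numThreads;
        List<Future<Map<String, Set<String>>>> futures = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += chunk) {
            List<Document> part = docs.subList(i, Math.min(i + chunk, docs.size()));
            futures.add(pool.submit(() -> {
                // Map step: collect the tags per user for this part.
                Map<String, Set<String>> partial = new HashMap<>();
                for (Document d : part)
                    for (String u : d.users())
                        partial.computeIfAbsent(u, k -> new HashSet<>()).addAll(d.tags());
                return partial;
            }));
        }
        // Reduce step: merge all partial results once every task has finished.
        Map<String, Set<String>> result = new HashMap<>();
        for (Future<Map<String, Set<String>>> f : futures)
            f.get().forEach((u, tags) ->
                result.computeIfAbsent(u, k -> new HashSet<>()).addAll(tags));
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<Document> docs = List.of(
            new Document(Set.of("alice"), Set.of("java", "fork")),
            new Document(Set.of("alice", "bob"), Set.of("join")));
        System.out.println(mapReduce(docs, 2));
    }
}
```

Note that the combining step only starts after Future.get() has returned for every task, i.e. after the whole pool has drained.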

The ForkJoinPool is likewise initialised with numThreads worker threads. Here, however, numThreads describes only the approximate number of worker threads that actively process tasks. This is an important point, as a thread may have to wait for the results of forked tasks before it can complete its current task. Rather than simply waiting, the thread first attempts to execute queued tasks from other threads; only if none are available does it go idle. In that case, the ForkJoinPool reserves the right to compensate for the shortfall with an additional worker thread. Consequently, the number of threads in a ForkJoinPool can only be controlled approximately.
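This distinction is visible in the API itself: the constructor argument is called parallelism, a target for the number of actively working threads rather than a hard cap on the pool size. A minimal illustration:

```java
import java.util.concurrent.ForkJoinPool;

public class PoolSetup {
    public static void main(String[] args) {
        int numThreads = 4;
        // parallelism is a target for the number of actively working
        // threads, not a hard upper bound on the pool size: if workers
        // block while joining subtasks, the pool may temporarily start
        // additional, compensating threads.
        ForkJoinPool pool = new ForkJoinPool(numThreads);
        System.out.println(pool.getParallelism()); // 4
        pool.shutdown();
    }
}
```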

In order to keep the formulation of MapReduce jobs for the ForkJoinPool general, this example introduces an abstraction consisting of two interfaces:
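A possible shape of these two interfaces, reconstructed from the method names described below; the exact signatures in the original project may differ:

```java
import java.util.List;

// Input<T>: the input data of a task. It knows whether it is small
// enough to be computed directly or should be split into subtasks.
interface Input<T> {
    boolean shouldBeComputedDirectly();
    Output<T> computeDirectly();   // map step for this part of the input
    List<Input<T>> split();        // otherwise: split into new tasks
}

// Output<T>: the (partial) result of a task.
interface Output<T> {
    Output<T> reduce(Output<T> other); // combine two partial results
    T getResult();                     // query the overall result
}
```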

The type parameter T represents the output type of the MapReduce job as a whole. Input<T> represents the input data of a task and indicates whether the task should be split further or computed directly (shouldBeComputedDirectly()). Accordingly, an Input<T> can either be computed directly (computeDirectly()) or split into new tasks (split()). Output<T> represents the result of a task and supports both combining several such results into one (reduce()) and querying the overall result (getResult()).

These two interfaces can be used to define a MapReduceTask<T> class. Because the task needs to return a result, it is derived from RecursiveTask.

As soon as a worker thread executes a task, its compute() method is called. The first step is to check whether the task should be computed directly. If not, the input is split into new tasks, all but one of which are added to the local task queue using fork(), where they are also available for stealing by other threads. The remaining task is executed by calling compute() directly (which can in turn lead to further splitting). Finally, the task waits for the results of the forked tasks using join(), combines them, and returns the overall result.
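Putting this together, a MapReduceTask<T> following the description above might look as follows. The interfaces are repeated here in the hypothetical form discussed earlier so the example compiles on its own, and the Range/Sum types are a toy stand-in (summing an integer range) for the document and tag logic, which is not shown here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Assumed shapes of the two interfaces (see above); signatures may differ
// from the original project.
interface Input<T> {
    boolean shouldBeComputedDirectly();
    Output<T> computeDirectly();
    List<Input<T>> split();
}
interface Output<T> {
    Output<T> reduce(Output<T> other);
    T getResult();
}

class MapReduceTask<T> extends RecursiveTask<Output<T>> {
    private final Input<T> input;
    MapReduceTask(Input<T> input) { this.input = input; }

    @Override
    protected Output<T> compute() {
        if (input.shouldBeComputedDirectly())
            return input.computeDirectly();
        // Split into subtasks; fork all but one, compute the last inline.
        List<Input<T>> parts = input.split();
        List<MapReduceTask<T>> forked = new ArrayList<>();
        for (int i = 0; i < parts.size() - 1; i++) {
            MapReduceTask<T> task = new MapReduceTask<>(parts.get(i));
            task.fork(); // queue locally; available for work stealing
            forked.add(task);
        }
        Output<T> result = new MapReduceTask<>(parts.get(parts.size() - 1)).compute();
        // Join the forked subtasks and reduce everything into one result.
        for (MapReduceTask<T> task : forked)
            result = result.reduce(task.join());
        return result;
    }
}

public class Demo {
    // Toy Input: the sum of the integers in [from, to).
    record Range(int from, int to) implements Input<Integer> {
        public boolean shouldBeComputedDirectly() { return to - from <= 2; }
        public Output<Integer> computeDirectly() {
            int s = 0;
            for (int i = from; i < to; i++) s += i;
            return new Sum(s);
        }
        public List<Input<Integer>> split() {
            int mid = (from + to) / 2;
            return List.of(new Range(from, mid), new Range(mid, to));
        }
    }
    record Sum(int value) implements Output<Integer> {
        public Output<Integer> reduce(Output<Integer> other) {
            return new Sum(value + other.getResult());
        }
        public Integer getResult() { return value; }
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(4);
        Integer sum = pool.invoke(new MapReduceTask<>(new Range(0, 10))).getResult();
        System.out.println(sum); // 0 + 1 + ... + 9 = 45
        pool.shutdown();
    }
}
```

Forking all but one subtask and computing the last one directly avoids an unnecessary queue round-trip for work the current thread can do itself.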

The actual "business logic" (calculation of the tags) is located in the implementations of Input<T> and Output<T>, which can be found in the GitHub repository.