A ForkJoinPool implements ExecutorService, but it differs from other ExecutorService implementations (I'll refer to them as 'traditional') mainly by employing work-stealing: all threads in the pool attempt to find and execute subtasks created by other active tasks. This automatically balances the task load between threads, whereas a traditional ThreadPoolExecutor has no mechanism for this kind of load balancing. If no worker thread is available, submitted tasks wait in a queue until one becomes free, and a worker that runs out of its own work steals queued tasks from workers that are still busy.

ForkJoinPool is built for divide-and-conquer algorithms: a central ForkJoinPool executes branching ForkJoinTasks. A ForkJoinTask is a thread-like entity, but it is much lighter weight than a normal thread. Huge numbers of tasks and subtasks may be hosted by a small number of actual threads in a ForkJoinPool. Because ForkJoinPool is an ExecutorService, its logic is a kind of 'submit a callable' approach to multithreaded programming. It:

separates (forks) each large task into smaller tasks;

processes each smaller task, possibly on a different worker thread (separating those into even smaller tasks if necessary);

joins the results.

If so, how does ForkJoinPool differ from the traditional 'submit a callable' approach available since Java 5?

ForkJoinTasks in ForkJoinPool are lighter than threads in traditional ExecutorService (thread pool). In Fork/Join, a large number of tasks can be hosted by a smaller number of threads because of work-stealing.

Think of a ForkJoinPool as a pool of small tasks, and of a traditional ExecutorService as a pool of threads.
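To make the 'submit a callable' common ground concrete, here is a minimal sketch in which the same Callable is submitted to a fixed thread pool and to a ForkJoinPool through the shared ExecutorService interface. The class name SubmitCallableDemo, the helper sumTo(), and the pool sizes are assumptions made up for this illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

// Hypothetical demo class: both pools are used only through ExecutorService.
class SubmitCallableDemo {

    // Submits one Callable that sums 1..limit, waits for it, then shuts the executor down.
    static int sumTo(ExecutorService executor, int limit) {
        Callable<Integer> job = () -> {
            int sum = 0;
            for (int i = 1; i <= limit; i++) {
                sum += i;
            }
            return sum;
        };
        try {
            return executor.submit(job).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // Traditional thread pool.
        System.out.println(sumTo(Executors.newFixedThreadPool(2), 100));
        // ForkJoinPool, used exactly the same way.
        System.out.println(sumTo(new ForkJoinPool(2), 100));
    }
}
```

Used this way the two pools look identical to the caller; the difference only shows up once tasks start forking subtasks of their own.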

To divvy up a bigger task into smaller ones, you extend RecursiveTask and implement its compute() method. Inside compute(), you divide and conquer, then return the result after the join. In a RecursiveTask, compute() plays a role similar to the run() method of Thread/Runnable and the call() method of the Callable interface. For example, the following example recursively executes sub-tasks to calculate a Fibonacci number:
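A minimal sketch of such a Fibonacci task might look like this. The class name FibonacciTask, the pool size of 4, and the input 20 are assumptions chosen for illustration, and the naive recursion is meant to show the API rather than an efficient way to compute Fibonacci numbers. The [1] and [2] comments mark the spots that the notes below refer to.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example task: computes the n-th Fibonacci number recursively.
class FibonacciTask extends RecursiveTask<Integer> {
    private final int n;

    FibonacciTask(int n) {
        this.n = n;
    }

    // [1] The main computation performed by this task.
    @Override
    protected Integer compute() {
        if (n <= 1) {
            return n; // ending condition of the recursion
        }
        FibonacciTask f1 = new FibonacciTask(n - 1);
        FibonacciTask f2 = new FibonacciTask(n - 2);
        f1.fork(); // schedule both subtasks in the pool
        f2.fork();
        // Join in reverse order of forking, then combine the results.
        return f2.join() + f1.join();
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            // [2] invoke() runs the task and returns what compute() returned.
            int result = pool.invoke(new FibonacciTask(20));
            System.out.println("fib(20) = " + result);
        } finally {
            pool.shutdown();
        }
    }
}
```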

[1] The main computation performed by this task. You must define this method, but in general you should not call it directly. Implement compute() as if it were a recursive function with an ending condition. The compute() of a RecursiveTask<V> returns a V, the type parameter given in the task's declaration.

[2] Performs the given ForkJoinTask, returning its result (a V) upon completion. This V is what the compute() method of the RecursiveTask returns. Usually, more tasks are invoked from within compute(). A ForkJoinTask is a thread-like entity that also implements Future, so it can be thought of as a lightweight form of Future.

Let's zero in on the compute() method. This is where you divvy up bigger tasks into smaller ones and invoke each of them:
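As a sketch under stated assumptions, a compute() built on invokeAll() and get() could look like the following. The class name FibonacciInvokeAll and the exception handling are assumptions for this illustration; get() declares checked exceptions, so they are wrapped here, whereas join() would return the same value without them. The [1] and [2] comments correspond to the notes that follow.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical variant of a Fibonacci task that uses invokeAll() and get().
class FibonacciInvokeAll extends RecursiveTask<Integer> {
    private final int n;

    FibonacciInvokeAll(int n) {
        this.n = n;
    }

    @Override
    protected Integer compute() {
        if (n <= 1) {
            return n; // small enough: answer directly
        }
        // Divide: one subtask per smaller problem.
        FibonacciInvokeAll f1 = new FibonacciInvokeAll(n - 1);
        FibonacciInvokeAll f2 = new FibonacciInvokeAll(n - 2);

        // [1] Runs both subtasks and returns only when both have finished.
        invokeAll(f1, f2);

        try {
            // [2] Both subtasks are complete, so get() returns immediately.
            return f1.get() + f2.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            System.out.println("fib(15) = " + pool.invoke(new FibonacciInvokeAll(15)));
        } finally {
            pool.shutdown();
        }
    }
}
```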

[1] When a task calls the invokeAll() method, it waits until the tasks passed to that method finish.

[2] The value from each subtask is obtained with the get() method from the Future interface.

Performance comparison - ForkJoinPool vs. ThreadPoolExecutor

With a workload that is unevenly distributed among tasks/threads, ForkJoinPool achieves better results, while a traditional ExecutorService suffers from the uneven distribution. However, if the tasks are broken up into sub-tasks that are too small, ForkJoinPool performance will suffer, because the overhead of creating and scheduling tasks outweighs the work each one does. (see this benchmark)
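One common way to keep sub-tasks from getting too small is a size threshold below which a task stops splitting and does the work sequentially. The sketch below illustrates that idea only; the class name ThresholdSum, its constructor parameters, and the threshold value of 10,000 are assumptions for this example, and a suitable threshold has to be found by measurement.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical task: sums data[from..to) and stops splitting below a threshold.
class ThresholdSum extends RecursiveTask<Long> {
    private final long[] data;
    private final int from;
    private final int to;
    private final int threshold;

    ThresholdSum(long[] data, int from, int to, int threshold) {
        this.data = data;
        this.from = from;
        this.to = to;
        this.threshold = Math.max(1, threshold); // a threshold below 1 would never stop splitting
    }

    @Override
    protected Long compute() {
        if (to - from <= threshold) {
            // Small enough: sum sequentially instead of creating more tasks.
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        // Too large: split the range in half and run both halves as subtasks.
        int mid = (from + to) >>> 1;
        ThresholdSum left = new ThresholdSum(data, from, mid, threshold);
        ThresholdSum right = new ThresholdSum(data, mid, to, threshold);
        invokeAll(left, right);
        return left.join() + right.join();
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        ForkJoinPool pool = new ForkJoinPool();
        try {
            System.out.println(pool.invoke(new ThresholdSum(data, 0, data.length, 10_000)));
        } finally {
            pool.shutdown();
        }
    }
}
```

A threshold of 1 yields the same sum but creates one task per element, which is the too-fine-grained case described above.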