Thanks for your detailed info. In my case, I expect to spawn multiple threads from each MPI process. I could use MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED to do so - I think MPI_THREAD_MULTIPLE is not supported on InfiniBand, which I am using. Currently, I use OpenMPI + Boost::Thread - no plan to shift to Boost::MPI yet.

I still have a couple of questions to ask:

1. In both MPI_THREAD_FUNNELED and MPI_THREAD_SERIALIZED modes, MPI calls are effectively serialized: in the former, only the rank's main thread can make MPI calls, while in the latter, the threads must be coordinated so that only one thread makes an MPI call at a time. Are there any performance implications in choosing between FUNNELED and SERIALIZED?

2. My current code uses many MPI collective calls (gather/scatter/broadcast, etc.). These collective calls seem to hurt performance because ALL MPI processes need to wait on each of them. I would like to explore the idea of decoupling computation from MPI communication, so that if one thread of an MPI rank is blocked in an MPI call, the other threads can still make progress. Could I still make MPI calls from the other, non-blocked threads in MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED mode (assuming that the blocked thread is the rank's main thread)?
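
For reference, the coordination that MPI_THREAD_SERIALIZED asks for in question 1 can be sketched as a single lock around every MPI call. This is only a minimal illustration under stated assumptions: `mpi_mutex`, `fake_mpi_call`, `serialized_mpi_call` and `run_serialized_demo` are hypothetical names, the MPI call is a stub so the sketch builds without an MPI installation, and std::thread/std::mutex stand in for boost::thread/boost::mutex, which have the same shape.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical process-wide lock; every MPI call in the rank goes through it.
std::mutex mpi_mutex;

// Bookkeeping used only to observe how many threads are "inside MPI" at once.
std::atomic<int> callers_inside{0};
std::atomic<int> max_callers_inside{0};

// Hypothetical stand-in for a real MPI call such as MPI_Send or MPI_Bcast.
void fake_mpi_call() {
    int now = ++callers_inside;
    int prev = max_callers_inside.load();
    // Record the highest concurrency level seen so far.
    while (prev < now && !max_callers_inside.compare_exchange_weak(prev, now)) {
    }
    std::this_thread::yield();
    --callers_inside;
}

// Under MPI_THREAD_SERIALIZED any thread may call MPI, but never two at once,
// so each call is wrapped in the lock.
void serialized_mpi_call() {
    std::lock_guard<std::mutex> guard(mpi_mutex);
    fake_mpi_call();
}

// Runs num_threads workers that each make calls_per_thread guarded calls and
// returns the maximum number of threads observed inside the stub at once.
int run_serialized_demo(int num_threads, int calls_per_thread) {
    assert(num_threads > 0 && calls_per_thread > 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) {
                serialized_mpi_call();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return max_callers_inside.load();
}
```

Note that a thread blocked inside an MPI call keeps holding the lock, so with this scheme the other threads can keep computing but cannot enter MPI until that call returns.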

I'm a regular reader of this list but seldom a poster. In this case, however, I might actually be qualified to answer some questions or provide some insight, given that I'm not sure how many other folks here use Boost.Thread. The first question is really what sort of threading model you want to use with MPI, which others here are probably more qualified to advise you on.

In our applications we're using Boost.Thread with MPI_THREAD_MULTIPLE, which is a not altogether enjoyable experience because the openib BTL lacks support for thread multiple (at least as of the last time I checked). That being said, Boost.Thread behaves just like any pthread code on the Linux clusters we run on, as well as on one BlueGene/P. With MPI_THREAD_SERIALIZED, writing hybrid-parallel code is pretty painless. Most of the work involved adding two-stage collectives, in which the threads first perform the collective locally and then a single thread participates in the MPI collective operation.
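
To make the two-stage idea concrete, here is a rough sketch of a sum reduction, not our actual code. The names `TwoStageSum`, `fake_mpi_allreduce_sum` and `run_two_stage_demo` are made up for illustration, the MPI stage is a stub so the example builds without MPI, and std::thread is used in place of boost::thread (the Boost.Thread types have the same shape).

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the inter-process stage. In real code this would
// be an MPI collective such as MPI_Allreduce, made by one thread per rank.
long fake_mpi_allreduce_sum(long local_sum) {
    return local_sum;  // pretend there is a single rank
}

class TwoStageSum {
public:
    explicit TwoStageSum(int num_threads) : num_threads_(num_threads) {
        assert(num_threads > 0);
    }

    // Every thread in the rank calls this with its contribution; all of them
    // get the global sum back.
    long allreduce(long value) {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long my_generation = generation_;
        partial_ += value;  // stage 1: combine within the process
        if (++arrived_ == num_threads_) {
            // Stage 2: the last thread to arrive makes the one MPI call.
            // Any thread can end up here, hence MPI_THREAD_SERIALIZED.
            result_ = fake_mpi_allreduce_sum(partial_);
            partial_ = 0;
            arrived_ = 0;
            ++generation_;
            cv_.notify_all();
            return result_;
        }
        // Everyone else waits until the current round has completed.
        cv_.wait(lock, [&] { return generation_ != my_generation; });
        return result_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const int num_threads_;
    int arrived_ = 0;
    unsigned long generation_ = 0;
    long partial_ = 0;
    long result_ = 0;
};

// Thread i contributes i + 1; returns the value each thread received.
std::vector<long> run_two_stage_demo(int num_threads) {
    TwoStageSum reducer(num_threads);
    std::vector<long> results(num_threads, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([&reducer, &results, i] {
            results[i] = reducer.allreduce(i + 1);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return results;
}
```

Calling `run_two_stage_demo(4)` gives every thread the sum of 1 through 4. With MPI_THREAD_FUNNELED you would instead have to hand the partial result to the main thread, since only it may call MPI.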

If you end up using Boost.MPI, you could probably even write your own wrappers to encapsulate the local computation required for MPI collective operations. Unfortunately, Boost.MPI currently lacks full support for even MPI-2, but if it includes the subset of functionality you need, it may be worthwhile. Extensions are fairly straightforward to implement as well.

I've implemented a few different approaches to MPI + threading in the context of Boost, from explicit thread management to thread pools, and currently a complete runtime system. Most of it is research code, though there's no reason it couldn't be released, and some of it probably will be eventually. If you'd like to describe your intended use case I'm happy to offer any advice I can based on what I've learned.

Cheers,
Nick

On Apr 22, 2013, at 3:25 PM, Thomas Watson wrote:

> Hi,
>
> I would like to create a pool of threads (using Boost::Thread) within each OpenMPI process to accelerate my application on multicore CPUs. My application is already built on OpenMPI, but it currently exploits parallelism only at the process level.
>
> I am wondering if anyone can point me to some good tutorials/documents/examples on how to integrate Boost multithreading with OpenMPI applications?
>
> Thanks!
>
> Jacky