Obviously, there is a problem with the progress of asynchronous
messages.

How can I avoid this problem?

I'm no expert, but I think the problem is that the send is
"progressed" (advanced) only during MPI calls, and MPI_Test doesn't
progress the message very aggressively. The message is probably
being broken into chunks, and each MPI_Test call will advance the
message by at most one chunk. So:

1) You could decrease the time between MPI_Test calls.
2) You could block (e.g., with MPI_Wait).
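
To put rough numbers on option 1, here is a toy model in C. It is not
Open MPI's actual progress engine; it only takes my guess above at face
value (the message is split into chunks, and each MPI_Test call advances
it by at most one chunk). The function names and the chunk size used in
the example are made up for illustration.

```c
#include <assert.h>

/* Toy model only: this is NOT how Open MPI is implemented.
 * Assumption: each MPI_Test-like call advances the send by at most
 * one chunk of chunk_bytes bytes. */

/* Minimum number of test calls before the whole message has moved. */
long calls_to_complete(long message_bytes, long chunk_bytes)
{
    assert(message_bytes >= 0 && chunk_bytes > 0);
    /* Ceiling division: a partial final chunk still costs one call. */
    return (message_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Lower bound on elapsed time (microseconds) if the application does
 * interval_us of other work before each test call. */
long min_completion_time_us(long message_bytes, long chunk_bytes,
                            long interval_us)
{
    assert(interval_us >= 0);
    return calls_to_complete(message_bytes, chunk_bytes) * interval_us;
}
```

Under these assumptions, a 1 MiB message with a hypothetical 64 KiB
chunk size needs 16 calls. With 1000 us of work between calls, the send
takes at least 16000 us; with 100 us between calls, at least 1600 us.
Blocking in MPI_Wait (option 2) removes the gaps, at the cost of not
returning control to the application until the send is done.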

It's a tough tradeoff to make. That's bad news... but do you want OMPI
to be making the tough choices here for you? Let's say the sending
process sends a chunk, and it takes a little while for the receiver to
process the data and make room for you to send some more. During that
waiting time, should the sender return control to the user application,
or stay blocked inside MPI_Test?