In a test involving the scaling of an array in chunks of equal size, I observe the expected behavior consistently(!) only when the size of messages is below a certain threshold (which itself seems to vary between MPI implementations). If the test below is repeated several times with longer messages, it fails some of the time. As far as I can tell, when there is a failure, the data is good inside the functions FSOW, FWORK, and FREAP, but one or several chunks of the array "x" are incorrect on exit... I also noticed that when there is a failure, some of the jobs/chunks are not submitted (only some of them), and some are processed (in fwork) twice. Typically, if a job is not submitted, the following one is submitted twice.

The test is a simplified template for a radar subaperture focusing algorithm which dispatches chunks of data for parallel processing. mpool could be very useful for processing large "raster" data sets in parallel. The plan was to have yorick handle the parallelization and have a C plugin handle the bulk of the low-level processing.

You misdiagnosed this. It is another unintended consequence of the "bug" I agreed to "fix" for Eric: if the first reference to a symbol in a function is as a keyword, the original yorick implicitly declared that symbol extern; it is now implicitly declared local. The offending line in this case is mpool.i:414. Adding "extern vsave;" as the first line of the mpool_test function fixes the problem. As I said, I'll regret this change... This type of reference really needs to leave the symbol undecided.
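A minimal sketch of the failure mode described above (the function name and call are hypothetical; only the "extern vsave;" workaround comes from the fix itself):

Code:

func scope_demo(void)
{
  /* Workaround for the changed scoping rule: without this line, the
   * first appearance of the symbol vsave in this function (as the
   * keyword name in the call below) implicitly declares it local,
   * so the builtin vsave is hidden inside this function. */
  extern vsave;
  /* hypothetical call -- a keyword-position reference analogous
   * to the one at mpool.i:414 */
  pool = mpool(fsow, fwork, freap, vsave=1);
  return pool;
}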

Sat Jan 29, 2011 11:01 am

Thierry Michel

Yorick Guru

Joined: Sat Jan 22, 2005 2:44 pm | Posts: 86 | Location: Pasadena, CA

Re: mpy mpool: problem with message size

I have now replicated the second problem -- mpool errors when the message length increases -- on several linux systems and MPI implementations (MPICH and Open MPI). In short: when the MPI message size grows to several hundred to a few thousand doubles, some jobs are not submitted, the following chunk is then submitted twice, and this loss of synchronization cascades through all the jobs (see code in the original post). I would appreciate help in debugging the problem, or even just some advice on how to go about it. I have spent some time on it, but I do not see a clear way forward. The application I need to parallelize is essentially a tiled 2-D correlation with a space-variant kernel. The MPI message size is dictated by the tile size. The plan was to move the computationally intensive tile processing to a C plugin and to let mpy handle the MPI jobs, as I would rather keep MPI out of the lower-level code.

Unless you have access to totalview, the only way I found to debug this was to start mpy, then attach a copy of gdb to each of the processes while they are blocked waiting for input (from you at the keyboard in the case of rank 0, from rank 0 for the others). Then set appropriate breakpoints, continue all the jobs, and give rank 0 the keyboard input to start your parallel task. Before you do all of this, you should strive mightily to produce an example that exhibits your bug with only two processes, so you only have to deal with two instances of gdb. Setting good breakpoints is even more difficult than usual. Let me know if you make any progress. I haven't had time to look at this yet. I'm most interested in MPICH, if you have nothing else to decide which MPI to test. Good luck.

Sun May 08, 2011 1:42 pm

Thierry Michel

Yorick Guru

Joined: Sat Jan 22, 2005 2:44 pm | Posts: 86 | Location: Pasadena, CA

Re: mpy mpool: problem with message size

Thank you for the hints. I do now remember an earlier post in which you suggested using totalview (no access to that yet). I'll post an update if I make progress.

I can reproduce a similar problem with 2 CPUs, but only when using "self=1" (with testmp.i as above):

I believe I have found and fixed this bug. Please retry with the latest GitHub source (commit 7c3b8b84d359b36ccb64).

There was a serious problem that would have broken pretty much any mpy program eventually: The mp_send function did not block until all messages were sent, because I failed to pass the correct count of sent messages to the wait routine. Thus, the rank which called mp_send could continue processing and reuse the buffer memory before MPI had copied the message to the receiving process (or into its own buffers). This is most likely to cause problems with large messages because MPI tends to immediately copy small messages into its own buffers.
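This is the classic hazard with nonblocking MPI sends. An illustrative C sketch of what the corrected logic must amount to (this is not the actual mpy source; the function and its arguments are invented for the example):

Code:

#include <mpi.h>

/* Post nonblocking sends for nmsg message buffers, then wait for
 * ALL of them before the caller may reuse the buffers.  Waiting on
 * fewer than nmsg requests -- the mp_send bug described above --
 * leaves some sends in flight, so a buffer can be overwritten
 * before MPI has copied it out.  Small messages often survive
 * because MPI tends to buffer them eagerly; large ones do not. */
void send_all(void **bufs, int *lens, int *dest, int nmsg, MPI_Comm comm)
{
  MPI_Request req[nmsg];
  for (int i = 0; i < nmsg; i++)
    MPI_Isend(bufs[i], lens[i], MPI_BYTE, dest[i], 0, comm, &req[i]);
  /* the fix: the count passed here must match the number posted */
  MPI_Waitall(nmsg, req, MPI_STATUSES_IGNORE);
}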

With any luck, this will make many of the growing pains mpy has had disappear. Please keep posting mpy problems here -- the bugs are quite probably real; it is very immature code. Thierry did a model job of reducing the bug to a manageable test code that I could use to replicate the problem. Good work.

Nevertheless, let me suggest the following form for a file containing a function which calls mpool.i, which you can use as a template for your own pool-of-tasks functions. I've stripped out the vpack/vsave choice, because you should only choose vsave if you need to pass structs or pointers (and you should avoid needing to do that), and the self=1 choice, which will typically slow you down by more than the extra processor can speed you up (by causing everyone else to block while rank 0 is working on a task). The mpbug function is intended to be called in serial mode from rank 0 (from the keyboard, a -i command line option, or a serial include).

Code:

/* All ranks must read this, so use mp_include or -j option on
 * command line.  Therefore, this will always be executed in parallel
 * mode, and the ordinary require statement is the correct way to
 * handle dependencies. */
require, "mpool.i";

Dave, thank you very much. I had a vague idea that it had to do with blocking, but I am not confident that I would have cracked it, even with plenty of time. This opens up a lot of possibilities. Thank you.
