OK, I think I understand now. Although the standard doesn't specify what to do in an error case, I agree that having all processes return an error would be a good thing to do.
I've created a ticket for this issue. You can add yourself in the cc list if you want to be notified on the progress.
https://trac.mcs.anl.gov/projects/mpich2/ticket/1119
Thanks for reporting this.
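For anyone following the ticket, here is a minimal standalone sketch of the pattern under discussion. The names (e.g. port_name) are illustrative, not taken from John's code, and it assumes no server is listening on the port: every rank installs MPI_ERRORS_RETURN and checks the return code of MPI_Comm_connect, which only works if all ranks actually return on failure.

```c
/* Sketch only (illustrative names, assumed setup): every rank installs
 * MPI_ERRORS_RETURN and checks the result of MPI_Comm_connect. With the
 * behavior reported below, only rank 0 reaches the error branch; the
 * ticket tracks making all ranks return an error. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    /* Assumption for the sketch: no server is listening on this port,
     * so the connect is expected to fail. */
    char port_name[MPI_MAX_PORT_NAME] = "made-up-port";
    int error_code;

    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    error_code = MPI_Comm_connect(port_name, MPI_INFO_NULL, 0,
                                  MPI_COMM_WORLD, &intercomm);
    if (error_code != MPI_SUCCESS) {
        char error_string[MPI_MAX_ERROR_STRING];
        int length;
        MPI_Error_string(error_code, error_string, &length);
        fprintf(stderr, "MPI_Comm_connect failed: %s\n", error_string);
        /* Until all ranks return the error, explicitly aborting everyone
         * (as suggested below) is the workaround. */
        MPI_Abort(MPI_COMM_WORLD, error_code);
    }

    MPI_Finalize();
    return 0;
}
```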
-d
On Oct 15, 2010, at 4:10 AM, Biddiscombe, John A. wrote:
> I don't think I explained very well what I meant ....
> I call
>   int error_code = MPI_Comm_connect(this->DsmMasterHostName, MPI_INFO_NULL, 0, this->Comm, &this->InterComm);
> but internally, MPI detects an error (there is no process to connect to) and aborts the operation, returning via the error handler. But only rank 0 aborts the operation; the other ranks wait forever for nothing to happen. Because I want to handle the abort gracefully, I have set the error handler to MPI_ERRORS_RETURN, but ranks 1 -> N never return.
> I would like MPI to detect an error on all ranks - not just rank 0. I don't want the app to exit at all. If I use the error handler MPI_ERRORS_ARE_FATAL, then all is fine: the app terminates as expected.
> I am trying to determine if there is a bug in the MPI_Comm_connect routine, because rank 0 detects that there is nobody to connect to, but the other ranks do not, and the app hangs forever. I wondered if inside MPICH2 (I'm using 1.3rc2 on win32) rank 0 should somehow tell ranks 1 -> N that the connect has failed, so they could also abort and return to the user code.
> Hope I explained it better this time.
> JB
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
> Sent: 14 October 2010 20:23
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] MPI_Comm_Connect bug?
>> What version of MPICH2 are you using? What command-line parameters did you use for mpiexec?
>> Normally, if one process exits without calling MPI_Finalize, the process manager will abort all processes. There is an option for (the Hydra version of) mpiexec that disables this behavior.
>> If you want all processes to abort, you should call MPI_Abort(MPI_COMM_WORLD, errorcode) to abort all processes.
>> -d
>> On Oct 14, 2010, at 1:11 PM, Biddiscombe, John A. wrote:
>>> To try to catch a problem that occurs when MPI_Comm_connect fails, I wrapped the call with an error handler, with the aim of gracefully exiting.
>>> Rank 0 detects an error, aborts, and displays the message, but the other ranks hang waiting for something to happen. I think that when rank 0 aborts, it should first signal the other ranks to also abort.
>>> Am I doing it wrong, or is this a bug?
>>> Thanks; snippet below.
>>> JB
>>> MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>> int error_code = MPI_Comm_connect(this->DsmMasterHostName, MPI_INFO_NULL, 0, this->Comm, &this->InterComm);
>>> if (error_code == MPI_SUCCESS) {
>>>   H5FDdsmDebug("Id = " << this->Id << " MPI_Comm_connect returned SUCCESS");
>>>   isConnected = H5FD_DSM_SUCCESS;
>>> } else {
>>>   char error_string[1024];
>>>   int length_of_error_string;
>>>   MPI_Error_string(error_code, error_string, &length_of_error_string);
>>>   H5FDdsmError("\nMPI_Comm_connect failed with error : \n" << error_string << "\n\n");
>>> }
>>> // reset to MPI_ERRORS_ARE_FATAL for normal debug purposes
>>> MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
>>> --
>>> John Biddiscombe, email: biddisco @ cscs.ch
>>> http://www.cscs.ch/
>>> CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
>>> Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss