No algorithmic changes were made, so this will not fix the problem;
it will just give you some insight into the checkpoint's activity.

-- Josh

On May 14, 2008, at 1:11 PM, Josh Hursey wrote:

> Tamer,
>
> How much communication does your application tend to do? As reported
> below, if there is a lot of communication between checkpoints, then it
> may take a while to checkpoint the application, since the current
> implementation of the coordination algorithm checks every message at
> checkpoint time. So what you are seeing may be that the checkpoint is
> taking an extremely long time to clear the channel.
>
> I have a few things in the works that attempt to fix this problem.
> They are not ready just yet, but I'll make it known when they are. You
> can get some diagnostics by setting "-mca crcp_coord_verbose 10" on
> the command line, but it is fairly coarse-grained at the moment (I
> have some improvements in the pipeline here as well).
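>
> For example (a hypothetical launch line; substitute your own
> executable and process count):
>
>   shell$ mpirun -np 12 -am ft-enable-cr \
>              -mca crcp_coord_verbose 10 ./your_app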
>
> Cheers,
> Josh
>
> On May 13, 2008, at 3:42 PM, Tamer wrote:
>
>> Hi Josh: I am currently using Open MPI r18291, and when I run a
>> 12-task job on 3 quad-core nodes I am able to checkpoint and restart
>> several times at the beginning of the run. However, after a few
>> hours, when I try to checkpoint, the code just hangs; it won't
>> checkpoint and won't give me an error message. Has this problem been
>> reported before? All the required executables and libraries are in
>> my path.
>>
>> Thanks,
>> Tamer
>>
>>
>> On Apr 29, 2008, at 1:37 PM, Sharon Brunett wrote:
>>
>>> Thanks, I'll try the version you recommend below!
>>>
>>> Josh Hursey wrote:
>>>> Your previous email indicated that you were using r18241. In
>>>> r18276 I committed a patch that should fix this problem. Let me
>>>> know if you still see it after that update.
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>> On Apr 29, 2008, at 3:18 PM, Sharon Brunett wrote:
>>>>
>>>>> Josh,
>>>>> I'm also having trouble using ompi-restart on a snapshot made
>>>>> from a run which was previously checkpointed. In other words,
>>>>> restarting a previously restarted run!
>>>>>
>>>>> (a) start the run
>>>>> mpirun -np 16 -am ft-enable-cr ./a.out
>>>>>
>>>>> <-- do an ompi-checkpoint on the mpirun PID from (a) from another
>>>>> terminal -->
>>>>>
>>>>> (b) restart the checkpointed run
>>>>>
>>>>> ompi-restart ompi_global_snapshot_30086.ckpt
>>>>>
>>>>> <-- do an ompi-checkpoint on the mpirun PID from (b) from another
>>>>> terminal -->
>>>>>
>>>>> (c) restart the checkpointed run
>>>>> ompi-restart ompi_global_snapshot_30120.ckpt
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 12 with PID 30480 on node shc005
>>>>> exited
>>>>> on signal 13 (Broken pipe).
>>>>> --------------------------------------------------------------------------
>>>>> -bash-2.05b$
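>>>>>
>>>>> (For reference, the checkpoint steps above invoke ompi-checkpoint
>>>>> on the mpirun PID, e.g. "ompi-checkpoint 30086" for step (a); the
>>>>> PID is reconstructed here from the snapshot name.)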
>>>>>
>>>>> I can restart the previous (30086) ckpt but not the latest one
>>>>> made
>>>>> from
>>>>> a restarted run.
>>>>>
>>>>> Any insights would be appreciated.
>>>>>
>>>>> thanks,
>>>>> Sharon
>>>>>
>>>>>
>>>>>
>>>>> Josh Hursey wrote:
>>>>>> Sharon,
>>>>>>
>>>>>> This is, unfortunately, to be expected at the moment for this
>>>>>> type of application. Extremely communication-intensive
>>>>>> applications will most likely cause the current implementation of
>>>>>> the coordination algorithm to slow down significantly. This is
>>>>>> because, on a checkpoint, Open MPI does a peer-wise check on the
>>>>>> description of (possibly) every message to make sure there are no
>>>>>> messages in flight. So for a huge number of messages this can
>>>>>> take a long time.
>>>>>>
>>>>>> This is a performance problem with the current implementation of
>>>>>> the algorithm that we use in Open MPI. I've been meaning to go
>>>>>> back and improve this, but it has not been critical to do so,
>>>>>> since applications that behave in this manner are outliers in
>>>>>> HPC. The coordination algorithm I'm using is based on the one
>>>>>> used by LAM/MPI, but implemented at a higher level. There are a
>>>>>> number of improvements that I can explore in the
>>>>>> checkpoint/restart framework in Open MPI.
>>>>>>
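>>>>>> To give a rough sense of why this scales with the number of
>>>>>> messages, here is a minimal sketch of the kind of per-peer
>>>>>> bookkeeping such a protocol does (hypothetical pseudo-C; the
>>>>>> drain_one() helper is made up, and this is not Open MPI's actual
>>>>>> code):
>>>>>>
>>>>>>   #include <mpi.h>
>>>>>>
>>>>>>   enum { MAX_PEERS = 1024 };
>>>>>>   /* updated by the normal send/receive paths */
>>>>>>   static long sent_to[MAX_PEERS], recvd_from[MAX_PEERS];
>>>>>>
>>>>>>   /* hypothetical: receive and buffer one in-flight message */
>>>>>>   void drain_one(int peer, MPI_Comm comm);
>>>>>>
>>>>>>   /* Before checkpointing, agree with each peer on how many
>>>>>>    * messages are outstanding, then drain them all. */
>>>>>>   void quiesce(MPI_Comm comm) {
>>>>>>       int nprocs;
>>>>>>       MPI_Comm_size(comm, &nprocs);
>>>>>>       for (int peer = 0; peer < nprocs; peer++) {
>>>>>>           long peer_sent = 0;
>>>>>>           MPI_Sendrecv(&sent_to[peer], 1, MPI_LONG, peer, 0,
>>>>>>                        &peer_sent, 1, MPI_LONG, peer, 0,
>>>>>>                        comm, MPI_STATUS_IGNORE);
>>>>>>           while (recvd_from[peer] < peer_sent) {
>>>>>>               drain_one(peer, comm);
>>>>>>               recvd_from[peer]++;
>>>>>>           }
>>>>>>       }
>>>>>>       /* all channels empty; safe to take the checkpoint */
>>>>>>   }
>>>>>>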
>>>>>> If this is critical for you I might be able to take a look at
>>>>>> it, but
>>>>>> I can't say when. :(
>>>>>>
>>>>>> -- Josh
>>>>>>
>>>>>> On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:
>>>>>>
>>>>>>> Josh Hursey wrote:
>>>>>>>> On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
>>>>>>>>
>>>>>>>>> I'm finding that using ompi-checkpoint on an application
>>>>>>>>> which is
>>>>>>>>> very cpu bound takes a very very long time. For example,
>>>>>>>>> trying to
>>>>>>>>> checkpoint a 4 or 8 way Pallas MPI Benchmark application can
>>>>>>>>> take
>>>>>>>>> more than an hour. The problem is not where I'm dumping
>>>>>>>>> checkpoints
>>>>>>>>> (I've tried local and an nfs mount with plenty of space, and
>>>>>>>>> cpu
>>>>>>>>> intensive apps checkpoint quickly).
>>>>>>>>>
>>>>>>>>> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>>>>>>>>>
>>>>>>>>> Is this condition common, and if so, are there possibly MCA
>>>>>>>>> parameters which could help?
>>>>>>>> It depends on how you configured Open MPI with checkpoint/
>>>>>>>> restart.
>>>>>>>> There are two modes of operation: No threads, and with a
>>>>>>>> checkpoint
>>>>>>>> thread. They are described a bit more in the Checkpoint/Restart
>>>>>>>> Fault
>>>>>>>> Tolerance User's Guide on the wiki:
>>>>>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>>>>>>> By default we compile without the checkpoint thread. The
>>>>>>>> restriction here is that all processes must be in the MPI
>>>>>>>> library in order to make progress on the global checkpoint. For
>>>>>>>> CPU-intensive applications this may cause quite a delay in the
>>>>>>>> time to start, and subsequently finish, a checkpoint. I'm
>>>>>>>> guessing that this is what you are seeing.
>>>>>>>>
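>>>>>>>> Schematically, in the no-thread mode a rank stuck in a long
>>>>>>>> pure-compute phase cannot service the checkpoint request until
>>>>>>>> its next MPI call (an illustrative snippet only, not from any
>>>>>>>> real application):
>>>>>>>>
>>>>>>>>   for (long i = 0; i < n; i++)
>>>>>>>>       x[i] = a * x[i] + b;  /* no MPI calls: checkpoint waits */
>>>>>>>>   MPI_Allreduce(MPI_IN_PLACE, x, n, MPI_DOUBLE,
>>>>>>>>                 MPI_SUM, MPI_COMM_WORLD);  /* progresses here */
>>>>>>>>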
>>>>>>>> If you configure with the checkpoint thread (add
>>>>>>>> '--enable-mpi-threads --enable-ft-thread' to ./configure) then
>>>>>>>> Open MPI will create a thread that runs with each application
>>>>>>>> process. This thread is fairly lightweight and will make sure
>>>>>>>> that a checkpoint progresses even when the process is not in
>>>>>>>> the Open MPI library.
>>>>>>>>
>>>>>>>> Try enabling the checkpoint thread and see if that helps
>>>>>>>> improve
>>>>>>>> the
>>>>>>>> checkpoint time.
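>>>>>>>>
>>>>>>>> For example, a configure line with the thread support added
>>>>>>>> might look like this (paths are placeholders; adjust for your
>>>>>>>> system):
>>>>>>>>
>>>>>>>>   ./configure --with-ft=cr --with-blcr=/opt/blcr \
>>>>>>>>       --enable-mpi-threads --enable-ft-thread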
>>>>>>> Josh,
>>>>>>> First... please pardon the blunder in my earlier mail.
>>>>>>> Comms-bound apps are the ones taking a while to checkpoint, not
>>>>>>> CPU-bound ones. In any case, I tried configuring with the above
>>>>>>> two configure options, but still no luck on improving
>>>>>>> checkpointing times or on getting checkpoints of larger MPI task
>>>>>>> runs to complete.
>>>>>>>
>>>>>>> It looks like the checkpointing is just hanging. For example, I
>>>>>>> can checkpoint a 2-way comms-bound code (one task on each of two
>>>>>>> nodes) OK. When I ask for a 4-way run on 2 nodes, 30 minutes
>>>>>>> after running ompi-checkpoint on the mpirun PID I only see 1
>>>>>>> ckpt directory with data in it!
>>>>>>>
>>>>>>>
>>>>>>> -bash-2.05b$ pwd
>>>>>>> /home/sharon/ompi_global_snapshot_25400.ckpt/0
>>>>>>> -bash-2.05b$ ls -l *
>>>>>>> opal_snapshot_0.ckpt:
>>>>>>> total 0
>>>>>>>
>>>>>>> opal_snapshot_1.ckpt:
>>>>>>> total 0
>>>>>>>
>>>>>>> opal_snapshot_2.ckpt:
>>>>>>> total 0
>>>>>>>
>>>>>>> opal_snapshot_3.ckpt:
>>>>>>> total 1868
>>>>>>> -rw------- 1 sharon shc-support 1907476 2008-04-29 10:49
>>>>>>> ompi_blcr_context.1850
>>>>>>> -rw-r--r-- 1 sharon shc-support      33 2008-04-29 10:49
>>>>>>> snapshot_meta.data
>>>>>>>
>>>>>>> The file system getting the checkpoints is local. I've tried
>>>>>>> /scratch and others as well.
>>>>>>>
>>>>>>> I can checkpoint some codes (like xhpl) just fine across 8 MPI
>>>>>>> tasks (t nodes), dumping 254M total. Thus, the very long/stuck
>>>>>>> checkpointing seems rather application dependent.
>>>>>>>
>>>>>>> Here's how I configured Open MPI:
>>>>>>>
>>>>>>> ./configure \
>>>>>>>     --prefix=/nfs/ds01/support/sharon/openmpi-1.3a1r18241 \
>>>>>>>     --enable-mpi-threads --enable-ft-thread --with-ft=cr \
>>>>>>>     --enable-shared --enable-mpi-threads=posix \
>>>>>>>     --enable-libgcj-multifile \
>>>>>>>     --enable-languages=c,c++,objc,java,f95,ada \
>>>>>>>     --enable-java-awt=gtk --with-mvapi=/usr/mellanox \
>>>>>>>     --with-blcr=/opt/blcr
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for any further insights you may have.
>>>>>>> Sharon