Reuti wrote:
> Hi,
>
> On 07.07.2008 at 11:31, Romaric David wrote:
>
>> Pak Lui wrote:
>>> It was fixed at one point in the trunk before v1.3 went official, but
>>> while rolling the code from the gridengine PLM into the rsh PLM code,
>>> this feature was left out because there were some lingering issues
>>> that I didn't resolve and I lost track of it. Sorry, but thanks for
>>> bringing it up. I will need to look at the issue again and reopen
>>> this ticket against v1.3:
>> Ok, so I have to wait for a 1.3 version to work with job suspend, or
>> will it be back-ported to 1.2.6 or 1.2.6 ?
>>
>>> So even if it is the rsh PLM that starts the parallel job under SGE, the
>>> rsh PLM can detect whether the Open MPI job is started under the SGE
>>> Parallel Environment (by checking some SGE env vars) and use the
>>> "qrsh -inherit" command to launch the parallel job the same way as
>>> it did before. You can check by setting an MCA parameter such as "--mca
>>> plm_base_verbose 10" in your mpirun command and looking for the launch
>>> commands that mpirun uses.
>> It looks like shepherd cannot be started, for a reason I haven't found yet.
>> /opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
>> reading exit code from shepherd ... 255
>> [hostname:16745] ----------------------------
>
> You mean with the plain rsh startup, like a loose integration? Isn't a
> proper hostlist necessary in this case, which for other MPI
> implementations is built in the routine defined by start_proc_args? AFAIK
> you can disregard the hostlist only with Open MPI's tight SGE support.

I think he's using the tight integration and not a plain rsh
startup. The output shows that he's using the rsh bundled with
SGE. From my run with a recent trunk, something is indeed broken for
tight integration. I am looking at it now.
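
For anyone following along, the detection described in the quoted text above
can be sketched roughly as follows. This is only an illustration in Python,
not the actual rsh PLM code: the set of variables checked (SGE_ROOT, ARC,
PE_HOSTFILE, JOB_ID) is an assumption about which SGE env vars matter, and
the helper names under_sge_pe and launch_prefix are hypothetical.

```python
import os

# Assumed set of variables that indicate an SGE parallel environment.
SGE_ENV_VARS = ("SGE_ROOT", "ARC", "PE_HOSTFILE", "JOB_ID")


def under_sge_pe(env=None):
    """Return True if every assumed SGE variable is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in SGE_ENV_VARS)


def launch_prefix(host, env=None):
    """Build the remote-launch command prefix for one host (hypothetical helper)."""
    env = os.environ if env is None else env
    if under_sge_pe(env):
        # Tight integration: let SGE start the remote process.
        return ["qrsh", "-inherit", host]
    # Outside an SGE parallel environment, fall back to a plain rsh launch.
    return ["rsh", host]


if __name__ == "__main__":
    # Print the prefix that would be used in the current environment.
    print(" ".join(launch_prefix("node1")))
```

Running this inside and outside an SGE job should show which branch is taken,
which is the same thing the plm_base_verbose output lets you confirm for the
real mpirun.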