Update-
I have my workaround (exiting the extra, conflicting epilogue scripts)
working properly now. I still consider this a serious bug, since I
wouldn't have had to go through this runaround otherwise. I'm aware of
a few other people who are negatively impacted by this as well. I'll
post a bug report when I can.
Jeremy
On 3/15/2010 5:38 PM, Jeremy Enos wrote:
> This thread seemed to die here, but my problem has not.
>
> If I understand the described design purpose correctly (a previous
> epilogue attempt failed, so it tries again), then no two epilogues
> for the same job should ever run simultaneously. Yet they do. So
> perhaps I'm seeing a different issue than the intentional retry
> logic described.
>
> I've also tried, unsuccessfully, to "lock" the first epilogue in
> place and abort any later one if that lock is already held. I'm
> doing this via the lockfile utility, and for whatever reason it's
> not effective in preventing multiple epilogues from launching
> simultaneously for the same job (roughly as sketched below).
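>
> A minimal sketch of what I'm attempting (the lock path is
> illustrative; lockfile here is the procmail utility):
>
>     #!/bin/bash
>     # pbs_mom passes the job id as the epilogue's first argument
>     jobid=$1
>     lock=/tmp/epilogue.${jobid}.lock
>
>     # -r 0 means no retries: fail immediately if the lock file
>     # already exists instead of waiting for it
>     if ! lockfile -r 0 "$lock"; then
>         exit 0    # another epilogue for this job holds the lock
>     fi
>     trap 'rm -f "$lock"' EXIT
>
>     # ... GPU health check and the rest of the epilogue go here ...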
>
> Let me explain why it's important to me that this doesn't happen: in
> the epilogue, I run a health check on a GPU resource which reports a
> failure condition if the device is inaccessible. I'm getting loads
> of false positive detections simply because the device /is/
> inaccessible while another epilogue is already running a health
> check. I can't seem to get effective logic in place to prevent this
> from happening (I already check ps info for epilogue processes
> launched against the given jobid, roughly the check sketched below,
> and it's only partially effective). My only option is to disable my
> health check altogether to prevent the false positive detections
> caused by conflicting epilogues.
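>
> The ps-based check, approximately (a sketch; the pattern assumes the
> script's command line contains "epilogue" followed by the jobid):
>
>     # bail out if another epilogue for this job id is already
>     # running; exclude our own PID from the matches
>     jobid=$1
>     if pgrep -f "epilogue.* ${jobid}" | grep -qv "^$$\$"; then
>         exit 0
>     fi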
>
> I want and expect a single epilogue (or epilogue.parallel) instance
> per job per node, as the documentation describes. Why is this
> behavior not considered a bug?
>
> Jeremy
>
> On 2/3/2010 5:49 PM, Jeremy Enos wrote:
>> Ok- so there is design behind it. I have two epilogues trampling
>> each other. What gives Torque the indication that a job exit
>> failed? In other words, what constitutes a job exit failure?
>> Perhaps that's where I should be looking to correct this.
>> thx-
>> Jeremy
>>
>> On 2/3/2010 1:28 PM, Garrick Staples wrote:
>>> On Wed, Feb 03, 2010 at 03:59:48AM -0600, Jeremy Enos alleged:
>>>> that I shouldn't have to. Unless of course this behavior is by design
>>>> and not an oversight, and if that's the case- I'd be curious to know why.
>>> Because the previous job exit failed and it needs to be done again.