Jorge,
are you using a hardcoded /etc/hosts on the machines, or are you relying on DNS
(which might sometimes be unavailable)? Do the machines have enough memory for
your job, or did they start to swap?
Which MPI library and version are you using? - Reuti
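PS: A rough way to check both points on each node is a small Python sketch like the one below. It is only an illustration: the function names are made up here, and the host names you pass in should be the ones from your own machines file. It resolves every name several times and counts the failures, and it computes the swap in use from the contents of /proc/meminfo.

```python
import socket
import time


def count_resolution_failures(hosts, attempts=5, delay=1.0):
    """Resolve each host name several times; return {host: failed_attempts}."""
    failures = {host: 0 for host in hosts}
    for attempt in range(attempts):
        for host in hosts:
            try:
                # Uses the system resolver, so /etc/hosts and DNS are both
                # consulted in the order configured on the node.
                socket.getaddrinfo(host, None)
            except socket.gaierror:
                failures[host] += 1
        if attempt < attempts - 1:
            time.sleep(delay)
    return failures


def swap_used_kb(meminfo_text):
    """Return the swap in use (in kB) given the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values["SwapTotal"] - values["SwapFree"]
```

Calling count_resolution_failures() with your node names a few times during the day should show whether name lookups fail intermittently, and swap_used_kb(open("/proc/meminfo").read()) on a compute node while the job runs should show whether it has started to swap.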
Quoting jorgegg at sas.upenn.edu:
> Hi,
> I'm running a Fortran 90 code on a Linux cluster with 7 nodes (I actually only
> use 6) using the MPI library. I can change the "size" of the program (meaning
> the number of operations to be performed, although all operations are the
> same).
> The problem is that when I try to run the program using mpirun, sometimes
> (most of the time, but not always) the program won't start running and I get
> the following message (the name of the cluster is max, and it's not always
> node number 2):
> p0_20621: p4_error: Timeout in making connection to remote process on maxsl2-d: 0
> bm_list_20622: p4_error: interrupt SIGINT: 2
> Other times it runs fine, even with the same number of operations! It's
> not the number of people using the cluster, because most of the time it's only
> me. This problem also sometimes arises after 3 or 4 hours of running the
> program.
> Do you have any idea why this happens? I estimate that with this number of
> nodes my code will need around 3 weeks to finish, so I really need to be able
> to rely on the computers to keep communicating.
> Thank you very much, and please let me know if I didn't explain myself
> clearly.
> Jorge