Hi Jon,

It looks like a kernel page problem. Maybe a file manager or some other
software allocated too many shared memory pages?
This is easy to check by running 'ipcs' on every node.
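A minimal loop for that check might look like the following (node01..node04 are hypothetical host names; substitute your own node list):

```shell
#!/bin/sh
# List shared memory segments ('ipcs -m') on every node.
# node01..node04 are hypothetical host names -- substitute your own.
for host in node01 node02 node03 node04; do
    echo "== $host =="
    ssh "$host" ipcs -m
done
```

If one node shows many leftover segments that no running process owns, that is a good candidate for the leak.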
I saw some strange behaviour there with the kernel used by Scientific Linux
6.2: even after shared memory segments were deleted, it kept remembering
them across reboots.
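Since zone_reclaim_mode apparently flipped on its own on your compute nodes, it may also be worth comparing it everywhere. A sketch (again with hypothetical host names):

```shell
#!/bin/sh
# Print vm.zone_reclaim_mode on each node; the values should agree
# with what you set deliberately.
# node01/node02 are hypothetical host names -- substitute your own.
for host in node01 node02; do
    printf '%s: ' "$host"
    ssh "$host" cat /proc/sys/vm/zone_reclaim_mode
done
# To set it explicitly on a node, and persist it across reboots:
#   sysctl -w vm.zone_reclaim_mode=1
#   echo 'vm.zone_reclaim_mode = 1' >> /etc/sysctl.conf
```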
On Aug 30, 2012, at 9:24 PM, Jon Tegner wrote:
> Hi,
>
> I have this strange error. We run CFD calculations on a small cluster.
> Basically it consists of a bunch of machines connected to a file system.
> The file system consists of 4 servers, CentOS-6.2, ext4 and glusterfs
> (3.2.7) on top. Infiniband is used for interconnect.
>
> For scheduling/resource management we use torque/maui, and typically we
> submit jobs in a torque submit script like:
>
> mpirun -machinefile bla bla bla
>
> However, at one point one of the machines serving the file system went
> down, after spitting out error messages as indicated in
>
> https://bugzilla.redhat.com/show_bug.cgi?id=770545
>
> We used the advice indicated in that link ("sysctl -w
> vm.zone_reclaim_mode=1"), and after that the file servers seem to run
> OK. This happened in the middle of summer, and a few weeks later we
> noticed a few strange things:
>
> 1. We had to change the torque submit script to:
>
> ssh $(hostname) "mpirun -machinefile bla bla bla"
>
> 2. zone_reclaim_mode was set to 1 on all computational nodes (on the
> file servers this was done explicitly, NOT so on the computational
> nodes).
>
> 3. We have seen particularly lousy performance in one of our
> applications.
>
> 4. The command "tail -f file" doesn't get updated properly.
>
> Any help/hints would be greatly appreciated!
>
> Regards,
>
> /jon
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf