Please keep traffic on the list for the benefit of others.

No, this is something I cannot share, because I cannot 'direct'
you to the references. It is vendor data, based on their own
experimentation and testing, both in house and at customer sites.

I will try to get more information on this from the engineers at the
vendor, and then post it here.

hv

Ariel Sabiguero wrote:

> This thing is very interesting.
> Is the research available somewhere?
> I would appreciate it if you could help me access that study.
>
> Regards.
>
> Ariel
>
> H.Vidal, Jr. wrote:
>
>> Which filesystem do you use?
>>
>> In recent research into NAS appliances, we found out
>> that things like ext2 and ext3 really don't play so nice
>> with NFS, especially under older 2.4 kernels. However,
>> in general, our vendor warned us to keep away from large,
>> multi-user shares via NFS under ext<2|3> filesystems.
>>
>> Don't know if this is helpful, perhaps just a data point.....
>>
>> hv
>>
>> David Mathog wrote:
>>
>>> I'm seeing a serious NFS problem with MDK 10.0
>>> (using a 2.6.8-1 kernel.org kernel). All of the
>>> machines run the same OS version.
>>>
>>> In a 20 node cluster each node NFS mounts /u1 from
>>> the master. They run a calculation and generate
>>> a file in /tmp of about 26000 lines coming to
>>> 1.3Mb (both the number of lines and total size
>>> vary a little). When it completes the process on
>>> each end node does:
>>>
>>>   mv /tmp/blah.$NODENAME .
>>>   mv -f /tmp/blah.$NODENAME /tmp/SAVEblah
>>>
>>> The home directory (".") is a couple of
>>> levels down under /u1, so this effectively performs
>>> a network copy from /tmp on the compute node to /u1
>>> on the master node. The copies are largely asynchronous
>>> since the end nodes complete at various times.
>>>
>>> On the master node there are occasionally
>>> (defined as: 1 bad line, out of 20 files, every
>>> 3rd or 4th run) a very long bad line.
>>>
>>> Here are four lines from the original file on /tmp:
>>>
>>> '4827135'=='-22004070' (3254 9815 3391 9675) 22
>>> '4827135'=='-22004070' (75050 11805 75081 11774) 0
>>> '4827086'=='-22004070' (79588 9817 79809 9594) 28
>>> '4827086'=='-22004070' (34069 11794 34308 11555) 34
>>>
>>> Here are the four lines from the copy on /u1 .
>>>
>>> '4827135'=='-22004070' (3254 9815 3391 9675) 22
>>> '4827135'=='-22004070' (75050 11805 75081 11774) 0
>>> '4827086'=='-22004070' (79588 9817 798<NUL>(MANY times)<NUL>1
>>> '4156131'=='+22004070' (58122 9687 58250 9818) 11
>>>
>>> The final line on /u1 does appear in /tmp, but much, much
>>> farther into the file. I very carefully cut out the missing
>>> text from the original file, pasted it into a new file, and found:
>>>
>>> % wc deleted.txt
>>>     642    3849   32769 deleted.txt
>>>
>>> So it looks like a block of 32768 bytes was lost
>>> (+1 probably for an extra EOL in my deleted.txt file)
>>> during the mv operation and all bytes replaced
>>> with <NUL>. On repeated runs on the same data (same
>>> output files each time) the problem line never occurs
>>> twice in the same place, and it hops from node to node,
>>> suggesting that it's a rare event somewhere in the
>>> data transport (mv) operation.
>>>
>>> This is very, very, VERY bad.
>>>
>>> No relevant messages show up in /var/log/messages.
>>> /u1 is /dev/sde1 and smartctl -a on that device shows
>>> no errors. On the master /u1 is in /etc/fstab as:
>>>
>>> LABEL=usrdisk /u1 ext2 defaults,quota 1 2
>>>
>>> and is exported as:
>>>
>>> /u1 *.cluster(rw,no_root_squash)
>>>
>>> Has anybody else seen this bug?
>>>
>>> Is there a patch for it? Possibly relevant software:
>>>
>>> coreutils-5.1.2-1mdk          #/bin/mv
>>> nfs-utils-clients-1.0.6-1mdk  #nfs client
>>> kernel 2.6.8-1                #kernel.org
>>>
>>> Thanks,
>>>
>>> David Mathog
>>> mathog at caltech.edu
>>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>> ------------------------------------------------------------------------
>>
>> Subject: [Beowulf] serious NFS problem on Mandrake 10.0
>> From: "David Mathog" <mathog at mendel.bio.caltech.edu>
>> Date: Fri, 03 Dec 2004 13:05:01 -0800
>> To: beowulf at beowulf.org
>>
-------------- next part --------------
An embedded message was scrubbed...
From: Ariel Sabiguero <asabigue at irisa.fr>
Subject: Re: [Fwd: [Beowulf] serious NFS problem on Mandrake 10.0]
Date: Sat, 04 Dec 2004 13:54:29 +0100
Size: 19257
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20041204/7d635a70/attachment.mht>
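
The symptom David describes in the quoted message (a block of 32768
bytes in the copy replaced with NULs) can be checked for mechanically
instead of by eye. What follows is only a minimal sketch, not anything
used in the thread: the function names, the 512-byte threshold, and
the example path are all hypothetical, and it assumes the output files
are small enough (about 1.3 MB here) to read into memory in one go.

```python
import re

BLOCK = 32768  # size of the lost block reported in the thread


def find_nul_runs(data, min_len=512):
    """Return (offset, length) for each run of at least min_len NUL bytes.

    min_len is an arbitrary threshold (an assumption, not from the thread)
    chosen so that an occasional stray NUL is not reported.
    """
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def scan_file(path, min_len=512):
    """Read the whole file at path and report its long NUL runs."""
    with open(path, "rb") as f:
        return find_nul_runs(f.read(), min_len)


def matches_reported_size(run):
    """True if the run is exactly as long as the block reported in the thread."""
    _offset, length = run
    return length == BLOCK
```

Calling scan_file() on each copy under /u1 after a run (for example
scan_file("blah.node07"), a made-up file name) returns an empty list
for a clean file and one (offset, length) pair per damaged region
otherwise. It only detects NUL-filled holes; it says nothing about
which layer (mv, the NFS client, the server, or the filesystem) lost
the data.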