On Tue, 9 Dec 2008, David Wolfskill wrote:
> On Tue, Dec 02, 2008 at 04:15:38PM -0800, David Wolfskill wrote:
>> I seem to have a fairly- (though not deterministically so) reproducible
>> mode of failure with an NFS-mounted directory hierarchy: An attempt to
>> traverse a "sufficiently large" hierarchy (e.g., via "tar zcpf" or "rm
>> -fr") will fail to "visit" some subdirectories, typically apparently
>> acting as if the subdirectories in question do not actually exist
>> (despite the names having been returned in the output of a previous
>> readdir()).
>> ...
>
> I was able to reproduce the external symptoms of the failure running
> CURRENT as of yesterday, using "rm -fr" of a copy of a recent
> /usr/ports hierarchy on an NFS-mounted file system as a test case.
> However, I believe the mechanism may be a bit different -- while
> still being other than what I would expect.
>
> One aspect in which the externally-observable symptoms were different
> (under CURRENT, vs. RELENG_7) is that under CURRENT, once the error
> condition occurred, the NFS client machine was in a state where it
> merely kept repeating
>
>   nfs server pid848 at fbsd-build:/volume: not responding
>
> until I logged in as root & rebooted it.

The different behaviour for -CURRENT could be the newer RPC layer that
was recently introduced, but that doesn't explain the basic problem.
All I can think of is to ask the obvious question. "Are you using
interruptible or soft mounts?" If so, switch to hard mounts and see
if the problem goes away. (imho, neither interruptible nor soft mounts
are a good idea. You can use a forced dismount if there is a crashed
NFS server that isn't coming back anytime soon.)
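
A rough sketch for checking which mount options are in play, assuming the
mounts are listed in /etc/fstab with the usual six whitespace-separated
fields (the options are the fourth). The function name risky_nfs_mounts,
the /mnt and /other mount points and the second server path are made up
for illustration; only fbsd-build:/volume comes from the report above.

```python
def risky_nfs_mounts(fstab_text):
    """Return (mount_point, flagged_options) for NFS entries using soft or intr."""
    flagged = []
    for raw in fstab_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        fields = line.split()
        # Expect: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "nfs":
            continue
        options = fields[3].split(",")
        bad = [opt for opt in options if opt in ("soft", "intr")]
        if bad:
            flagged.append((fields[1], bad))
    return flagged


if __name__ == "__main__":
    # Hypothetical fstab contents, for illustration only.
    sample = """\
# Device              Mountpoint  FStype  Options        Dump  Pass#
fbsd-build:/volume    /mnt        nfs     rw,soft,intr   0     0
fbsd-build:/other     /other      nfs     rw             0     0
"""
    for mount_point, bad in risky_nfs_mounts(sample):
        print(mount_point, "uses", ",".join(bad))
```

Any entry that comes back flagged is a candidate for retrying the test
with those options removed, since a mount is hard unless soft is given.
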
If you are getting this with hard mounts, I'm afraid I have no idea
what the problem is.

rick