SUMMARY: first access to NFS server

My initial question was:
> We're experiencing a baffling nfs problem in our
>beta environment.
> It started a couple of weeks ago, and may coincide with
>events such as a sendmail storm, moving a yp master, or the
>network people cleaning the data closet.
> Everything worked fine before this.
>The server is a 4/470 running 4.1.1, with about 8 SS2 diskful
>clients running 4.1.1. The server is on the other side of
>a repeater for thicknet to thinnet. A client moved to
>the same side of the repeater as the server showed some
>improvement, but not much.
> The users' home directories and Oracle forms are
>mounted from the server.
> After a reboot of a client, login for a user takes
>about 3 minutes. If the same user logs out and back in
>again, the login takes seconds. With the Oracle forms, the first
>time the user executes the command, it takes 5 to 10 minutes
>for the form to come up. Again, if the user exits the Oracle
>form and re-executes the command, the form comes up in
>seconds.
> The collision rate on the server is about 8.57%.
>The clients' nfsstat -rc shows 0 badxid, meaning the server
>is acknowledging the NFS requests, yet successive nfsstat -rc
>commands while the Oracle form is in its 10-minute mode show
>retrans and timeout incrementing.
> NIS seems to be in order.

First of all, I'd like to correct the badxid interpretation
above: 0 badxid with a high timeout means the server is NOT
receiving some of the requests, and the network is suspect. I
must've misread something the first time.
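That rule of thumb can be sketched as a tiny shell helper. The counter names match the `nfsstat -rc` columns, but the classification logic is my own illustration of the rule above, not an official diagnostic:

```shell
#!/bin/sh
# Toy classifier for the badxid/timeout rule above.  Feed it the
# timeout and badxid counters from `nfsstat -rc` on a client; the
# interpretation is the rule of thumb from the text, nothing more.
diagnose() {
    timeouts=$1
    badxid=$2
    if [ "$timeouts" -gt 0 ] && [ "$badxid" -eq 0 ]; then
        echo "network"      # requests are being lost on the wire
    elif [ "$timeouts" -gt 0 ]; then
        echo "slow server"  # server saw the requests but replied late
    else
        echo "ok"
    fi
}

diagnose 120 0   # -> prints "network"
```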
Lots of responses came in, pointing to NIS, rcp and the net.
The problem was finally found to be in the local network:
a bad terminator card in a concentrator box in the wire closet.
The concentrator, as I understand it, multiplexes a bunch of type 1
inputs onto one fiber-optic output for routing to another wire closet.
For some reason only the one beta server was noticeably affected.
The hardest part was convincing the network folks that
there was a problem in the network. This was accomplished by
hooking the affected server up to another drop near the Oracle server,
which, by luck, did not go through the wire closet with the bad
concentrator board. There, first-time logins and Oracle form
accesses were fast. That led to some piecemeal checking by the
network folks and a fix. I should mention that a previously
hooked-up sniffer had failed to show any problems.
Thanks to all who replied; the responses are included
below for your pleasure.

8.57%!! That is horrible. No wonder. This is clearly an Ethernet problem. This
will cause major headaches. The reason it is better the second time is
that much of the data is already cached locally.

A typical collision rate should be below 1%.

>Our collision rate is now around 6%. That's taking the output
>of netstat -i on the server and dividing Collis by Opkts.
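The arithmetic quoted above (Collis divided by Opkts) can be sketched in shell. The counter values here are made up for illustration; substitute the real columns from `netstat -i` on your server:

```shell
# Hypothetical counters standing in for the Opkts and Collis
# columns of `netstat -i` output.
opkts=120000
collis=10284

# Collision rate as a percentage (awk handles the floating point).
awk -v c="$collis" -v o="$opkts" 'BEGIN { printf "%.2f%%\n", c / o * 100 }'
# -> prints 8.57%
```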
----------------------------
from: rhaddick@us.oracle.com

I got tired of the slow login thing, so on all workstations we:

mv /usr/ucb/quota /usr/ucb/quota.old;
ln -s /bin/true /usr/ucb/quota;

Quota is a pain in the.....I told this one to my Sun Software
Support person......As for the Forms issues, I would like to politely
decline on that one. Chances are good, of course, that it's something
completely different...like a comm box setup or something else one
of the Net Wizards has seen before...However, I hope this helps!! :-)

logins could be hanging on NIS binds; similarly the oracle
forms could be waiting to get bound to an NIS server
(for rpc.byname or something like passwd file entries).

just for laughs: copy the password file to a client machine,
and turn off NIS. see if the problem goes away. if it
does, there's a patch to NIS that makes it bind/rebind
a lot faster -- what may be happening is that your NIS
server isn't responding fast enough, and your NIS clients
are trying to rebind every few minutes. the default
rebind timeout/process is *very* slow -- 2-3 minutes in 4.1.1.

If you are running DNS, since you have made a lot of
network changes and moved the NIS master, make sure
that the DNS config files are right, and if you
are using /etc/resolv.conf on the clients, then
check them too.
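For reference, a minimal /etc/resolv.conf looks like the following; the domain and address here are placeholders, not values from this site:

```
domain example.com
nameserver 192.0.2.1
```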

Hmn. Smells like a thinnet problem to me. Maybe a loose T connector
or something. The collision rate looks awfully high. The first accesses
to things are probably slow because you are getting a lot of
re-transmissions. Once the files are cached on the clients, things would
then be much faster. I bet if you crank wsize and rsize down to 1k on
your NFS mounts, you will see better performance, since you won't get killed
if you lose one fragment out of the 5 (or 6) that make up an NFS data packet.
What happens if you move a client to the server side of the repeater, and
then turn the repeater *off*? That should isolate the problem to one
side of the repeater or the other.
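Concretely, the 1 KB transfer-size suggestion corresponds to mount options like these; the server name and paths below are hypothetical:

```
# /etc/fstab entry, SunOS 4.x style -- hypothetical server and paths
server:/export/home  /home  nfs  rw,rsize=1024,wsize=1024  0 0
```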

>I thought so too, and had already cranked rsize, wsize to 1024
>with no effect. I also tried upping timeo with no effect.
>Moving the server around, at one point we had just the server,
>client and one wire between them, was what eventually tracked
>the problem down.
--------------------------------------------------