The client does its own read-ahead. Hopefully this doesn't interact
badly with other read-ahead mechanisms. The client has a lot of control
over when the server should write data to disk, but the client doesn't have
the information to make this decision wisely.

Application IO modes: buffered, O_DIRECT, O_SYNC, or async IO.

Some applications buffer in user space.

Top 10 rules for tuning clients:

First, tune your network. How fast is the network? What is
slow? Is slow really slow? Think of the entire network path between the two
computers.
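A first look at that path can be as simple as the commands below; `nfsserver` and `eth0` are placeholder names for your server and client NIC.

```shell
# Sketch: basic checks of the path between client and server.
ping -c 5 nfsserver                    # round-trip latency and packet loss
traceroute nfsserver                   # every hop between the two machines
ethtool eth0 | grep -E 'Speed|Duplex'  # negotiated link speed and duplex
```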

Use NFSv3, not NFSv2 or NFSv4 (though NFSv4 mostly works).

Use TCP; don't use UDP under any circumstances. Data corruption.
Invisible data corruption.
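Both of these rules land in the mount options. A sketch, with the server name and export path made up:

```shell
# vers=3 forces NFSv3; proto=tcp rules out UDP and its silent corruption.
# server:/export and /mnt/nfs are placeholder names.
mount -t nfs -o vers=3,proto=tcp server:/export /mnt/nfs
```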

Use the maximum transfer size supported. On Linux this is negotiated
by default. Too large is bad, but most clients/servers don't support
transfers that large anyway.
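One way to ask for a large transfer size and see what was actually negotiated; the sizes here are illustrative, and the server/export names are placeholders.

```shell
# Request a large transfer size; client and server negotiate down
# to what both sides support.
mount -t nfs -o vers=3,proto=tcp,rsize=1048576,wsize=1048576 \
    server:/export /mnt/nfs

# Check the rsize/wsize that were actually negotiated:
grep nfs /proc/mounts
```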

Tune the filesystem. Tune for NFS, not local workloads; don't tune for
the application as if it were run locally. Some tunings must be done at mkfs
time. Be careful with partitioning and volume-manager arrangement. Use noatime
or relatime (modern kernels); atime has no effect on the client (disputed by
Matthew Wallis), and noatime may break some applications. XFS may need special
care: old code used bad defaults. Optimise log IO. Choose allocation groups.
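For XFS, the mkfs-time and mount-time knobs mentioned above might look like this; the device name and values are purely illustrative, not recommendations.

```shell
# Log size and allocation-group count must be chosen at mkfs time.
mkfs.xfs -l size=128m -d agcount=32 /dev/sdb1

# noatime (or relatime on modern kernels) at mount time.
mount -o noatime /dev/sdb1 /export
```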

Tune the VM. Push dirty changes before the client decides to do it.
However, your results may vary. Do not reduce XXX value.

Tune the PCI cards. A card can slow down cards on other busses. Two
cards on one bus can slow down the bus. Sticking a card in an empty slot may
slow things down excessively. NUMA effects. Fibre Channel multipathing:
in general, don't run all requests down one channel;
use parallel paths if possible. Know your hardware.
Tune the network. Speed, duplex, errors. NFS traffic on the server is
almost always bulk traffic. Bind NIC interrupts to CPUs to keep device
cachelines hot for one CPU; ifconfig tells you the IRQ, and irqbalance doesn't
know your hardware. Increase socket buffer sizes (the sysctls have changed).
TSO, TCP Segmentation Offload: TCP segmentation work is done by the hardware,
not the software; the card chops the data up into TCP segments. Depending on
the card, it might not be enabled by default; use ethtool -K ethN tso on.
RSS, Receive Side Scaling: splits the interrupt load across multiple CPUs.
A new thing; may need fiddling. Hardware checksumming might be off by default.
Enable IPoIB by default. Fix default ARP. Bonding and NUMA.
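A few of the knobs above, as commands; interface names, IRQ numbers, and buffer sizes are examples, not recommendations.

```shell
ethtool eth0                         # link speed and duplex
ethtool -S eth0 | grep -i err        # per-NIC error counters
ethtool -K eth0 tso on               # enable TCP Segmentation Offload

# Larger socket buffers (illustrative values).
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304

# Pin this NIC's IRQ (24 here, as an example) to CPU 1,
# keeping its device cachelines hot on one CPU.
echo 2 > /proc/irq/24/smp_affinity
```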

Think about the async export option. It can put newly written data at risk,
but it is faster for some workloads. Good for the client, bad for the server:
it tells the server to lie about when data is actually written to disk.

Use no_subtree_check. Subtree checking has no benefit except consuming excess
CPU cycles.
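Both export options go in /etc/exports; the path and client network below are made up.

```
# /etc/exports sketch (illustrative path, host range and options):
#   async            - server acknowledges writes before they reach disk
#   no_subtree_check - skip the subtree check entirely
/export  192.168.0.0/24(rw,async,no_subtree_check)
```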

Use more server threads. The default is way too low. The stats in /proc
are wrong and difficult to understand. Server structures are sized at
server startup. Use too many, e.g. 128; you can never have too many.
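Raising the thread count might look like this; the persistent setting is distro-dependent, so the file shown is just one common location.

```shell
# Bump the nfsd thread count at runtime (128 is the figure used above).
rpc.nfsd 128

# Persistently, on some distros, e.g. in /etc/sysconfig/nfs:
#   RPCNFSDCOUNT=128

# Confirm the running count:
cat /proc/fs/nfsd/threads
```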

Measurement Tips & Tricks

Application developers:

Use large buffers and large IO; don't try to be too clever. Don't
use O_DIRECT: it sometimes makes things faster locally,
but never remotely.
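A minimal sketch of the large-IO advice using dd; the temp file below stands in for a file on an NFS mount (a hypothetical path in practice).

```shell
# Stream data in large sequential writes: 16 MiB total, 1 MiB per write().
# mktemp stands in for a file on your NFS mount.
target=$(mktemp)
dd if=/dev/zero of="$target" bs=1M count=16 status=none
bytes=$(stat -c %s "$target")
echo "$bytes bytes written"
rm -f "$target"
```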