Pinned topic: network congestion issues

2012-12-15T10:31:45Z

In a cluster we are, from time to time, observing relatively large waiters of this kind:
0x2AAAAC10BE60 waiting 2.333560000 seconds, NSDThread: on ThCond 0x2AAAB038EB18 (0x2AAAB038EB18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg'

When this happens, we observe a large number of TCP retransmissions from the servers to the clients. We have seen that this happens when the (1GE) NICs of the clients saturate (due to GPFS traffic). There is no corresponding saturation on the server side (10GE), nor in the intermediate network. So it looks like the problem is simply saturation of the client NIC interfaces.
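To put rough numbers on this asymmetry, here is a back-of-the-envelope sketch (link speeds, overhead factor, and offered load are illustrative assumptions, not measurements from this cluster) of why 10GE servers can overwhelm a 1GE client NIC and trigger retransmissions:

```python
# Back-of-the-envelope check: can a 1GE client NIC absorb what the
# 10GE server side can offer? All figures are illustrative assumptions.

GBIT_BPS = 1e9 / 8  # bytes per second in one gigabit per second

client_link_Bps = 1 * GBIT_BPS   # 1GE client NIC
protocol_overhead = 0.06         # rough Ethernet/IP/TCP framing overhead

# Usable payload rate on the client link.
client_payload_Bps = client_link_Bps * (1 - protocol_overhead)

# If the NSD servers collectively offer more than this to one client,
# the client NIC (or its switch port) must drop frames, and the drops
# show up as TCP retransmissions on the server side.
offered_Bps = 3 * 100e6  # e.g. three NSD servers each sending 100 MB/s

saturated = offered_Bps > client_payload_Bps
print(f"client payload capacity: {client_payload_Bps / 1e6:.1f} MB/s")
print(f"offered load:            {offered_Bps / 1e6:.1f} MB/s")
print("client NIC saturated:", saturated)
```

The servers and the intermediate network never come close to their own limits in this scenario, which matches the observation that only the client links saturate.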

I tried to tweak maxMBpS, but whatever value I set is completely ineffective. One detail: the client node mounts the filesystem from a remote cluster. Does this parameter also work when the filesystem is mounted from a remote cluster?

Re: network congestion issues

I tried to tweak maxMBpS, but whatever value I set is completely ineffective. One detail: the client node mounts the filesystem from a remote cluster. Does this parameter also work when the filesystem is mounted from a remote cluster?

Until now I have only worried about maxMBpS the other way around: it was too low and prevented GPFS from exploiting the hardware capabilities (on 10G and InfiniBand networks). As a result, my I/O performance was lower than we expected/desired.

My understanding is that it should reflect the available hardware bandwidth, and that GPFS uses it to tune its internals for that bandwidth.
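As a rough intuition for what "tuning internals for a bandwidth" means, Little's law relates a target bandwidth to the number of block-sized I/Os that must be kept in flight. This is a toy model, not GPFS's actual internal formula; block size and latency below are assumed example values:

```python
# Toy model (NOT GPFS's real formula): Little's law says that to sustain
# a target bandwidth B with per-I/O latency L and block size S, you need
# roughly B * L / S concurrent I/Os in flight. maxMBpS plays a role in
# this spirit for sizing prefetch/write-behind aggressiveness.

def blocks_in_flight(max_mbps, block_size_mb, io_latency_s):
    """Concurrent block-sized I/Os needed to sustain max_mbps MB/s."""
    return max_mbps * io_latency_s / block_size_mb

# Illustrative numbers: 1 MB blocks, 10 ms per I/O round trip.
print(blocks_in_flight(max_mbps=150, block_size_mb=1, io_latency_s=0.01))   # ~1.5
print(blocks_in_flight(max_mbps=1250, block_size_mb=1, io_latency_s=0.01))  # ~12.5
```

A value far below the real hardware bandwidth starves the pipeline (too few I/Os in flight), which is the "too low" symptom described above; it is less obvious that raising it would throttle an already saturated client link.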

In your case you may also want to see whether you can optimize the settings of the IP stack.
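On the IP-stack side, one common sanity check is that the TCP socket buffers can hold at least a bandwidth-delay product's worth of data (on Linux the relevant knobs are net.ipv4.tcp_rmem/tcp_wmem and net.core.rmem_max/wmem_max). A minimal sketch of the arithmetic, with link speed and RTT as assumed example values:

```python
# Bandwidth-delay product (BDP): the amount of data that can be "in the
# pipe" between sender and receiver. TCP send/receive buffers smaller
# than the BDP cap throughput. Values below are example assumptions.

def bdp_bytes(link_bits_per_s, rtt_s):
    """Bandwidth-delay product in bytes for a link speed and round-trip time."""
    return round(link_bits_per_s / 8 * rtt_s)

# A 1GE client with a 0.5 ms LAN round trip:
print(bdp_bytes(1e9, 0.0005))   # 62500 bytes
# A 10GE server with the same RTT needs ten times as much buffer:
print(bdp_bytes(10e9, 0.0005))  # 625000 bytes
```

Note that larger buffers only help the undersized-buffer case; if the client NIC itself is saturated, they mostly add queueing rather than throughput.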