NFS - lockd hanging

Twice over the last week we have had cases where our application has locked up when locking files on a remote system being accessed with NFS. The NFS client server (running the application) is running 11.23, the NFS host server is running 11.31. When this behavior occurs the lockd process on the NFS host becomes unresponsive. It does not respond to any kill commands (including kill -17 + kill -9), so I am unable to turn on logging to get any info about the cause of the problem and the only way to resolve the issue is to reboot the NFS host server.

When this occurred yesterday I did turn on logging on the client server. The results are below (There was a lot of repetition in the file I tried to extract the unique information). Logging was only turned on for a few minutes and the file is huge (~16MB), so I am hesitant to turn on logging full time on the NFS server. But, I can if that will be the only way to diagnosis the problem if this does happen again.

This post (http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1273299695512+28353475&threadId=1052530) references using the â C option for lockd to solve interoperability problems between 11.23 + 11.31 systems. I did turn this on yesterday (by starting and stopping the statd + lockd processes on the client server) when the problem occurred but it did not fix it (I am hoping it is just because the lockd process on the host server was already hosed). Do the problems we are having sound like they would be caused by this â C option not being turned on? Are there other settings we should look at? Is it impossible to tell without logging from the host server?

Additional Facts:â ¢ We have had this setup between the servers up and running for around 4 months. Everything was fine until a week ago. We did add additional users a 10 days before the issue first occurred. This roughly doubled the number of connections between the servers.â ¢ The NFS client server has two network cards, so the traffic from the client may be coming from two different IP addresses.

Any help would be much appreciated. Let me know if any additional information would be useful. Thanks.

Re: NFS - lockd hanging

Shalom,

HP-UX 11.31, which includes NFS v4 does include a more robust locking mechanism.

This appears to be a user volume issue.

>> We have had this setup between the servers up and running for around 4 months. Everything was fine until a week ago. We did add additional users a 10 days before the issue first occurred. This roughly doubled the number of connections between the servers.<<

I would check the back end server for high water marks with glance/gpm, make sure there are enough file handles, maxuprc has a high enough limit of processes for a single user (perhaps the application user is hitting this limit).

Re: NFS - lockd hanging

Unfortunately we do not have glance on the server. Are you coming to the conclusion that it is a number of users issue based on the fact we increased the number of users on the system recently or is something in the log file indicating that?

Re: NFS - lockd hanging

The logging data you collected from the 11.23 client is not enough to tell me what the problem is. The only thing I can say with any degree of certainty looking at this data is that the rpc.lockd on the NFS server is either not replying to the client's lock requests or is replying to the wrong port number on the client because the client's rpc.lockd never gets the reply.

Here's the only real interesting thing in this logfile:________________________________________

The first couple lines show me there are currently no locks being held by this client because the Granted queue is empty. The Message queue has many entries with a message reply status of NULL. This means the client has sent all these lock requests (and we know they are lock requests because the PROC= 7 maps to NLM_LOCK_MSG in the rpc.lockd protocol) to the server and never received a reply (reply = NULL).

So the client is trying desperately to lock files on the server and the server is either not replying or sending the replies to the wrong place.

The only way to really get a good read on this would be to enable debug rpc.lockd logging on the client and server *AND* collect a network trace during the problem. The lockd logging will tell us what the respective daemons are doing from a locking perspective and the network trace will tell us if the lock requests and replies are being sent to the correct destinations.

You have the correct method of turning on rpc.lockd logging on the 11.23 system (obviously since you sent in debug logging data), but on 11.31 systems it is controlled by a kctune parameter:

# kctune klm_log_level=9

This will enable debug rpc.lockd logging to the console buffer and /var/adm/syslog/syslog.log file.

As for the -C option, there's no harm in adding that to the LOCKD_OPTIONS string of the /etc/rc.config.d/nfsconf file on the 11.23 system. This option doesn't exist on 11.31, so only modify the 11.23 system. That way when rpc.lockd is re-started on the 11.23 system it will have the proper option.

Re: NFS - lockd hanging

Unfortunately this occurred again today. I was able to turn on logging for the client and server. I attempted to use nettl to collect network data, but am not sure if it contains useful data and does contain some sensitive info. Let me know if seeing this data would be useful and I will try to filter it out, or it we should be using a different tool to perform the network trace.

The attached log files are from a time period where the application was hanging. In this situation the server ux01 is the client and ux05 the host (although in some cases ux05 acts as the client and ux01 the server â we have not had problems with this scenario). Again, the client (ux01) is on 11.23 and the server (ux05) 11.31. Is there anything informative to anyone in these files?? I didnâ t have any luck matching up requests and responses.

Below is the fstab files from both the client + server. I donâ t see anything in there about the NFS mounted file systems.

Re: NFS - lockd hanging

One thing that I do notice that is different now that the NFS locking is working again is that there are server records in the log file in syslog.log on the host server (ux05). Example below. Am I correct in assuming that the absence of these messages in the log file when there are problems indicate that the requests are not making it to the server?

Re: NFS - lockd hanging

This is the KLM subsystem logging the lock request processing to syslog as I indicated. When klm_log_level is set to 9 you should see this kind of information whenever the system is dealing with NFS file locks. If you don't see this stuff in the syslog then the system is currently not doing any NFS file locking.

Re: NFS - lockd hanging

I did use that method for enabling logging and there were messages in syslog, but in my uneducated interperation of the log file, they all appeared to be messages where the server was acting as the client instead of the host.

Rebooting the system did fix the problem. After the system was rebooted is when I began seeing messages in syslog where this server seemed to be the host and managing lock requests from the client.

Re: NFS - lockd hanging

We did turn on the logging when we began having problems this morning. Are you saying that if logging is already on when the problem first occurs, you would expect to see some event which causes the lockd process to be unresponsive?

I know this is a very difficult (maybe impossible) question to answer, but I would very much like to avoid this issue occurring again. Based on the information shared already do you have any hunch as to what possible causes of the issue could be? Or any ideas for a way to more gracefully recover without needing to reboot?

If the problem does occur again what would be the most useful tool for gathering network trace data? Nettl, tcpdump, other?

Sorry for being so difficult. I VERY much appreciate the help you have given.

Re: NFS - lockd hanging

> Are you saying that if logging is already > on when the problem first occurs, you> would expect to see some event which > causes the lockd process to be > unresponsive?

Exactly. The vast majority of the file locking issues I've worked on fall into two buckets:

1) Locking never worked2) Locking worked for a while then stopped

In cases where locking has never worked, it's pretty easy to figure out by collecting a debug logfile from both sides and reproducing the problem. It's also very easy to reproduce the problem because locking is broken completely.

Cases where locking works for a while then mysteriously stops are much harder to figure out because it's usually not possible to reproduce them at will. In these situations having the historical data from before the problem began is critical as it helps pinpoint when the trouble began and what sequence of events took the system from a working state to a failing state. This helps us know where to focus our investigation.

> I know this is a very difficult (maybe> impossible) question to answer, but I> would very much like to avoid this issue> occurring again. Based on the information > shared already do you have any hunch as to > what possible causes of the issue could> be?

The data you've provided so far is all post-problem, so all I see is the client sending requests and never getting a response. Not enough information for me to determine root cause.

In the 2nd set of data you attached, I do see the client sending some CANCEL messages, so if rpc.lockd on the 11.23 system is not running with the -C option yet, that's definitely the next thing you should enable.

> Or any ideas for a way to more> gracefully recover without needing to> reboot?

Everything I've seen so far is pointing to the server stops responding. Also, the fact that you reboot the 11.31 system and the problem clears up is more justification of this being a server-side issue.

If you enter the following on the 11.31 server:

# /sbin/init.d/lockmgr stop# /sbin/init.d/lockmgr start

and it doesn't recover gracefully then you may be left with no other recourse than rebooting the server.

> If the problem does occur again what would > be the most useful tool for gathering > network trace data? Nettl, tcpdump, other?

Nettl is fine. It captures data in a format that either netfmt or Wireshark can decode.

Re: NFS - lockd hanging

Because this issue seems to be the result of a build up of requests, we tried to prevent it by happening by restarting the lockd process on the host server. This was done using the command:/sbin/init.d/lockmgr stop/sbin/init.d/lockmgr start

The statd process died, but the lockmgr process did not. After this was done we began having the same issue again where the client could not lock files. The log file did not show anything interesting, other than SERVER REQUESTS stopped showing up in the file. It did still contain CLIENT REQUEST messages.

Does the fact that the kill command caused the process to stop functioning as a host (but did not kill the process) provide any insight to what our problem could be to anyone?

Re: NFS - lockd hanging

> we tried to prevent it by happening by> restarting the lockd process on the host> server. This was done using the command: > /sbin/init.d/lockmgr stop > /sbin/init.d/lockmgr start >> Does the fact that the kill command caused > the process to stop functioning as a host > (but did not kill the process) provide any > insight to what our problem could be to > anyone?

Not me. As I've said, the best way to figure out why this problem happens is to enable logging on the client and server before the problem occurs and then wait for it to happen - on its own, not by killing daemons. That's your best chance to capture the state of both systems when things are working properly and the event that causes one of them to start misbehaving.