LRM bug

The LRM treats operation timeouts as ERROR:s - not just failed operations that give warnings. This violates the meaning of ERROR: messages in the code.

We reserved ERROR: messages for things that the software did not expect - and therefore possibly could not be properly recovered from. In this case, the behavior is perfectly expected and the condition will be properly recovered from. It just means the operation in question failed.

On Mon, Jul 30, 2012 at 10:14:27AM -0600, Alan Robertson wrote:
> The LRM treats operation timeouts as ERROR:s - not just failed
> operations that give warnings. This violates the meaning of ERROR:
> messages in the code.
>
> We reserved ERROR: messages for things that the software did not expect
> - and therefore possibly could not be properly recovered from. In this
> case, the behavior is perfectly expected and the condition will be
> properly recovered from. It just means the operation in question failed.
>
> An sample message:
> ERROR: process_lrm_event: LRM operation agent-da:3_monitor_5000
> (47) Timed Out (timeout=60000ms)
>
> Because of this one message, you can't tell customers "If you ever have
> an ERROR: message, the HA software has failed".
>
> This ought to just be a warning, like any other failed action...

I guess that ERROR is used because resource agents use the same severity when reporting failures they cannot recover from. In this case, the RA won't log anything, so the lrmd does that on its behalf. That seems OK to me. The other option would be to remove the ERROR severity log messages in all RAs, because a resource problem should normally always be recoverable.

On 08/07/2012 08:18 AM, Dejan Muhamedagic wrote:
> Hi Alan,
>
> On Mon, Jul 30, 2012 at 10:14:27AM -0600, Alan Robertson wrote:
>> The LRM treats operation timeouts as ERROR:s - not just failed
>> operations that give warnings. This violates the meaning of ERROR:
>> messages in the code.
>>
>> We reserved ERROR: messages for things that the software did not expect
>> - and therefore possibly could not be properly recovered from. In this
>> case, the behavior is perfectly expected and the condition will be
>> properly recovered from. It just means the operation in question failed.
>>
>> An sample message:
>> ERROR: process_lrm_event: LRM operation agent-da:3_monitor_5000
>> (47) Timed Out (timeout=60000ms)
>>
>> Because of this one message, you can't tell customers "If you ever have
>> an ERROR: message, the HA software has failed".
>>
>> This ought to just be a warning, like any other failed action...
> I guess that ERROR is used because resource agents use the same
> severity when reporting failures they cannot recover from. In
> this case, the RA won't log anything, so the lrmd does that on
> its behalf. That seems OK to me. The other option would be to
> remove the ERROR severity log messages in all RA, because a
> resource problem should normally always be recoverable.

The exceptions that print ERROR: should be relegated to things like "The CRM gave me a command I didn't understand, or referenced a resource that I don't know about" -- and similar things that really shouldn't happen.
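To make the proposed policy concrete, here is a minimal sketch in Python of how an operation outcome might map to a log severity. It is not the crmd or lrmd code: the outcome labels and the `severity_for` helper are hypothetical names chosen for illustration, and the only assumption is the rule argued for above, that anticipated, recoverable failures (including timeouts) are warnings and only unexpected conditions are errors.

```python
import logging

# Hypothetical outcome labels for illustration; these are NOT
# Pacemaker's actual status codes.
EXPECTED_FAILURES = {"failed", "timed_out"}


def severity_for(outcome: str) -> int:
    """Return the log level for an operation outcome under the proposed policy."""
    if outcome == "ok":
        return logging.INFO
    if outcome in EXPECTED_FAILURES:
        # Anticipated and recoverable: the cluster handles it as usual.
        return logging.WARNING
    # Anything else (unknown command, unknown resource, ...) is a
    # condition the software did not expect.
    return logging.ERROR
```

Under this sketch a timed-out monitor would be logged like any other failed action, while something like an unrecognized command would still be logged as an error.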

On Tue, Aug 07, 2012 at 11:04:22PM -0600, Alan Robertson wrote:
> On 08/07/2012 08:18 AM, Dejan Muhamedagic wrote:
> > Hi Alan,
> >
> > On Mon, Jul 30, 2012 at 10:14:27AM -0600, Alan Robertson wrote:
> >> The LRM treats operation timeouts as ERROR:s - not just failed
> >> operations that give warnings. This violates the meaning of ERROR:
> >> messages in the code.
> >>
> >> We reserved ERROR: messages for things that the software did not expect
> >> - and therefore possibly could not be properly recovered from. In this
> >> case, the behavior is perfectly expected and the condition will be
> >> properly recovered from. It just means the operation in question failed.
> >>
> >> An sample message:
> >> ERROR: process_lrm_event: LRM operation agent-da:3_monitor_5000
> >> (47) Timed Out (timeout=60000ms)
> >>
> >> Because of this one message, you can't tell customers "If you ever have
> >> an ERROR: message, the HA software has failed".
> >>
> >> This ought to just be a warning, like any other failed action...
> > I guess that ERROR is used because resource agents use the same
> > severity when reporting failures they cannot recover from. In
> > this case, the RA won't log anything, so the lrmd does that on
> > its behalf. That seems OK to me. The other option would be to
> > remove the ERROR severity log messages in all RA, because a
> > resource problem should normally always be recoverable.
> The exceptions that print ERROR: should be relegated to things like "The
> CRM gave me a command I didn't understand, or referenced a resource that
> I don't know about" -- and similar things that really shouldn't happen.
>
> Or that's how it seems to me anyway...

Turns out that this comes from the crmd, not the lrmd. The lrmd actually does issue just a warning.

I see your point, though I'd still be reluctant not to log an error somewhere, because all other resource errors are logged at that severity.