Description of problem:
When the RHQ server is busy, the RHQ Agent can stop reporting metrics. This can happen if the database server is overloaded with too many clients. As a result, no further metrics are reported and alerts that depend on them cannot fire.
This is a request that the agent plugin do one of the following:
* Generate an event reporting this condition (and others like it). Unfortunately, this could generate a flood of events across hundreds of agents.
* Mark itself as unavailable. In effect, the agent has stopped working, so this may be simpler to diagnose.
* Have a sub-resource mark itself as unavailable. This is probably ideal.
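To illustrate how the last option could avoid the event flood mentioned in the first, here is a minimal sketch in Python (not RHQ code). It assumes a hypothetical `SendHealthTracker` that counts consecutive failed sends and records an event only when availability changes. The class name, the threshold of 3, and the event strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SendHealthTracker:
    """Hypothetical helper (not an RHQ API): tracks consecutive send failures."""

    failure_threshold: int = 3      # assumed value, would need tuning
    consecutive_failures: int = 0
    available: bool = True
    events: list = field(default_factory=list)

    def record_failure(self, cause: str) -> None:
        self.consecutive_failures += 1
        # Record one event on the UP -> DOWN transition only, so repeated
        # failures do not produce one event each.
        if self.available and self.consecutive_failures >= self.failure_threshold:
            self.available = False
            self.events.append(f"DOWN: {cause}")

    def record_success(self) -> None:
        self.consecutive_failures = 0
        # Record one event on the DOWN -> UP transition.
        if not self.available:
            self.available = True
            self.events.append("UP: sends recovered")


tracker = SendHealthTracker()
for _ in range(5):
    tracker.record_failure("java.util.concurrent.TimeoutException")
tracker.record_success()
print(tracker.events)
```

Five failures followed by one success yield two events in total, one per state change, rather than one per failed send.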
How reproducible:
You can probably induce this by reducing the number of server threads to 1, running a large number of agents, and creating a table lock. A table lock alone may be enough.
You should see:
2013-01-18 19:04:34,075 ERROR [ClientCommandSenderTask Thread #4] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=st11p01ad-ad001.apple.com, rhq.externalizable-strategy=AGENT, rhq.security-token=YWILXUV/JelWuPeWz4KN03nqFKNaBO2ZRvHb4lcuW2a4KmLXxuuH9Daa3CKaO0f0jJo=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeMeasurementReport], targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
This should be detectable.
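As a stopgap outside the agent, the condition can be spotted by scanning a log for the `{ClientCommandSenderTask.send-failed}` marker shown above. The following Python sketch assumes each entry starts with a timestamp in the format seen in the excerpt, followed by the level. The function name `find_send_failures` is hypothetical.

```python
import re
from datetime import datetime

# Matches the start of an ERROR entry that carries the send-failed marker.
SEND_FAILED = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR "
    r".*\{ClientCommandSenderTask\.send-failed\}"
)


def find_send_failures(lines):
    """Return the timestamps of log lines reporting a failed command send."""
    hits = []
    for line in lines:
        match = SEND_FAILED.match(line)
        if match:
            # %f reads the three millisecond digits as a fraction of a second.
            hits.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return hits


sample = (
    "2013-01-18 19:04:34,075 ERROR [ClientCommandSenderTask Thread #4] "
    "(enterprise.communications.command.client.ClientCommandSenderTask)- "
    "{ClientCommandSenderTask.send-failed}Failed"
)
failures = find_send_failures([sample, "2013-01-18 19:04:35,000 INFO unrelated"])
print(failures)
```

Counting these timestamps per time window would give a rough signal that an agent has stopped delivering measurement reports.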