Details

Description

There is a lot of technical debt surrounding SRNs. The SRN code needs to be reviewed and rewritten.

One bug this causes is in the SRN logic that checks whether a metric is being sent at the proper rate and, if it is not, reschedules it. When an agent is down for too long, this check reschedules all resource metrics, which causes major performance degradation in a large HQ instance.
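To illustrate the failure mode, here is a minimal sketch of a rate check of this kind. It is not the HQ implementation: the `Metric` class, the `metrics_to_reschedule` function, and the `tolerance` factor are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    resource_id: int
    interval_s: int       # configured collection interval, in seconds
    last_reported_s: int  # time of the last report received, in seconds


def metrics_to_reschedule(metrics, now_s, tolerance=3):
    """Return the metrics whose last report is older than `tolerance` intervals.

    Hypothetical stand-in for the SRN rate check; `tolerance` is an assumed knob.
    """
    return [
        m for m in metrics
        if now_s - m.last_reported_s > tolerance * m.interval_s
    ]


metrics = [Metric(1, 60, 1000), Metric(1, 300, 1000), Metric(2, 60, 1000)]

# One minute after the last reports, nothing is considered late.
on_time = metrics_to_reschedule(metrics, now_s=1060)

# After the agent has been down for an hour, every metric looks late,
# so a check like this would flag all of them for rescheduling at once.
after_outage = metrics_to_reschedule(metrics, now_s=1000 + 3600)
```

Under these assumptions, a check that cannot tell a late metric from an unreachable agent flags every metric at once, which is the kind of mass reschedule described above.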

The SRN code should also be modernized to use Hibernate and Ehcache rather than maintaining its own internal state.

Activity

Scott Feldstein added a comment - 12/Jan/12 5:47 PM
Resolving.
Please test the following scenarios:
1) create a new platform, ensure that everything is scheduled
2) change metric collection intervals and ensure that they are collecting at the proper rate
3) change metric intervals while an agent is down. When it comes back up it should get the new schedule
4) update metric schedule templates
5) remove a resource and make sure that its metrics are unscheduled (you'd see warnings in the logs if there was an issue here)
At each step, check the EAM_SRN table for the resource that is being rescheduled to make sure that the SRN is incremented by exactly 1 each time you reschedule.
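
The EAM_SRN check above could be scripted along these lines. This is only a sketch: it assumes the SRN values have already been read from EAM_SRN into plain dictionaries keyed by resource id before and after each step, and `srn_violations` is a hypothetical helper, not part of HQ. Treating a resource with no prior row as SRN 0 is also an assumption.

```python
def srn_violations(before, after, rescheduled):
    """Return {resource_id: delta} for rescheduled resources whose SRN
    did not increase by exactly 1.

    before, after: dicts mapping resource id -> SRN value (snapshots of EAM_SRN).
    rescheduled: ids of the resources rescheduled in this step.
    A resource missing from a snapshot is treated as SRN 0 (an assumption).
    """
    violations = {}
    for rid in rescheduled:
        delta = after.get(rid, 0) - before.get(rid, 0)
        if delta != 1:
            violations[rid] = delta
    return violations


before = {101: 4, 102: 7}

# Resource 101 was rescheduled once and its SRN went from 4 to 5.
ok = srn_violations(before, {101: 5, 102: 7}, rescheduled=[101])

# Resource 101 was rescheduled once but its SRN jumped by 2.
bad = srn_violations(before, {101: 6, 102: 7}, rescheduled=[101])
```

An empty result means the step passed; any entry names a resource whose SRN moved by something other than 1.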