OpsMgr – Is alert reporting reliable? Reprise with some light.

alerts present in the live database, but not present in the data warehouse (anymore)

alerts present in the data warehouse with an inconsistent resolution state (compared to the data in the live database)

alerts present in the data warehouse with a not closed resolution state but not present in the live database

My conclusions have been: while you can have statistical reporting for alerts you cannot have a reliable way to report on alerts resolution status. Obviously if you’re using the data warehouse.

Alerts reporting is poorly implemented in the base management packs, but my customers are increasingly asking me more alert reporting. The want answers to questions such as:

give me a list of the alerts on CPU usage on these servers in this time period with an approximate cumulative duration during the observed period

give me a list of agents that are currently experiencing this kind of issue (i.e. where the alert is still open)

give me a stats of alerts count per resolution state per day

In this second post I share my findings and possible workarounds to address some of the issues I mentioned, so that, with some approximation, you can have alerts status reporting.

Returning to our issues list, for #1 I found nothing but to modify the grooming stored procedure for alerts, but since this is not supported and I didn’t investigate all the possible implications I will just say: we need to live with it, they are long standing alerts, they’ve been raised before the grooming time window and after all this time they’re still there.

For #2, this issue pops up from what I consider a bug in the AlertResolutionState recording in the DW, basically we have two or more different resolution state changes recorded at exactly the same time (read part1 for more information). In my environment this accounted to about 5% of the mismatched alerts. Here I needed to find a tie breaker, I plainly assumed that when we have two or more resolution state changes at the same time the higher resolution state wins. This is an approximation you need to be aware of. It is not perfect but I didn’t find anything more suitable.

#3 took me a lot of time, it turned out it is caused by deleted entities. Every time an entity is deleted in the live database (this is normal and can have several causes: agent removal, discovery flip flopping, normal objects lifecycle) it gets marked deleted in the live database and in the DW, but the active alerts are not closed, in the DW they’re just froze at their current status. Considering the fact that they can flip flop multiple times and that when I ask for alerts in a given time window I must return all the alerts generated in that time frame and their status at the end of the observation period, it took me a lot of time to write the proper SQL query. But, and this is the good news, I get rid of it and now I can share a query that will return exactly the resolution state of any given alert, well mostly. The key was, if the managed entity is not active at the end of the observation period and we have alerts for that managed entity with a resolution state different form "closed" they must be considered closed (since the managed entity is no longer valid).

My conclusions after this second shot are: we can have alert reporting even on resolution states if you can live with some approximation, notably:

you won’t get alerts raised before the alert grooming period in the DW (365 days by default if I can recall properly)

for a small percentage of the alerts, the resolution state will be approximated to the higher resolution state for that alert in that time window.

And, in the end, here is the TSQL to return the alerts resolution state at the end of a specified time window, the temporary table can be used to join the vAlert view to build your own custom reporting:

declare @StartDate datetime
declare @EndDate datetime
Set @StartDate='2011-10-07 15:09:00'
Set @EndDate='2011-10-15 15:09:00'
-- to make things clearer let's use a temporary table
-- it will contain the resolution state information for an alert at the end of the proposed time window
declare @AlertState table
(
AlertGuid uniqueidentifier,
StateSetDateTime datetime,
TimeInStateSeconds int,
StateSetByUserId nvarchar(256),
ResolutionState tinyint,
TimeFromRaisedSeconds int
)
insert into @AlertState
select DISTINCT ARS.AlertGuid,
ARS.StateSetDateTime,
ARS.TimeInStateSeconds,
MAX(ARS.StateSetByUserId) 'StateSetByUserId', -- since we can have conflicting updates i.e. updates at exactly the same time we need to have a tie breaker here. just take the MAX
/* the following case basically says if the managed entity is still valid at the end of the time window (ToDateTime null or subsequent to the time window end @EndDate)
* use the resolution state from the vAlertResolutionState view; if not set the resolution state to closed.
* The check is on the ToDateTime field, the field can be null if the managed entity is still valid.
* Every managed entity can have multiple rows (i.e. flip flops) I will take care of this in the where clause.
*/
MAX(CASE WHEN ISNULL(memg.ToDateTime,@EndDate) >= @EndDate Then ResolutionState ELSE 255 END) As 'ResolutionState',
ARS.TimeFromRaisedSeconds
from Alert.vAlertResolutionState ARS
inner join (
-- this table returns for every alert the last StateSetDateTime for the given timewindow
select MAX(R.StateSetDateTime) As 'StateSetDateTime',
R.AlertGuid from Alert.vAlertResolutionState R WITH(NOLOCK)
where StateSetDateTime <= @EndDate
group by R.AlertGuid) AL1 on AL1.AlertGuid=ARS.AlertGuid AND AL1.StateSetDateTime=ARS.StateSetDateTime
inner join Alert.vAlert A on A.AlertGuid=AL1.AlertGuid
inner join vManagedEntityManagementGroup memg on memg.ManagedEntityRowId=A.ManagedEntityRowId
where A.RaisedDateTime between @StartDate and @EndDate AND ISNULL(memg.ToDateTime,@EndDate) >= @EndDate
group by ARS.StateSetDateTime, ARS.AlertGuid, ARS.TimeInStateSeconds, ARS.TimeFromRaisedSeconds

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Have you tried simple availability reporting? When reqularly checking availability, we see a decent percentage of our servers being reported as offline on a consistent basis. Are your results simlar? The agents report as healthly, the system reports as down.

Hi Doug,
in my experience availability reports are reliable, the only issue is often the health model is cluttered by too many bad MPs so, based on the class you’re reporting availability to you can have mixed result. A few advice, be as specific as you can (for example don’t use windows computer class rather use Windows OS class, disable all the monitors that are always “red”).