Search

The problem

If you run availability reports or performance reports with a aggregation type of daily or hourly the reports are empty. This problem is described a lot on the web. And I have also written a couple of blog post how to fix this issue. But as you know we are using scom to monitor stuff , so why not monitor this aggregation processing and alert if a processing delay is occurring. ? That’s our mission today….

Analyze

Using SQL enterprise manager and a SQL query on the data warehouse DB we can read out the aggregation processing. This query looks like this:

Select AggregationTypeId, Datasetid, (Select SchemaName From StandardDataSet Where Datasetid = StandardDataSetAggregationHistory.Datasetid) , COUNT(*) as ‘Count’, MIN(AggregationDateTime) as ‘First’, MAX(AggregationDateTime) as ‘Last’ From StandardDataSetAggregationHistory Where LastAggregationDurationSeconds IS NULL group by AggregationTypeId , Datasetid

The output will show us how many aggregations there have still to be processed /aggreationtype (20=hourly , 30 = daily).

So in this case we have no problem. But I have seen scom environments where the state aggregations where so far behind that it was almost not possible to fix it. This bring up a point: especially the state aggregations are the tricky ones. If you have many ‘flipping’ monitors there will be a lot of state changes and so a lot of aggregations data to process. This process takes a lot of SQL CPU power and also disk space. In most of this cases it was the tempdb data space free or transaction log that was the root cause of the failure.

Solution

In scom we have for every aggregation an target. This target is named ‘Standard data set’. You can find it here:

If you compare the screenshot with the results on your scom console you will notice that you don’t have the green healthy state… And that’s why you are reading this post. So lets add this state.

I wanted to give every dataset that has to be processed a health state on how many aggregation it has still to process. So we make a monitor that executes for every data set the query above and if a threshold is hit the health state is changed. Also we will add a rule so that this aggregation behind count is put in a trend graph.

I have used VSAE for this , and I will not share the code but only the idea. Why not ? I believe you have to know what you are doing and by copy & pasting you don’t learn from it if you don’t have done it once from start till end.

The real work

Open a new VSAE project and add a empty MP fragment and a PowerShell fragment.

Then you make a datasource that reads the aggregation count. This is done using PowerShell and the SQL snapin.

The PowerShell script has as input the GUID of the dataset (property of the target) and as output a property bag with the aggregations count (daily and hourly). I made the script somewhat intelligent by reading out the registry where the data warehouse is located.

Now we use this datasource in a monitor module type to create a 3 state monitor. And since we have created a datasource module we can create also a rule that collects the aggregation behind for the trend graph. Yes know know this is easier to type as to do…

Below a snap of the datasource module

And below a snap of the monitor module type

and the monitor. Create one for hourly(not shown) an one for daily.

At last for trending we have to create a collection rule.

Notice that the monitor and collection rule are having as target the “Microsoft.SystemCenter.DataWarehouse.DataSet” alias “standard dataset” and notice the runas profile.

The result

When you have constructed the MP and build/deployed it you will see 2 extra monitors on the standard dataset targets as show above. Open the health explorer to see if all is ok.

Above dataset has had a problem. To see some details, view the performance counters and you will see the aggregations trend.

In this case the state hourly aggregations where way behind. So I followed one of my own blog posts to solve this one. Where I manually executed in a loop the state aggregation process to speed up the processing.

The End.

Yes I know this post is a bit ‘çloudy’ and not something you can download and import. But I hope by sharing the idea I triggered you to try it your self.

Sometimes you wonder why not all the reports are as the should be. For example of course you are known with the availability report . Just pick a target and period and you will get a nice report telling you when a target when unhealthy.

The challenge.

Okay nice …. but I want a report not based on the availability data but on the performance or configuration or security data. But wait this is build into the availability report isn’t it ?

looking at the report description:

Description:

“For every managed object within System Center Operations Manager, monitors configured in each of the disciplines below determine an objects time in state and then roll-up to an objects overall health. The availability report by default shows an objects time in state as per the monitors that roll-up within the availability discipline.

Entity health

Availability <= this you get

Configuration <= this you want

Performance <= ..

Security <= ..

“

O no , it looks like not. So yes it’s a real challenge. That the way we like it.

Solution

Since the availability report was intended to be used for this but at the end it looks like the SCOM program team decided to make it locked on ‘availability’ only. I know this because when you look into the report definition you will see:

So the report is using only the availability rollup as state calculation data. AND this parameter is hidden for gurus as us. How dear they

So we can solve it on several ways. The root solution is that we want to change the value ‘System.Health.AvailabilityState’ to ‘System.Health.PerformanceState’ or ‘System.Health.ConfigurationState’ or ‘System.Health.SecurityState’ to get the report state type we want.

1) export the report from report service and edit the hidden value to false. Import the report and open it in the SCOM console and edit the MonitorName value to for example System.Health.PerformanceState . Run the report and you are done.

2) make a normal report run using the non modified availability report and save it to a Management pack. Now export the MP and open it in notepad and edit the MP.

3) make a normal report run using the non modified availability report save it as favorite. Now open SQL enterprise and lookup the report in the table dbo.favoritereport . Change the ReportParameterValues with the changed parameters.

I know you are thinking right now… what would you do Michel…

I would go for option 1. Because I would also change the report definition to have the correct name as ‘Performance availability’ ect.. and save it also under a different name. Because you must be aware that if you only change the report value to hidden = false and don’t change the report file name….. The next time you import a new service pack or MP version it could be that your report is going to be overwritten… So said that go for the more save one and choose 2.

Let’s go!

1) So make the normal availability report in the SCOM console

2) Save it to a MP

3) Export the MP

4) Edit the MP with notepad

5) import it in scom. (leave the mp version number unchanged)

6) wait a few minutes and you will see the report in the console

Below the end result. Also notice that you can still click to sub report that that this report are also of the state type you wanted!.

Yes I know that you will have to do this for every 3 report types because you can’t change the monitor type runtime. At the end the decision is at you to use step 1 , 2 or 3.

The End

Every time I tell my self make a short blog post! But every time I notice that I am failing.. But who cares… (yes okay.. my wife)

The result could be as shown below. The Aggr_behind number shows you the aggregations that are not completed yet.

In this case with this high number we are having a serious problem. Okay then you just follow the my pervious blog post on how to solve this , this is for States missing but can also be applied for performance data. Look at the FIX: part. To kickoff the aggregation processing.

But if you see a performance data set number around 2. (See picture below) It means 2 aggregations have to be processed yet. This is what we want to see. So everything seems okay. But why are we missing the date period 01-02-2012 till 20-01-2012 ?

We could have 2 scenarios here:

1. The data was simply not provided to the DWH ?

2. The data was provided but due to stage/aggregation problems not processed.

For case 1 we have to look at the agents what went wrong. That is for this post out of scope.

For case 2 we have some solutions see below.

Case 2

First let me explain how the aggregation process works at helicopter view. I am sure I miss some details (so feel free to add / correct me on this!)

2. The DWH staging process processes this data by copying the RAW rows into a process table. Sometimes the table is simple renamed and recreated if the new RAW data count is less then a configured number. If you have a big number of new RAW rows the table rows will be copied in batches. This to minimize the transaction log impact. At last the RAW data is copied into the RAW data partitions tables.

3. The Standard Maintenance process generate the Aggregation sets that have to be processed in step 4. During this process there will be created aggregation process rows in the Aggregation history table with a Dirty Indication (DirtyInd) of 1.

4. The RAW staged partition data will be processed to aggregated hourly and daily data. When the aggregation is complete the Dirty Indication for that aggregation will be set on 0.

5. The stored procedure reads the just aggregated data.

6. Data received from step 5 will be used to generate the report for the end user.

So now knowing the data flow what could be wrong ?

The answer we have to search at the grooming process (?) yes, the grooming process. The data in the RAW partitions tables from step 2 has a grooming/retention period. This period is standard 10 days. So if your aggregation is broken for more than 10 days (and you didn’t detected this) you will LOOSE your RAW data and as a result the aggregation process will have nothing to aggregate. So no performance data, resulting in our root problem the date gap in the report.

Solution:

Pfff … nice all of this theory stuff but how do I fix this ?

Simply by : 😉

1.Manually insert the missing RAW data and kickoff the aggregation process. I will blog post on how to do this later. (would be after the MMS)

2.Prevent that this is going to happen again.

To prevent this you can increase the retention/grooming period from 10 days to lets say 30 days. Check if you have enough DB space first. Execute the query below:

Now you will have 30 days to solve your aggregation problems. Of course this is a workaround to get more air to breath during fixing your aggregation problems.

The best way is to monitor it pro active. Since we can monitor everything we create a monitor that checks the outstanding aggregations every 60 minutes and alerts when a threshold is hit. You can use the query from the analyze part in this post to do this. I would set the threshold on 10 so you will be notified if your aggregation process has a delay of 10 datasets (about 10h). If I have time before I’m going to the MMS I will blog post this extra monitor because with the normal DB watcher you can’t make this one. And of course I will use the VS Authoring extensions for this.

Short post on a very strange issue I solved this week. The DWH database was taking all the log space it could take. I some cases , when you have a lot of aggregations waiting due to a state change burst this would be normal. So you add some extra log space to the DWH and remove it after the processing has succeeded. But this time it was hungry and took 80GB+ on log space. So my alarm bells went on this is for sure not normal.

Analyze:

The output was showing that the log of the DWH database was 100% used. Okay no real new news we knew this already.

Now we must look why .

We check if the database is in simple mode. This is the default setting for the SCOM DWH. And yes this is configured correctly. So since a simple mode DB releases the log pages when the transaction is completed(committed or roll backed) it must be that a transaction isn’t completed or will never be.

Lets lookup this open transaction(s)

Execute SQL below to see the open transactions:

— open trans , if status =2 trans is still openDBCC LOGINFO

It returned over 60K on rows with status 2. So now we are sure its caused by a open transaction. So find out the guilty process causing this never closed transaction.

Execute SQL below to get the process:

— if process spid is given. Or if no spid is shown but a last old LSN is shown. look at the replicationDBCC OPENTRAN

Solution

Good readers should have noticed the words “Replicated Transaction Information” in the results. Hmm but I don’t use replication!! Even when I used SQL Server Management Studio to check the replications it did not show any repl. configuration.

After some Bing work I found the system stored procedure to force delete any replication configurations. So I executed the SQL below:

” Huston we got a problem!” when I run a availability report the data isn’t complete. I’m missing a huge number of days. The graph shows UP (Monitoring unavailable) But I am really sure the server was up and monitored !!

Analyze:

Don’t panic, we are going to solve this. (I hope..) First we are going to look up the days we are missing. Simply click on the white bar. And the detail report will be rendered.

Okay looks like we are missing the most data from of 4-3-2012. And we see strange gaps of data that is present.

Okay that’s what the report says, but I am a core stuff guy I check it this way:

Open an SQL session and connect to the DWH db. Run this query. The last aggregated data will be on the first row. So you know what the last data date is you have. We change the DateTime to the same datetime we used in the report.

So the last successful hourly aggregation was 02-03-2012 (dd-mm-yyyy). Hmmmm but when I look at the rendered report I see periods of data after this date ??? I must confess I really don’t have a idea now why

Now we have to find the root cause and fix this missing aggregations. Luckily we can enable debug information for the aggregation process so we can see more what going wrong.

Open SQL and run the query below to enable debugging for the State aggregation .

Since the maintenance for the DWH has a sequence run it means when some procedure before fails (lets say the event staging) the other won’t be hit. So I look in the debug table for other messages with ‘failed’ in the message.

Notice that we are now going a little of track , we main problem was the State report incomplete , but now we are looking at the Events. Just follow me.

Mmmm When I look at the error I see the debug procedure that writes to the debug log has a problem writing a debug message. Strange this error… So we have to find the real error. So I open the stored procedure “EventProcessStaging”. And there I found a BUG .. brrrr. The variable @InsertTableName is not set to a value before it is used as part of the debug message variable @MessageText. Because you can’t concat NULL to a string variable an exception is raised. I fixed this by moving the sql where this variable @InsertTableName is assigned to above the first use of the @InsertTableName variable. (this is for SCOM 2007 and 2012!) I raised already a bug request @Microsoft throughout the TAP program. This only occurs when you set the debuglevel > 3.

For you it simple means don’t set it above 3 or fix this stp own your own risk. (as I have done ;-0 ) In our case the debug level was already above 3 for the state dataset for the last month. So because the event processing was braking the total maintenance (bad architecture , sorry) all my state staging was stopped. And caused my empty reports.

So now we know it we can go back to the real issue. The missing states fix.

Set a enable = false override on the rule “Standard Data Warehouse Data Set maintenance rule” for all instances of “Standard Data Set”.

Now I am really sure no maintenance process is running.

And I run my own maintenance process every 1 min. Because I know catching up the state data aggregation will take some time and I don’t want to create problem’s for the other datasets (performance , events ..) I will also run the important ones in the same script.

now you check the debug log on regularly base to see if the state aggregation is completed.

You can also use the query below:

———————————————————————–— check first and last aggregation time from still to be processed data— first and last date must be equal———————————————————————–Declare @DataSet as uniqueidentifier Set @DataSet = (Select DataSetId From StandardDataSet Where SchemaName = ‘State’) Select AggregationTypeId, COUNT(*) as ‘Count’, MIN(AggregationDateTime) as ‘First’, MAX(AggregationDateTime) as ‘Last’ From StandardDataSetAggregationHistory Where DataSetId = @DataSet AND LastAggregationDurationSeconds IS NULL group by AggregationTypeId

So lets check if the process is running okay. Simply rerun the report. the output will be:

looks like its all going to be alright. Just be patient.

DO NOT FORGET:

after the states are complete to remove the overrides , other wise you will have for sure the same and more , problem again.

Not to be continued:

I really hope not. Because in my case we have a DWH size almost against 1TB and because of this size it can be very complex and tricky to solve this sort of problems. So if mr. Murphy is reading this , skip my place please …

In this post I will share some ideas I had a long time but never had time to realize it. It is just a showcase in the future I will of course extend this to a real production version.

The idea:

I wanted to do data mining on performance data that System Center Operations Manager SCOM has collected. So a data analyst could look at the data , do real-time / interactive actions to the data view perspectives and use forecasting to find quickly the bottlenecks to solve. Again a great predictive approach for the availability increment of your IT environment!

Solution(s):

Some solutions for this came up in my mind:

– writing my own SQL query’s to do this.

Disadvantage: complex to maintain and not end user (data analyst) friendly

– writing my own SQL reports to do this.

So combine the SQL query’s and use the output to render a report.

Disadvantage: In most cases a data analyst wants to change the perspectives at runtime. A report is static and needs to be rerendered every time and this is time consuming.

– Performance Point !

Yhea that sounds good. A few years ago Microsoft it self posted a demo to use performance point and SCOM DWH to do this. ( BI – Dashboard Integration )

Disadvantage: O course I tried this , It works but it is really hard to configure and extend and for sure if you aren’t a expert in maintaining SharePoint , SQL analyze services and OLAP.

Hmm … then I found the magic word: “PowerPivot”

Powerpivot is a really good tool for a data annalist. The pivot table views can be changed real-time and also aggregations can be changed on the fly. Since PowerPivot integrates smoothly with SharePoint it would mean the end user (data annalist) won’t have to install a client site software. Only a internet explorer is needed. (The SharePoint part I will blog in a future post.)

Now you have created a calculated field. Do the same for all other values you want to display in a graph.

At this point we have a pivottable with value data!

Pivot to Graph

Next we are going to create the trend graphs for the annalist. In the PowerPivot windows Press the PivotTable Icon > “Two Charts (Vertical)”

Now there will be in excel created 3 sheets. Two sheets containing the data from the pivot table and 1 sheet containing the 2 graphs. I rename the sheets to better names:

Now we are going to configure the graphs. Select in Excel the Graph Sheet (diskspace Analyze). Click on the top graph. On the right the field select windows will be shown. Select the fields show in the picture below and drag it to the correct pane on the bottom. (click on the screenshot to enlarge)

Some short explanation on what you just did:The field in the Axis Fields is the X axis and shows the date times. The field dropped in the Values pane will display on the Y axis, this will be the disk space value. This are the 2 most important fields. But…. what if a computer has 2 logical disks then the values will be added to one line instead of 2 lines. So we need to get categories configured. This is done by the fields in the Legend pane. So now we have a graph but it shows all the data from the 2 performance rules. That not really interactive, we want to select the rules runtime. So we drag the RuleDisplayName into the Slicers vertical pane and whala we see our interactive selection list. We do the same for the interactive selection of the computers. Drag the Path field into the Slicers Horizontal pane.

Last we play with the Graph styles. Select the Graph and then change the style.

Select the type “Line with markers”

Next we change the looks of the graph. Select the graph and play with the Layout options.

The result will be:

Cool it is isn’t it ???? Now you can select the logical disks interactively and the graph will be rebuild.

To be continued…

Because this post is becoming to long I will post in my next blog about some extras a data annalist has to have.– Predictive trend line so we can see when the disk space will be 0 in the future.

– Monthly / Weekly / Yearly Aggregations so we can drill in and out of the data. SCOM is only providing daily and hourly aggregations but if you have a lot of data you will have to look at a higher perspective.

– Real data mining so you can detect strange behaviors in your data.

Below a teaser..

I hope you see how easy it is to use PowerPivot to analyze your SCOM data.

If you want the excel source file you will have to leave a message below and maybe it will share it …