Book Review: Microsoft System Center – Operations Manager Field Experience

The chapter(s) that I found most helpful were basically all of them! The entire book is filled with very useful points, tips, and insights.

I’ve decided to share my highlights from reading this book, in case the points that I found noteworthy are of some benefit to someone else. So, here are my highlights (by chapter). Note that not every chapter will have highlights (depending on the content and the main focus of my work).

Chapter 01: The Role of Operations Manager

By default, every installation of Operations Manager is not registered; it’s installed as an evaluation version. This is true even if you installed it from volume licensing media. To register your installed environment, you can use the Windows PowerShell cmdlet Set-SCOMLicense.
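A minimal sketch of that licensing step, run from an elevated Operations Manager Shell on a management server; the product key shown is a placeholder:

```powershell
# The product key below is a placeholder for your volume licensing key.
Set-SCOMLicense -ProductId 'XXXXX-XXXXX-XXXXX-XXXXX-XXXXX'

# Verify the result: SkuForLicense should change from "Eval" to "Retail".
Get-SCOMManagementGroup | Select-Object SkuForLicense, TimeOfExpiration
```

A restart of the System Center Data Access Service is typically needed before the console reflects the change.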

Operations Manager was designed with built-in high availability when you have two or more management servers, so running at least two is recommended. That way, if one goes down, failover is possible. To determine which management server is down and which is still up and running, the server running the Operational Database serves as a watcher node, similar to a witness in a failover cluster, and casts the deciding vote on which management server is functional.

If the primary management server for the agent goes down, the agent tries to connect to one of the management servers defined as a failover. You can define a failover management server through the console by using AD integration or by using the Set-SCOMParentManagementServer cmdlet with the –FailoverServer parameter.
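A sketch of defining a failover management server with the cmdlet mentioned above; the server and agent names are hypothetical:

```powershell
# Hypothetical names; adjust to your environment.
$primary  = Get-SCOMManagementServer -Name 'MS01.contoso.com'
$failover = Get-SCOMManagementServer -Name 'MS02.contoso.com'

# Assign the agent's primary management server and its failover.
Get-SCOMAgent -DNSHostName 'agent01.contoso.com' |
    Set-SCOMParentManagementServer -PrimaryServer $primary

Get-SCOMAgent -DNSHostName 'agent01.contoso.com' |
    Set-SCOMParentManagementServer -FailoverServer $failover
```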

Azure Operational Insights provides you with the combined knowledge of the Microsoft Support Engineers, who are responsible for adding rules to the product. These rules work like an additional management pack that is managed centrally by Microsoft.

By default, on Windows Server 2008 R2 and higher, power management is set to Balanced. In some cases, you may experience degraded overall performance on a Windows Server machine when running with the default power plan. This is most noticeable on the SQL server running the Operational Database, where the Balanced power setting results in slow console performance since most of the background actions in the console are SQL query-based. The issue may occur irrespective of platform and may be exhibited in both physical and virtual environments.
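Switching an affected host to the High Performance plan can be done with powercfg; the GUID below is the well-known High Performance scheme identifier on Windows Server:

```powershell
# Show the available power plans and which one is active.
powercfg /list

# Activate the High Performance plan by its well-known scheme GUID.
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c
```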

Another power management setting to consider is described in the Knowledge Base article “Degraded overall performance on Windows Server 2008 R2” at http://support.microsoft.com/kb/2207548. Note that even though this article describes the problem in the context of Windows Server 2008 R2, the strategies described are also valid for later versions of Windows Server.

You can find some important information about the power management setting on a network adapter at http://support.microsoft.com/kb/2740020. As stated in the Knowledge Base article, you might want to disable the Allow The Computer To Turn Off This Device To Save Power network adapter power management setting on servers.

The D drive on an Azure IaaS VM is a temporary disk, using local storage from the actual hardware that is hosting your VM. This means that everything on this drive will be lost in the case of a reboot, so don’t use it to store anything that you want to keep.

Putting a Gateway server in a remote subnet to compress the outgoing data is no longer recommended. The agent itself does an equally good job of compressing the data in Operations Manager 2012 R2. However, the other reasons for installing a Gateway server in a remote subnet are still valid, for instance to reduce the administrative overhead and to minimize the number of certificates that are needed. More information can be found at http://technet.microsoft.com/en-us/library/hh212823.aspx.

When you install Operations Manager on machines running antivirus software, you should configure the antivirus software so that the following directories are excluded:

The Health Service State folder on every management server and every agent

The data and log file directories where your databases are located

Excluding the actual binary files, such as MonitoringHost.exe, is not recommended.
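For Windows Defender, the exclusions could be added as sketched below; the paths are illustrative defaults and will differ per environment (agent versus management server, SQL data and log locations), and other antivirus products have their own equivalents:

```powershell
# Windows Defender example; all paths are illustrative placeholders.
Add-MpPreference -ExclusionPath 'C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State'
Add-MpPreference -ExclusionPath 'D:\SQLData'  # database data files
Add-MpPreference -ExclusionPath 'E:\SQLLogs'  # database log files
```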

The best way to configure SQL Server in your Operations Manager environment is to keep it simple. The default settings for Operations Manager should be left alone unless you have very specific reasons to change them.

Neither auto grow nor auto shrink is recommended for the Operational Database because it needs 50 percent free space at all times to perform maintenance and indexing tasks. If the database doesn’t have enough free space, the scheduled maintenance tasks might fail. Operations Manager will alert you when there is less than 40 percent free space.

The SQL Server edition you are using also has an important role when you are considering auto grow. SQL Server Standard edition can cause the database tables to lock out when auto grow is configured. However, this does not occur with SQL Server Enterprise edition. This applies to both the Operational Database and the Data Warehouse Database.

Auto grow is supported (though not recommended) when enabled as an insurance policy against the database’s files filling up. When using auto grow on the databases, it is better to set it to increase by a fixed amount rather than a percentage. The fixed increase should be no more than 500 MB or 1 GB per growth to limit the blocking that might occur during the expansion process. It is also useful to configure a maximum possible size to prevent the databases from filling up the disk they reside on.
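If you do enable auto grow this way, a fixed-increment configuration might look like the sketch below; it assumes the SqlServer PowerShell module (Invoke-Sqlcmd), a placeholder instance name, and the default logical file name MOM_DATA for the Operational Database (verify yours with sp_helpfile first):

```powershell
# 512-MB fixed growth increments, capped at 100 GB (values illustrative).
Invoke-Sqlcmd -ServerInstance 'SQL01' -Query @"
ALTER DATABASE OperationsManager
MODIFY FILE (NAME = MOM_DATA, FILEGROWTH = 512MB, MAXSIZE = 102400MB);
"@
```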

In SQL Server, data files can be initialized instantaneously, which allows file operations to run quickly. Instant file initialization reclaims used disk space without filling that space with zeros; instead, disk content is overwritten as new data is written to the files. Log files cannot be initialized instantaneously. Instant file initialization is available only if the SQL Server (MSSQLSERVER) service account has been granted the right to perform volume maintenance tasks (SE_MANAGE_VOLUME_NAME). Members of the Windows Administrators group have this right and can grant it to other users by adding them to the Perform Volume Maintenance Tasks security policy.

As a general rule, set the combined max server memory value across all SQL Server instances to about 2 GB less than the physical memory of the host. This leaves enough available memory for the operating system to function optimally.
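As a worked example of that rule, on a 16-GB host running a single instance it leaves roughly 14 GB (14336 MB) for SQL Server; the instance name is a placeholder and the block assumes the SqlServer module:

```powershell
# Reserve ~2 GB for the OS on a 16-GB host: cap SQL Server at 14336 MB.
Invoke-Sqlcmd -ServerInstance 'SQL01' -Query @"
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 14336; RECONFIGURE;
"@
```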

Another low-effort, high-reward action is splitting up the files that comprise the TempDB. There’s only one TempDB per SQL Server instance, so it’s often a performance bottleneck. Make sure that the disk subsystem that holds the TempDB files is up to the task. Increase the number of data files that make up your TempDB to maximize disk bandwidth and to reduce contention in allocation structures.

Generally, if the number of logical processors is less than or equal to eight, use the same number of data files as logical processors. If the number of logical processors is greater than eight, use eight data files; if contention continues, increase the number of data files by multiples of four (up to the number of logical processors) until the contention is reduced to acceptable levels or make changes to the workload/code. It is also best to spread these different files over multiple disk systems and to keep all files the same size.

The log file for TempDB should remain a single file at all times.

It is also recommended that you size the TempDB according to the Operations Manager environment. The default size for TempDB is 8 MB with a 1-MB log file. Every time you restart SQL Server, it re-creates this 8-MB file from the model database.
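Putting the TempDB guidance above together, here is a sketch for a host with four logical processors: four equally sized data files plus the single log file. The sizes, paths, and instance name are illustrative assumptions, and tempdev is the default logical name of the primary TempDB data file:

```powershell
Invoke-Sqlcmd -ServerInstance 'SQL01' -Query @"
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 2048MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 2048MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdev3.ndf', SIZE = 2048MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev4, FILENAME = 'T:\TempDB\tempdev4.ndf', SIZE = 2048MB, FILEGROWTH = 512MB);
"@
```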

Some SQL teams automatically assume that all databases should be set to the Full recovery model. This requires backing up the transaction logs on a regular basis but gives the added advantage of being able to restore up to the time of the last transaction log backup. This approach does not make as much sense for Operations Manager.

It is best practice to use a domain account to run your SQL Server service (MSSQLSvc). The catch is that if the SQL Server service is not running as either the server’s system account or a domain administrator, SQL Server cannot register its service principal name (SPN) when the service starts. If the SQL Server service does not have sufficient rights, you can use the SETSPN tool manually as a domain administrator to register the necessary SPNs.
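A sketch of registering the SPNs manually with SETSPN, run as a domain administrator; the host, port, and service account are placeholders:

```powershell
# -S checks for duplicates before adding the SPN.
setspn -S MSSQLSvc/SQL01.contoso.com:1433 CONTOSO\sqlsvc
setspn -S MSSQLSvc/SQL01.contoso.com CONTOSO\sqlsvc

# List the SPNs now registered on the service account.
setspn -L CONTOSO\sqlsvc
```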

By default, Operations Manager does self-maintenance. Since most Operations Manager administrators are not SQL Database Administrators (DBAs), Microsoft implemented several rules in Operations Manager to automatically keep the databases optimized. These maintenance tasks are defined as system rules in the Operations Manager management pack, one of the management packs installed by default when you install Operations Manager. Since these maintenance tasks run automatically, be careful that your own maintenance tasks do not conflict with the built-in system rules (if you or the DBA decide to implement additional maintenance).

For the Operations Manager Data Warehouse, an automatic maintenance job runs every 60 seconds. This job, coming from the Standard Data Warehouse Data Set maintenance rule, does many things, of which re-indexing is only one. All the necessary tables are updated and re-indexed as needed. When a table is 10 percent fragmented, the job re-organizes it. When the table is 30 percent or more fragmented, the index is re-built. Therefore, especially since the built-in maintenance runs every 60 seconds, there is no need for a DBA to run any UPDATE STATISTICS or DBCC DBREINDEX maintenance commands against this database.
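If you want to see what the built-in maintenance is working against, a read-only fragmentation check using the same 10 percent threshold might look like this; the instance name is a placeholder and OperationsManagerDW is the default Data Warehouse database name:

```powershell
Invoke-Sqlcmd -ServerInstance 'SQL01' -Database 'OperationsManagerDW' -Query @"
SELECT OBJECT_NAME(s.object_id) AS TableName,
       i.name AS IndexName,
       s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.avg_fragmentation_in_percent > 10
ORDER BY s.avg_fragmentation_in_percent DESC;
"@
```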

By default, the allocation unit (block) size of any volume smaller than 16 TB is 4 K. Since SQL Server reads in 64-K increments, it is best practice to format the disk containing the SQL data and log files with a 64-K block size. You can only set this allocation unit size when you format the disk.
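A sketch of checking and setting the allocation unit size; the drive letters are placeholders, and note that Format-Volume destroys all data on the target volume:

```powershell
# Check the current allocation unit size (in bytes) of an existing volume.
Get-Volume -DriveLetter D | Select-Object DriveLetter, AllocationUnitSize

# Format a NEW volume for SQL data/log files with a 64-K block size.
# WARNING: this destroys all data on the volume.
Format-Volume -DriveLetter F -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel 'SQLData'
```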

If you use the wrong collation, searches may be less effective or not work at all, sorting might produce unexpected results, and other problems can happen when inserting or retrieving data.

If a SQL Server collation other than SQL_Latin1_General_CP1_CI_AS is specified when you create the database, you will have to reinstall Operations Manager and create another database to fix this problem because you cannot change the collation after installing Operations Manager.

The registry key path where settings for the Data Access Layer are included is:

HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL

The DWORD setting, called DALInitiateClearPool, is used by the Data Access Service to control whether to reconnect to the database after a period of unavailability. The default value is 0 (disabled). The recommendation is to enable this feature by setting the value to 1 (decimal).
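Enabling the setting could be sketched as follows; like most registry changes of this kind, it is typically picked up only after a restart of the System Center Data Access Service:

```powershell
$dal = 'HKLM:\SOFTWARE\Microsoft\System Center\2010\Common\DAL'
New-ItemProperty -Path $dal -Name 'DALInitiateClearPool' `
    -PropertyType DWord -Value 1 -Force
```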

The Persistence Manager feature is used by the Health Service to read and write data to the local database. The local or cache database is called HealthServiceStore.edb, and it is a Microsoft Jet Database Engine database. The registry key path for settings belonging to this feature is:

HKLM\SYSTEM\CurrentControlSet\Services\HealthService\Parameters

The setting responsible for how often Persistence Manager writes data from memory to the disk is called Persistence Checkpoint Depth Maximum of type DWORD and is measured in bytes. The default value for this setting is 20971520 (decimal) bytes. On management servers that handle a large number of objects not managed directly by agents, such as SNMP Devices, Groups, URL Monitors, Cross-Platform Agents, and so on, you may need to increase this value to relieve disk pressure. The recommended value is 104857600 (decimal).
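Applying the recommended value on a heavily loaded management server could look like this; the Health Service reads the setting at startup, so a service restart follows:

```powershell
$params = 'HKLM:\SYSTEM\CurrentControlSet\Services\HealthService\Parameters'
New-ItemProperty -Path $params -Name 'Persistence Checkpoint Depth Maximum' `
    -PropertyType DWord -Value 104857600 -Force

# The Health Service reads this value at startup.
Restart-Service -Name HealthService
```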

Health Manager is used by the Health Service to calculate and track the health state of each monitor of each object it monitors. The registry path for settings belonging to this feature is:

HKLM\SYSTEM\CurrentControlSet\Services\HealthService\Parameters

The important setting for the Health Manager is State Queue Items of type DWORD. This sets the maximum size (in bytes) of the state data queue. If the value is too small, or if there are too many workflows running (based on the number of objects being managed), state change data can be lost. The default value for this setting is calculated by the Health Service on startup based on how many objects it needs to manage. For agents in a small environment, this value is set to 1024 (decimal). The value is set to 10240 (decimal) on management servers in a mid-size environment. For large environments, on management servers that manage many objects, the default is 25600 (decimal). The recommendation is to double these default values where needed, whether on an agent that manages a lot of objects or on a management server.

Do not change the settings for Pool Manager unless advised by Microsoft Support after a proper analysis of the environment, behavior of the resource pools, and load on the management servers. If these settings are changed, it is important to make sure that they are changed to the same value on all management servers in the environment.

To remove a server from a resource pool with automatic membership, first set the pool membership to manual (automatic is the default). This can be done only from within Windows PowerShell as follows:
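The book's code sample itself isn't part of my highlight, but the usual approach looks roughly like this; the pool and server names are placeholders:

```powershell
# Switch the pool from automatic to manual membership.
$pool = Get-SCOMResourcePool -DisplayName 'Notifications Resource Pool'
$pool | Set-SCOMResourcePool -EnableAutomaticMembership $false

# With membership now manual, remove the management server from the pool.
$ms = Get-SCOMManagementServer -Name 'MS02.contoso.com'
$pool | Set-SCOMResourcePool -Member $ms -Action 'Remove'
```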

Check for new management pack versions of any installed management packs. Also check the management pack guides of newly released management packs to determine whether they meet the requirements of your organization and are suitable for your environment.

Review the baselines (performance counters) to assess the ongoing performance of the Operations Manager environment as new agents and management packs are added.

Chapter 02: Best Practices for Working with Management Packs

The product group that creates a product also makes its management packs, so you will have the combined knowledge of the people who created the product to assist you with monitoring your applications in the recommended way.

When you seal a management pack, the file is digitally signed by the provider and the user knows that it hasn’t been modified since then.
To upgrade a sealed management pack, the same key must be used or the upgrade will fail.

Summary of best practices

In summary, here is a list of the most important things to consider when working with management packs:

Class properties you choose should change values as seldom as possible, close to never.

Don’t use Operations Manager for software inventory (System Center Configuration Manager is built to do that), and don’t collect too many properties.

Monitors should change their state as seldom as possible. They should not be too sensitive, and the related issue that is described in the alert should be resolved in a more permanent manner.

The type space should be kept as small as possible. Import or create only what you need and delete what you do not use.

Windows PowerShell scripts that connect to the Data Access Service should be kept to a minimum. At least try to develop them in a way that loads as few objects as possible by using selection criteria for the Operations Manager cmdlets.
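As an illustration of the difference, contrast client-side filtering with passing a criteria string to the cmdlet so that only matching objects are loaded through the Data Access Service:

```powershell
# Heavy: loads every alert object, then filters on the client side.
$open = Get-SCOMAlert |
    Where-Object { $_.ResolutionState -eq 0 -and $_.Severity -eq 'Error' }

# Lighter: the criteria are evaluated server-side, so only matching
# objects are loaded (Severity 2 = Error, ResolutionState 0 = New).
$open = Get-SCOMAlert -Criteria 'ResolutionState = 0 AND Severity = 2'
```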

Don’t overuse maintenance mode. If there is no way around it, reduce the database grooming settings for state change events data.

Targets for workflows should be as specific as possible. Use seed classes with lightweight discovery rules for custom application monitoring.

Not all available management packs are divided into Discovery, Monitoring, and Presentation parts. If everything is in one management pack file, the following explanation is still valid. However, since dividing a management pack into these three different parts is best practice for building your own management packs, you should follow the example set by the SQL Server management pack.

When creating your own management packs, you shouldn’t use a broad class for all of your discoveries because it will negatively impact the performance of Operations Manager. Use a broad class only for the base discovery or the seed discovery.

Thresholds don’t appear in the overview window of MP Viewer or in the Excel or HTML file when you export the management pack. To view them, select a rule or a monitor, and then click the Knowledge, Alert Description, or Raw XML tab (for monitors) or click the Knowledge or Raw XML tab (for rules) in the bottom right pane. When you select Raw XML, you will see the actual XML code that makes up the management pack. In this raw XML code, you can also see the thresholds.

You cannot change the target for an override using the Operations Manager console. Instead, you must note the settings of the specific override, delete it, and then re-create it with the new target.

If groups are created with extended authoring tools (or directly in XML using your preferred XML editor), they can and should be based on Windows Computer objects hosting special applications, for instance, a Windows Computer group that contains only Windows computers based on a discovered custom special application class. For notifications, the corresponding Health Service Watcher objects could be added to the group. This is necessary because you need the Health Service Watcher objects for Operations Manager self-monitoring alerts like Heartbeat Failures or Computer Not Reachable to be included too. Also remember to add cluster objects (if you need cluster-based alerts), which are not hosted by Windows Computer.

Operations Manager includes a SharePoint web part that displays selected dashboards from the web console. The SharePoint farm must be running SharePoint 2013, SharePoint Server 2010 Standard, SharePoint Server 2010 Enterprise, or SharePoint Foundation 2010, and the procedure to configure it is described at https://technet.microsoft.com/en-us/library/hh212924.aspx.

Chapter 04: Troubleshooting your Operations Manager Environment

The most basic thing to check is that the information event 6022 is being logged periodically, which indicates that the HealthService is running at least some workflows (through MonitoringHost processes) and is not in a hung state or something similar.

It is more than enough to go through the events from the past 6 to 10 hours because if there is a failure at some point, that failure will repeat itself often.

Usually, you should first filter the event log just on Error and Warning events (Operations Manager never triggers a Critical level event).

It is good to go through each Error or Warning event and make an analysis along these lines:

What is the frequency of the event?

What is the exact event description?

For events with the same event ID, are these really the exact same event based on a careful comparison of the event description?

If you see a problem event for some workflow that you know should run every 10 minutes, is the last such event fresh or is it too old, maybe indicating this was a one-time problem?

Are there one or more events that seem to be spamming the event log? For example, do you see the same event 50 times in 1 second, or something similar?

There can be two (or more) events that have the same event ID and exact same event description, but with a very specific and important difference: a different error code in the description
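The first checks above can be sketched with Get-WinEvent; the log name assumes a management server or agent with the standard Operations Manager event log:

```powershell
# Is event 6022 still being logged, i.e., are workflows running?
Get-WinEvent -FilterHashtable @{ LogName = 'Operations Manager'; Id = 6022 } -MaxEvents 5

# Error (2) and Warning (3) events from the last 10 hours, grouped by ID
# so spamming events stand out at the top.
Get-WinEvent -FilterHashtable @{
    LogName   = 'Operations Manager'
    Level     = 2, 3
    StartTime = (Get-Date).AddHours(-10)
} | Group-Object Id | Sort-Object Count -Descending
```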

One of the most important aspects of maintaining a healthy and performant Operations Manager environment is management pack tuning. Each time you import a new management pack, you need to monitor the data it collects and how it behaves in the following one to two weeks.

Another cause of a big StateChangeEvent table is state change events that are older than the data grooming setting for state changes (default 7 days). This can happen if you manually (or via some automated/scripted method) close alerts without resetting the monitors that raised them. This is against best practice because the grooming stored procedure that cleans up state changes does not delete state changes belonging to a monitor that is not in the Healthy state. Additionally, a high number of state changes might cause the stored procedure to time out and fail to delete everything.

Many different issues on either management servers or agents are caused by known problems with certain versions of the Windows operating system. Because of this, a Knowledge Base article listing recommended Windows operating system (version dependent) hotfixes and update rollups is available at http://support.microsoft.com/kb/2843219.

A good Knowledge Base article to help troubleshoot the different scenarios for agents that are displayed as gray is available at http://support.microsoft.com/kb/2288515. One of these scenarios also describes the presence of warning event 2115, which would most likely appear on management servers or gateway servers and may involve performance problems. Another great Knowledge Base article for troubleshooting this issue in detail is available at http://support.microsoft.com/kb/2681388.

Chapter 05: Using Operations Manager in Cloud Environments

Visual Studio Web Test monitoring is the other option for Global Service Monitor. It gives you the ability to import more extensive Global Service Monitor web tests that a developer has built using Visual Studio. With Visual Studio Web Test monitoring, you can record actions to take against your external-facing applications and validate against multiple criteria and multiple websites at the same time. Transactions are supported, as are authentication actions.

Global Service Monitor needs a proxy or a server that has ports opened to the Internet. If your proxy server needs authentication, you will need to follow the steps described in the Microsoft Knowledge Base article at http://support.microsoft.com/kb/2900136/en-us. From a security perspective, everything that is sent over the Internet is encrypted and is also stored encrypted on the Microsoft Azure watcher nodes that are managed by Microsoft.

Even in the full subscription, there is a limit of 25 web tests. This can be changed by contacting Microsoft Support.

With Azure Operational Insights, you can see where your virtual machine infrastructure needs more resources and where it’s under-utilized. You can also use “what-if” scenarios to enhance your planning options.

A quicker way to create the virtual machine with fewer configuration options, similar to the quick create option in the management portal, is to use the New-AzureQuickVM Windows PowerShell cmdlet. In the script center at http://azure.microsoft.com/en-us/documentation/scripts/, you can find more scripts that can automatically create virtual machines with multiple data disks.
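A sketch using the classic (Service Management) Azure PowerShell module that was current when the book was written; every name and credential below is a placeholder:

```powershell
# Pick an image, then create the VM with minimal configuration.
$image = Get-AzureVMImage |
    Where-Object { $_.Label -like 'Windows Server 2012 R2*' } |
    Select-Object -First 1

New-AzureQuickVM -Windows `
    -ServiceName 'contoso-scom-svc' `
    -Name 'scomtest01' `
    -ImageName $image.ImageName `
    -AdminUsername 'azureadmin' `
    -Password 'Placeh0lder!' `
    -Location 'West Europe'
```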

The operating system disk has read and write caching enabled. This is the best setting for the operating system, but for the Active Directory database (NTDS.DIT) and log files, it is recommended that you attach another disk with only read caching enabled to the virtual machine and put the Active Directory files on that drive.