Troubleshooting agent problems in Operations Manager

This guide helps you troubleshoot issues in which Operations Manager agents have problems connecting to the Management Server in System Center 2012 Operations Manager (OpsMgr 2012 and OpsMgr 2012 R2) and later versions.

Background information

To learn more about Operations Manager agents and how they communicate with management servers, see the following sections in the article Operations Manager Key Concepts:

Agents

Communication between agents and management servers

Who is it for?

Admins of System Center 2012 Operations Manager who help resolve agent connectivity issues.

How does it work?

We’ll begin by asking if the Health service is running without errors when the problem occurs. If there is no problem with the Health service, we’ll take you through a series of steps that are specific to your situation to resolve your issue.

Estimated time of completion:

15-30 minutes.

Checking the Health Service

Whenever you’re faced with connectivity problems in Operations Manager, first make sure that the Health service is running without errors on both the client agent and the Management Server.

To determine whether the service is running, follow these steps:

Press the Windows key+R.

In the Run box, type services.msc and press Enter.

Find the Microsoft Monitoring Agent service and then double-click it to open the Properties page.

Note: In System Center 2012 Operations Manager, System Center 2012 Operations Manager Service Pack 1 and System Center Operations Manager 2007 R2, the service name is System Center Management.

Make sure that the Startup type is set to Automatic.

Check whether Started is displayed in the Service status area. If Started is not displayed, click Start.
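The same check can be scripted when you have many computers to look at. The following is a minimal Python sketch, not an official tool. It assumes that the service short name is HealthService and that the text printed by the Windows sc query command contains a STATE line in the usual layout. The function names and the sample text are illustrative only.

```python
import re
import subprocess


def parse_service_state(sc_output):
    """Return the state name (for example 'RUNNING') from `sc query` output, or None."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def query_service_state(service_name="HealthService"):
    """Run `sc query` on a Windows computer and return the parsed state.

    The default service name is an assumption; adjust it for your environment.
    """
    result = subprocess.run(
        ["sc", "query", service_name],
        capture_output=True, text=True, check=False,
    )
    return parse_service_state(result.stdout)


# Sample text in the layout that `sc query` is assumed to print.
SAMPLE = """SERVICE_NAME: HealthService
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
"""

if __name__ == "__main__":
    print(parse_service_state(SAMPLE))
```

On a Windows computer, query_service_state() returns the current state, and any value other than RUNNING means that the service should be started as described in the steps above.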

Checking Antivirus Exclusions

If the Health service is up and running, the next thing we should do is confirm that antivirus exclusions are properly configured. For the latest information about recommended antivirus exclusions for Operations Manager, please see the following:

Checking for Network Issues

In Operations Manager, the agent computer must be able to reach and connect to TCP port 5723 on the Management Server. If this connection fails, you will likely see Event ID 21016 and Event ID 21006 on the client agent.

In addition to TCP port 5723, the following ports must also be enabled:

TCP and UDP port 389 for LDAP

TCP and UDP port 88 for Kerberos authentication

TCP and UDP port 53 for DNS
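A quick way to test TCP reachability from the agent computer is a short script. The following Python sketch is one possible approach and only attempts a TCP connection, so it says nothing about the UDP side of ports 389, 88 and 53. The function name can_connect and the host name in the usage note are hypothetical.

```python
import socket


def can_connect(host, port=5723, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("ms01.contoso.com") tests port 5723 on a Management Server with that (made-up) name, and can_connect("dc01.contoso.com", 88) tests the Kerberos TCP port on a domain controller. A result of False is consistent with a firewall block, but it can also mean that the name did not resolve or that the server is down.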

In addition to the above, RPC communication must complete successfully over the network. Problems with RPC communication usually manifest when you push an agent from the OpsMgr management server, and they usually cause the client push to fail with an error similar to the following:

This typically occurs when either nonstandard ephemeral ports are being used, or when the ephemeral ports are blocked at a firewall. For example, if nonstandard high range RPC ports have been configured, a network trace while pushing the agent will show a successful connection to RPC port 135 followed by a connection attempt using a nonstandard RPC port such as 15595 as shown below.

In this example, since the port exemption for this non-standard range was not configured on the firewall, the packets are dropped and the connection fails.

In Windows Vista and later, the default RPC high port range is 49152-65535, so that is the range to look for. To verify whether this is your issue, run the following command to see what RPC high port range is configured:

netsh int ipv4 show dynamicport tcp

If you see a different start port then the problem may be that the firewall is not configured correctly to allow traffic on those ports. You can change the configuration on the firewall or you can run the command below to set the high range ports back to their default values:

netsh int ipv4 set dynamicport tcp start=49152 num=16384
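To make the arithmetic explicit: a range that starts at port 49152 and contains 16384 ports ends at 49152 + 16384 - 1 = 65535. The following Python sketch parses the dynamic port settings and computes the first and last port. It assumes that the output of netsh int ipv4 show dynamicport tcp contains "Start Port" and "Number of Ports" lines as in the sample text; the function names are illustrative.

```python
import re

DEFAULT_RANGE = (49152, 65535)


def parse_dynamic_port_range(netsh_output):
    """Return (first_port, last_port) from the dynamic port settings text."""
    start = re.search(r"Start Port\s*:\s*(\d+)", netsh_output)
    count = re.search(r"Number of Ports\s*:\s*(\d+)", netsh_output)
    if not start or not count:
        raise ValueError("could not find the start port or the number of ports")
    first = int(start.group(1))
    return first, first + int(count.group(1)) - 1


def is_default_range(netsh_output):
    """Return True if the configured range matches the Windows Vista and later default."""
    return parse_dynamic_port_range(netsh_output) == DEFAULT_RANGE


# Sample text in the layout that netsh is assumed to print.
SAMPLE = """Protocol tcp Dynamic Port Range
---------------------------------
Start Port      : 49152
Number of Ports : 16384
"""

if __name__ == "__main__":
    print(parse_dynamic_port_range(SAMPLE), is_default_range(SAMPLE))
```

If is_default_range returns False, compare the computed range with the port exemptions on the firewall before you change anything.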

Note that you can also configure the RPC dynamic port range via the registry. See the following article for more information:

If everything appears to be configured correctly but you still experience the error above, one of the following conditions may be true:

DCOM has been restricted to a certain port. To verify, open dcomcnfg.exe, go to dcomcnfg -> My Computer -> Properties -> Default Protocols, and ensure that there is no custom setting there.

WMI is configured to use a custom endpoint. To check whether a static endpoint is configured for WMI, open dcomcnfg.exe, go to dcomcnfg -> My Computer -> DCOM Config -> Windows Management and Instrumentation -> Properties -> Endpoint, and ensure that there is no custom setting there.

The agent computer is running the Exchange 2010 CAS role. The Exchange 2010 Client Access Service changes this port range to 6005 through 65535. The range was expanded to provide sufficient scaling for large deployments. Do not change these port values without fully understanding the consequences of doing so.

More Information

For more information regarding port and firewall requirements please see the Firewalls section in the following document:

You can also find the minimum required network connectivity speeds in the same document.

Final Notes

Troubleshooting network problems is a large topic in itself, so it’s best to consult a networking engineer if you suspect that an underlying network problem is causing your agent connectivity issues in Operations Manager. Basic, generalized network troubleshooting information from our Windows Directory Services support team is also available here:

Checking for Certificate Issues on the Gateway Server

Operations Manager requires that mutual authentication be performed between client agents and Management Servers prior to the exchange of information between them. To secure the authentication process between the two, the process is encrypted. When the agent and the Management Server reside in the same Active Directory domain, or in Active Directory domains that have established trust relationships, they make use of the Kerberos v5 authentication mechanisms provided by Active Directory. When the agents and Management Servers do not lie within the same trust boundary, other mechanisms must be used to satisfy the secure mutual authentication requirement.

In Operations Manager this is accomplished through the use of X.509 certificates issued for each computer. If there are many agent-monitored computers, this can result in high administrative overhead for managing all those certificates. In addition, if there is a firewall between the agents and management servers, multiple authorized endpoints must be defined and maintained in the firewall rules to allow communication between them.

To reduce this administrative overhead, Operations Manager has an optional server role called the Gateway Server. Gateway Servers are located within the trust boundary of the client agents and can participate in the mandatory mutual authentication. Because gateways lie within the same trust boundary as the agents, the Kerberos v5 protocol for Active Directory is used between the agents and the Gateway Server, and each agent then communicates only with the Gateway Servers that it is aware of.

The Gateway Servers then communicate with the Management Servers on behalf of the clients. To support the mandatory secure mutual authentication between the Gateway Server and the Management Servers, certificates must be issued and installed but only for the gateway and Management Servers. This reduces the number of certificates required, and in the case of an intervening firewall, it also reduces the number of authorized endpoints that need to be defined in the firewall rules.

The takeaway here is that if the client agents and Management Servers do not lie within the same trust boundary, or if a Gateway Server is used, the necessary certificates must be installed and configured correctly for agent connectivity to function properly. Here are some key things to check:

Confirm that the gateway certificate exists in Local Computer/Personal/Certificates on the Management Server to which the gateway or agent is connecting.

Confirm that the root certificate exists in Local Computer/Trusted Root Certification Authorities/Certificates on the Management Server to which the gateway or agent is connecting.

Confirm that the root certificate exists in Local Computer/Trusted Root Certification Authorities/Certificates on the Gateway Server.

Confirm that the gateway certificate exists in Local Computer/Personal/Certificates on the Gateway Server.

Confirm that the gateway certificate exists in Local Computer/Operations Manager/Certificates on the Gateway Server.

Confirm that HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings\ChannelCertificateSerialNumber exists on the Gateway Server and contains the serial number of the certificate in reversed byte order. The serial number is shown in the Serial number field in the details of the certificate in the Local Computer/Personal/Certificates folder.

Confirm that HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings\ChannelCertificateSerialNumber exists on the Management Server to which the gateway or agent is connecting and contains the serial number of the certificate in reversed byte order. The serial number is shown in the Serial number field in the details of the certificate in the Local Computer/Personal/Certificates folder.
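Comparing the registry value with the certificate by eye is error-prone, so it can help to compute the reversed form of the serial number first. The following Python sketch assumes that "reversed" means that the order of the bytes (pairs of hexadecimal digits) is reversed, not the order of the individual digits. The function name and the sample serial number are made up for illustration.

```python
def reverse_serial_bytes(serial):
    """Return the serial number with its byte order reversed, as lowercase hex pairs."""
    # Accept the separators that certificate tools commonly insert between bytes.
    hex_digits = "".join(ch for ch in serial if ch not in " :-").lower()
    if len(hex_digits) % 2 != 0 or any(c not in "0123456789abcdef" for c in hex_digits):
        raise ValueError("serial number must be an even number of hex digits")
    pairs = [hex_digits[i:i + 2] for i in range(0, len(hex_digits), 2)]
    return " ".join(reversed(pairs))


if __name__ == "__main__":
    # Made-up serial number, as it might appear in the Serial number field.
    print(reverse_serial_bytes("1a 2b 3c 4d"))
```

For the made-up serial number 1a 2b 3c 4d, the expected registry data is 4d 3c 2b 1a.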

TIP You might receive the following Event IDs in the Operations Manager Event Log when there is an issue with certificates:

20050

20057

20066

20068

20069

20072

20077

21007

21021

21002

21036

For details on how certificate based authentication functions in Operations Manager, as well as instructions on how to obtain and configure the proper certificates, please see the following:

Checking for a Disjointed Namespace on the Client Agent

A disjoint namespace occurs when the client computer has a primary Domain Name Service (DNS) suffix that does not match the DNS name of the Active Directory domain that the client belongs to. For example, a client that uses a primary DNS suffix of corp.contoso.com in an Active Directory domain that is named na.corp.contoso.com is using a disjoint namespace.
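The definition above amounts to a simple comparison of two names. The following Python sketch shows that comparison under the assumption that DNS names are compared case-insensitively and that a trailing dot is ignored. The function names are illustrative, and the sketch does not query Active Directory or the computer; you supply both names yourself.

```python
def normalize_dns_name(name):
    """Lowercase a DNS name and drop surrounding whitespace and a trailing dot."""
    return name.strip().rstrip(".").lower()


def is_disjoint_namespace(primary_dns_suffix, ad_domain_dns_name):
    """Return True if the primary DNS suffix differs from the AD domain DNS name."""
    return normalize_dns_name(primary_dns_suffix) != normalize_dns_name(ad_domain_dns_name)


if __name__ == "__main__":
    # The example from the text: suffix corp.contoso.com in domain na.corp.contoso.com.
    print(is_disjoint_namespace("corp.contoso.com", "na.corp.contoso.com"))
```

With the names from the example, the function returns True, which marks the namespace as disjoint.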

When the client agent or the Management Server has a disjoint namespace, Kerberos authentication can fail. This is an Active Directory issue rather than a System Center Operations Manager issue, but it does affect agent connectivity. To resolve it, use one of the following methods.

Method 1

Manually create the appropriate SPNs for the affected computer accounts and include the host SPN for the FQDN together with the disjointed name suffix (HOST/machine.disjointed_name_suffix.local). Also update the DnsHostName attribute for the computer to reflect the disjointed name (machine.disjointed_name_suffix.local) and enable registration for the attribute in a valid DNS zone on the DNS servers that Active Directory uses.

Method 2

Correct the disjointed namespace. To do this, change the namespace in the affected computer’s properties to reflect the FQDN of the domain to which the computer belongs. Please make sure that you are fully aware of the consequences of doing this prior to making any changes in your environment. For more information please see the following:

Checking for a Slow Network Connection

If the client agent is running across a slow network connection, it may encounter connectivity issues because there is a hard-coded timeout for authentication. To resolve this issue, install System Center 2012 Operations Manager SP1 Update Rollup 8 (assuming you’re not already on R2) and then manually change the timeout value.

The UR8 update increases the server timeout to 20 seconds and the client timeout to 5 minutes.

For more information on UR8 for System Center 2012 Operations Manager Service Pack 1 please see the following:

Note that this issue can also occur when there are time synchronization issues between the client agent and the Management Server.

Checking for OpsMgr Connector Problems

If everything else checks out, check the Operations Manager Event Log for any error events generated by OpsMgr Connector. Common causes and resolutions for various OpsMgr Connector events are listed below.

Select the option that best describes your scenario.

Event IDs 21006 and 21016 appear on the client agent

Event ID 20057 appears on the Management Server

Event IDs 2010 and 2003 appear on the client agent

There is Event ID 20070 combined with Event ID 21016

Event ID 21023 appears on the client agent, while Event IDs 29120, 29181 and 21024 appear on the Management Server

There are other OpsMgr Connector Event IDs not listed above

There are no OpsMgr Connector Events logged in the Operations Manager Event Log

Event IDs 21006 and 21016 appear on the Client Agent

Examples of these events are shown below.

Source: OpsMgr Connector
Date: Time
Event ID: 21006
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: ComputerName
Description: The OpsMgr Connector could not connect to <yourManagementServer>:5723. The error code is 10060L (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.). Please verify there is network connectivity, the server is running and has registered its listening port, and there are no firewalls blocking traffic to the destination.

=====

Log Name: Operations Manager
Source: OpsMgr Connector
Date: Time
Event ID: 21016
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: ComputerName
Description: OpsMgr was unable to set up a communications channel to <yourManagementServer> and there are no failover hosts. Communication will resume when <yourManagementServer> is available and communication from this computer is allowed.

Troubleshooting

Usually these event IDs are generated because the agent has not yet received configuration. After a new agent is added and before it is configured, this event is common. Note that Event 1210 in the agent's Operations Manager event log indicates that the agent received and applied configuration. You receive this event after communication is established.

You can use the following methods to troubleshoot this issue:

If auto-approval for manually installed agents is not enabled, confirm that the agent is approved.

Ensure that the following ports are enabled for communication:

TCP port 5723, and TCP and UDP port 389 for LDAP.

TCP and UDP port 88 for Kerberos authentication.

TCP and UDP port 53 for DNS server.

This event can potentially indicate that Kerberos authentication is failing. Check for Kerberos errors in the Event Logs and troubleshoot as appropriate.

Check if the DNS suffix has an incorrect domain. For example, the computer is joined to domain1.com but the primary DNS suffix is set to domain2.com.

Make sure the default domain name registry keys are correct. To verify, make sure that the following registry keys match your domain name:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Domain

Also be aware that a duplicate Service Principal Name (SPN) for the Health service can cause Event ID 21016. To find the duplicate SPN, run the following command:

setspn -F -Q MSOMHSvc/<fully qualified name of the Management Server in the event>

If duplicate SPNs are registered, you must remove the SPN for the account that is not being used for the Management Server Health service.

Event ID 20057 appears on the Management Server

An example of this event is shown below.

Log Name: Operations Manager
Source: OpsMgr Connector
Date: time
Event ID: 20057
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: ComputerName
Description: Failed to initialize security context for target MSOMHSvc/****** The error returned is 0x80090311(No authority could be contacted for authentication.). This error can apply to either the Kerberos or the SChannel package.

Troubleshooting

Event IDs 21006, 21016 and 20057 are usually caused by firewalls or network problems that prevent communication over the required ports. To troubleshoot this issue, check the firewalls between the client agent and the Management Server. The following ports must be open to enable correct authentication and communication:

TCP port 5723 on the Management Server

TCP and UDP port 389 for LDAP

TCP and UDP port 88 for Kerberos authentication

TCP and UDP port 53 for DNS

Event IDs 2010 and 2003 appear on the Client Agent

An example of Event ID 2003 is shown below.

Log Name: Operations Manager
Source: HealthService
Date: time
Event ID: 2003
Task Category: Health Service
Level: Information
Keywords: Classic
User: N/A
Computer: ComputerName
Description: No management groups were started. This may either be because no management groups are currently configured or a configured management group failed to start. The Health Service will wait for policy from Active Directory configuring a management group to run.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="HealthService" />
<EventID Qualifiers="16384">2003</EventID>
<Level>4</Level>
<Task>1</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2015-02-21T21:06:04.000000000Z" />
<EventRecordID>84156</EventRecordID>
<Channel>Operations Manager</Channel>
<Computer>ComputerName</Computer>
<Security />
</System>
<EventData></EventData>
</Event>

Troubleshooting

If the agent is using Active Directory assignment, the event logs will also indicate a problem communicating with Active Directory.

If you see these event logs, confirm that the client agent can access Active Directory. Check firewalls, name resolution and general network connectivity.

There is Event ID 20070 combined with Event ID 21016

Examples of these events are included below.

Log Name: Operations Manager
Source: OpsMgr Connector
Date: 6/13/2014 10:13:39 PM
Event ID: 21016
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: *******
Description: OpsMgr was unable to set up a communications channel to ******* and there are no failover hosts. Communication will resume when ******* is available and communication from this computer is allowed.

=====

Log Name: Operations Manager
Source: OpsMgr Connector
Date: 6/13/2014 10:13:37 PM
Event ID: 20070
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: *******
Description: The OpsMgr Connector connected to *******, but the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration. Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

Troubleshooting

When you see these events, it indicates that authentication occurred but then the connection was closed. This usually occurs because the agent hasn’t been configured. To verify this, check whether Event ID 20000 ("A device which is not part of this management group has attempted to access this health service") is received on the Management Server.

These event logs can also occur if client agents are stuck in a Pending status and not visible in the console.

To verify, run the following command to check whether the agents are listed for manual approval:

Get-SCOMPendingManagement

If so, you can resolve this by running the following command to manually approve the agents:

Get-SCOMPendingManagement | Approve-SCOMPendingManagement

Event ID 21023 appears on the Client Agent, while Event IDs 29120, 29181 and 21024 appear on the Management Server

Some examples of these events are included below.

Log Name: Operations Manager
Source: OpsMgr Connector
Event ID: 21023
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: ******
Description: OpsMgr has no configuration for management group ***** and is requesting new configuration from the Configuration Service.

NOTE The default value for the placeholder nn is 30 seconds. You can change this value to control the timeout for delta synchronization.

There are other OpsMgr Connector Event IDs not listed above

Other OpsMgr Connector error events and the corresponding troubleshooting suggestions are listed below.

Event: 20050
Description: The specified certificate could not be loaded because the Enhanced Key Usage that is specified does not meet OpsMgr requirements. The certificate must have the following usage types: %n %n Server Authentication (1.3.6.1.5.5.7.3.1)%n Client Authentication (1.3.6.1.5.5.7.3.2)%n
More information: Confirm the object identifier on the certificate.

Event: 20057
Description: Failed to initialize security context for target %1 The error returned is %2(%3). This error can apply to either the Kerberos package or the SChannel package.
More information: Check for duplicate or incorrect SPNs.

Event: 20066
Description: A certificate for use with Mutual Authentication was specified. However, that certificate could not be found. The ability for this Health Service to communicate will likely be affected.
More information: Check the certificate.

Event: 20068
Description: The certificate that is specified in the registry at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings cannot be used for authentication because the certificate does not contain a usable private key or because the private key is not present. The error is %1(%2).
More information: Check for a missing or unassociated private key. Investigate the certificate. Re-import the certificate, or create a new certificate and import it.

Event: 20069
Description: The specified certificate could not be loaded because the KeySpec must be AT_KEYEXCHANGE
More information: Check the certificate. Check the object identifier on the certificate.

Event: 20070
Description: The OpsMgr Connector connected to %1. However, the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server or that the server has not received configuration. Check the event log on the server for the presence of 20000 events. These indicate that agents that are not approved are trying to connect.
More information: Authentication occurred but the connection was closed. Confirm that ports are open and check agent pending approval.

Event: 20071
Description: The OpsMgr Connector connected to %1. However, the connection was closed immediately without authentication occurring. The most likely cause of this error is a failure to authenticate either this agent or the server. Check the event log on the server and on the agent for events that indicate a failure to authenticate.
More information: Authentication has failed. Check firewalls and port 5723. The agent computer must be able to reach port 5723 on the Management Server. Also confirm that TCP and UDP port 389 for LDAP, port 88 for Kerberos and port 53 for DNS are available.

Event: 20072
Description: The remote certificate %1 was not trusted. The error is %2(%3).
More information: Check whether the certificate is located in the trusted store.

Event: 20077
Description: The certificate that is specified in the registry at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings cannot be used for authentication because the certificate cannot be queried for property information. The specific error is %2(%3).%n %n. This typically means that no private key was included with the certificate. Please double-check to make sure that the certificate contains a private key.
More information: There is a missing or unassociated private key. Investigate the certificate. Re-import the certificate, or create a new certificate and import it.

Event: 21001
Description: The OpsMgr Connector could not connect to %1 because mutual authentication failed. Verify that the SPN is registered correctly on the server and that, if the server is in a separate domain, there is a full-trust relationship between the two domains.
More information: Check SPN registration.

Event: 21005
Description: The OpsMgr Connector could not resolve the IP for %1. The error code is %2(%3). Please verify that DNS is working correctly in your environment.
More information: This is usually a name resolution issue. Check DNS.

Event: 21006
Description: The OpsMgr Connector could not connect to %1:%2. The error code is %3(%4). Please verify that there is network connectivity, that the server is running and has registered its listening port, and that there are no firewalls that are blocking traffic to the destination.
More information: This is likely a general connectivity issue. Check the firewalls and confirm that port 5723 is open.

Event: 21007
Description: The OpsMgr Connector cannot create a mutually authenticated connection to %1 because it is not in a trusted domain.
More information: A trust is not established. Confirm that the certificate is in place and is configured correctly.

Event: 21016
Description: OpsMgr could not set up a communications channel to %1, and there are no failover hosts. Communication will resume when %1 is available and communication from this computer is enabled.
More information: This usually indicates an authentication failure. Confirm that the agent was approved for monitoring and that all ports are open.

Event: 21021
Description: No certificate could be loaded or created. This Health Service will be unable to communicate with other health services. Look for previous events in the event log for more detail.
More information: Check the certificate.

Event: 21022
Description: No certificate was specified. This Health Service will be unable to communicate with other health services unless those health services are in a domain that has a trust relationship with this domain. If this health service has to communicate with health services in untrusted domains, please configure a certificate.
More information: Check the certificate.

Event: 21035
Description: Registration of an SPN for this computer with the "%1" service class failed with error "%2." This may cause Kerberos authentication to or from this Health Service to fail.
More information: This indicates a problem with SPN registration. Investigate SPNs for Kerberos authentication.

Event: 21036
Description: The certificate that is specified in the registry at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings cannot be used for authentication. The error is %1(%2).
More information: This is usually a missing or unassociated private key. Investigate the certificate. Re-import the certificate, or create a new certificate and import it.
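When you review an exported event log, a small lookup can point you to the right row quickly. The following Python sketch condenses the "More information" suggestions above into four short hints and finds event IDs in pasted event text. The hint wording is a summary written for this sketch and not Microsoft's text, the function names are illustrative, and the sketch assumes that event text contains the label "Event ID:" followed by the number.

```python
import re

CERTIFICATE = "Check the certificate, its private key, and the trusted root store."
SPN = "Check for duplicate, incorrect, or unregistered SPNs."
DNS = "Check DNS name resolution."
PORTS = "Check firewalls and confirm that port 5723 is reachable."
APPROVAL = "Confirm that the agent is approved and that the required ports are open."
UNKNOWN = "No suggestion is listed for this event ID."

# Event ID -> condensed hint, grouped the same way as the list above.
HINTS = {
    20050: CERTIFICATE, 20066: CERTIFICATE, 20068: CERTIFICATE,
    20069: CERTIFICATE, 20072: CERTIFICATE, 20077: CERTIFICATE,
    21007: CERTIFICATE, 21021: CERTIFICATE, 21022: CERTIFICATE,
    21036: CERTIFICATE,
    20057: SPN, 21001: SPN, 21035: SPN,
    21005: DNS,
    20071: PORTS, 21006: PORTS,
    20070: APPROVAL, 21016: APPROVAL,
}


def hint_for(event_id):
    """Return the condensed hint for an OpsMgr Connector event ID."""
    return HINTS.get(event_id, UNKNOWN)


def find_event_ids(text):
    """Return all event IDs that follow an 'Event ID:' label, in order of appearance."""
    return [int(value) for value in re.findall(r"Event ID:\s*(\d+)", text)]


if __name__ == "__main__":
    sample = "Source: OpsMgr Connector\nEvent ID: 21006\nLevel: Error\n"
    for event_id in find_event_ids(sample):
        print(event_id, hint_for(event_id))
```

The hints are only a starting point; the full description and suggestion for each event remain in the list above.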