I’m fairly new to SRM, but even so this one seemed like a real head-scratcher! If you happen to be using CA-signed certificates on your “protected site” and “recovery site” vCenter servers, you encounter SSLHandShake errors when you come to link the two SRM sites. Basically, SRM assumes you want to use certificates for authentication because you’re using signed certificates. If you use the default self-signed certificates, SRM will default to using password authentication (see SRM Authentication). The process fails at the “configure connection” stage in two cases: if one of your vCenter servers has a CA-signed certificate and the other does not (SRM throws an error that they are using different authentication methods), or if either SRM installation is using a self-signed certificate (SRM throws an error that the certificate or CA could not be trusted).

SRM server ‘vc-02.definit.local’ cannot do a pair operation. The reason is: Local and remote servers are using different authentication methods.
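The pairing rules described above can be sketched as a small decision function. This is only a toy model of the behaviour as I observed it, not SRM's actual logic: the `Site` class, the `pairing_problem` function and the site name `vc-01` are made up for illustration, and the second message is a paraphrase rather than the exact error text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Site:
    """Hypothetical description of one side of an SRM pairing."""
    name: str
    vcenter_ca_signed: bool  # vCenter uses a CA-signed certificate
    srm_ca_signed: bool      # SRM uses a CA-signed certificate


def pairing_problem(local: Site, remote: Site) -> Optional[str]:
    """Return a description of why pairing would fail, or None if it should work."""
    # One vCenter CA-signed and the other not: the two sides end up on
    # different authentication methods (certificate vs. password).
    if local.vcenter_ca_signed != remote.vcenter_ca_signed:
        return "Local and remote servers are using different authentication methods."
    # Both vCenters CA-signed: certificate authentication is assumed, so a
    # self-signed SRM certificate on either side will not be trusted.
    if local.vcenter_ca_signed:
        for site in (local, remote):
            if not site.srm_ca_signed:
                return f"SRM certificate at {site.name} is self-signed and would not be trusted"
    # Both self-signed: password authentication on both sides.
    return None


protected = Site("vc-01", vcenter_ca_signed=True, srm_ca_signed=True)
recovery = Site("vc-02", vcenter_ca_signed=False, srm_ca_signed=False)
print(pairing_problem(protected, recovery))
```

The mixed case in the example corresponds to the error quoted above; making both sites consistent (all CA-signed, or all self-signed) is what clears it in this model.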

There are different schools of thought as to whether you should have SSH enabled on your hosts. VMware recommend it is disabled. With SSH disabled there is no possibility of attack over SSH, so that’s the “most secure” option. Of course, in the real world there’s a balance between “most secure” and “usability” (e.g. the most secure host is powered off and physically isolated from the network, but you can’t run any workloads). My preferred route is to have it enabled but locked down.

Note: VMware use the term “ESXi Shell”, while most of us would term it “SSH”. The two are used interchangeably in this article, although there is a slight difference: you can have the ESXi Shell enabled but SSH disabled, which means you can only access the shell via the DCUI. For the sake of this article, assume ESXi Shell and SSH are the same.

I learned something new today: SCOM 2007 R2 certificate-based communication checks not only the validity of the certificate you use, but also that of the CA that issued it. Let me expand:

As in many organisations, there is a root CA (we’ll call it ROOTCA01) and a subordinate CA (we’ll call that SUBCA01). OPSMGM01 has a certificate to identify itself and has the certificates for ROOTCA01 and SUBCA01 in its Trusted Root Certification Authorities store.

The certificate that secures the connection between the OpsMgr Gateway (OPSGW01) and the OpsMgr Management Server (OPSMGM01) is issued by SUBCA01 and is installed on OPSGW01. To validate the certificate chain, SUBCA01’s certificate is also installed in the Trusted Root Certification Authorities store. Opening OPSGW01’s certificate and examining the Certificate Path tab shows the certificate is valid all the way up to the issuing CA, SUBCA01.

The connection will not work – OPSGW01 logs the following events:

Log Name: Operations Manager
Source: OpsMgr Connector
Date: 05/01/2012 10:18:28
Event ID: 21016
Level: Error
Computer: opsgw01.definit.co.uk
Description: OpsMgr was unable to set up a communications channel to opsmgm01.definit.co.uk and there are no failover hosts. Communication will resume when opsmgm01.definit.co.uk is available and communication from this computer is allowed.

Log Name: Operations Manager
Source: OpsMgr Connector
Date: 05/01/2012 10:18:25
Event ID: 20070
Level: Error
Computer: opsgw01.definit.co.uk
Description: The OpsMgr Connector connected to opsmgm01.definit.co.uk, but the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration. Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

Log Name: Operations Manager
Source: OpsMgr Connector
Date: 05/01/2012 10:18:24
Event ID: 20067
Level: Warning
Computer: opsgw01.definit.co.uk
Description: A device at IP xxx.xxx.xxx.xxx:5723 attempted to connect but the certificate presented by the device was invalid. The connection from the device has been rejected. The failure code on the certificate was 0x800B0109 (A certificate chain processed, but terminated in a root certificate which is not trusted by the trust provider.).

It’s the last event that led me to check the certificate chain for the SUBCA01 certificate on OPSGW01: it was installed and trusted, but it did not validate up the chain to ROOTCA01 because the root certificate was missing. Installing the ROOTCA01 certificate resolved this issue.
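To illustrate why trusting SUBCA01 alone was not enough, here is a minimal sketch of a chain walk. It is a deliberately simplified model, not how Windows or OpsMgr actually validate certificates: it assumes a chain only passes if it ends in a self-signed certificate that is in the trusted root store. The CA and server names are the ones used above; `ISSUER_OF` and `chain_ends_in_trusted_root` are hypothetical names.

```python
# Who issued each certificate (subject -> issuer). A root CA issues itself.
ISSUER_OF = {
    "OPSMGM01": "SUBCA01",
    "OPSGW01": "SUBCA01",
    "SUBCA01": "ROOTCA01",
    "ROOTCA01": "ROOTCA01",
}


def chain_ends_in_trusted_root(subject, issuer_of, trusted_roots):
    """Walk from subject up to a self-signed root and check that root is trusted."""
    seen = set()
    current = subject
    while current not in seen:
        seen.add(current)
        issuer = issuer_of.get(current)
        if issuer is None:
            return False  # issuer unknown: the chain cannot be built
        if issuer == current:
            # Self-signed, so this is the root; it must be explicitly trusted.
            return current in trusted_roots
        current = issuer
    return False  # loop in the issuer data; treat as invalid


# Before the fix: only SUBCA01 was in the gateway's trusted store.
print(chain_ends_in_trusted_root("OPSMGM01", ISSUER_OF, {"SUBCA01"}))
# After the fix: ROOTCA01 installed as well.
print(chain_ends_in_trusted_root("OPSMGM01", ISSUER_OF, {"SUBCA01", "ROOTCA01"}))
```

In this model the first call fails because the walk carries on past SUBCA01 to ROOTCA01, which is not trusted. That matches the 0x800B0109 message in event 20067: the chain was processed, but terminated in a root that is not trusted.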