Author: Daniel

vSAN 6.6 is the 6th generation of the product, and this release brings more than 20 new features and enhancements, such as:

Native encryption for data-at-rest

Compliance certifications

Resilient management independent of vCenter

Degraded Disk Handling v2.0 (DDHv2)

Smart repairs and enhanced rebalancing

Intelligent rebuilds using partial repairs

Certified file service & data protection solutions

Stretched clusters with local failure protection

Site affinity for stretched clusters

1-click witness change for Stretched Cluster

vSAN Management Pack for vRealize

Enhanced vSAN SDK and PowerCLI

Simple networking with Unicast

vSAN Cloud Analytics with real-time support notification and recommendations

vSAN Config Assist with 1-click hardware lifecycle management

Extended vSAN Health Services

vSAN Easy Install with 1-click fixes

Up to 50% greater IOPS for all-flash with optimized checksum and dedupe

Support for new next-gen workloads

vSAN for Photon in Photon Platform 1.1

Day 0 support for latest flash technologies

Expanded caching tier choice

Docker Volume Driver 1.1

OK, now let's review the main enhancements:

vSAN 6.6 introduces the industry's first native HCI security solution. vSAN now offers data-at-rest encryption that is completely hardware-agnostic. No more concern about someone walking off with a drive or breaking into a less secure edge IT location and stealing hardware. Encryption is applied at the cluster level, and any data written to a vSAN storage device, at both the cache layer and the persistent layer, can now be fully encrypted. vSAN 6.6 also supports two-factor authentication, including SecurID and CAC.

Certified file services and data protection solutions are available from 3rd-party partners in the VMware Ready for vSAN Program, enabling customers to extend and complement their vSAN environment with proven, industry-leading solutions. These solutions provide customers with detailed guidance on how to complement vSAN. (EMC NetWorker is available today, with new solutions coming soon.)

vSAN stretched cluster was released in Q3’15 to provide an Active-Active solution. vSAN 6.6 adds a major new capability that will deliver a highly-available stretched cluster that addresses the highest resiliency requirements of data centers. vSAN 6.6 adds support for local failure protection that can provide resiliency against both site failures and local component failures.

PowerCLI Updates: Full featured vSAN PowerCLI cmdlets enable full automation that includes all the latest features. SDK/API updates also enable enterprise-class automation that brings cloud management flexibility to storage by supporting REST APIs.

The VMware vRealize Operations Management Pack for vSAN, released recently, provides customers with native integration for simplified management and monitoring. The vSAN management pack is specifically designed to accelerate time to production with vSAN, optimize application performance for workloads running on vSAN, and provide unified management for the Software-Defined Datacenter (SDDC). It provides additional options for monitoring, managing, and troubleshooting vSAN along with the end-to-end infrastructure solutions.

Finally, vSAN 6.6 is well suited for next-generation applications. Performance improvements, especially when combined with new flash technologies for write-intensive applications, enable vSAN to address more emerging applications like Big Data. The vSAN team has also tested and released numerous reference architectures for these types of solutions, including Big Data, Splunk and InterSystems Cache.

Recently we had some strange problems with our 6.5 lab vCenter (Windows version with an MSSQL Server DB), which crashed frequently. After some digging in the vpxd logs, it seemed to be related to VC DB permissions:

During deployment of a Cisco proxy appliance, we discovered a problem. According to Cisco, to resolve this problem, "Net.ReversePathFwdCheckPromisc" should be set to "1" on the ESXi hosts.

The question is: do you know of any negative effects such a change could cause? We believed there must be a reason why this option is set to 0 by default. That's why I decided to figure out what it is used for.

After some research I was able to find the answer:

Setting Net.ReversePathFwdCheckPromisc = 1 means you expect the reverse filters to filter mirrored packets, to prevent multicast packets from getting duplicated.

Note: If the value of the Net.ReversePathFwdCheckPromisc configuration option is changed while the ESXi instance is running, you need to enable or re-enable promiscuous mode for the configuration change to take effect.
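On an ESXi host, this advanced option can be inspected and changed from the shell with esxcli. The commands below are a sketch; the option path is an assumption derived from the setting name, so verify it on your build:

```shell
# Read the current value (0 = default, reverse-path check for
# promiscuous mode disabled)
esxcli system settings advanced list -o /Net/ReversePathFwdCheckPromisc

# Enable reverse-path filtering of mirrored packets
esxcli system settings advanced set -o /Net/ReversePathFwdCheckPromisc -i 1
```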

The reason you would use promiscuous mode depends on the requirement and configuration. Please check the below KB Article:

This option is not enabled by default because the correct value depends on the vSwitch configuration, which has many configurable options and cannot be predicted in advance.

VMware does not advise enabling this option unless you have a use case with teamed uplinks and, ideally, monitoring software running on the VMs. When promiscuous mode is enabled at the port group level, objects defined within that port group have the option of receiving all incoming traffic on the vSwitch. Interfaces and virtual machines within the port group will be able to see all traffic passing on the vSwitch, which can impact VM performance.

Should the ESXi server be rebooted for this change to take effect? The answer is yes; and yes, you can enable this option with VMs running on the existing port group.

This is a mini article to start our Q&A set: a set of real-life questions whose answers are not easy to find 😉
Recently I received a question related to advanced settings for SAP applications on the vSphere platform:
"One of our customers asked us to set the following option on their virtual systems: Misc.GuestLibAllowHostInfo. This is according to SAP note 1606643, where SAP requires reconfiguring the virtual system's default configuration. I can't find detailed information on which host data would be exposed to the virtual system. Could you please point me to documentation or describe which information is transferred from the host to the virtual systems?"

After some research I was able to find the answer:

If enabled, "Misc.GuestLibAllowHostInfo" and "tools.guestlib.enableHostInfo" allow the guest OS to access some of the ESXi host's configuration, mainly performance metrics, e.g. how many CPU cores the host has, their utilization and contention, etc. No confidential information from other customers would be visible; however, it may give the users of those SAP VMs access to performance/resource information you may not want to share.
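For reference, a sketch of where each knob lives; the esxcli option path and the exact .vmx syntax are assumptions based on the setting names, so check the SAP note and VMware documentation for your version:

```shell
# Host-wide switch: allow guests to request host info
# (ESXi advanced option; path assumed from the setting name)
esxcli system settings advanced set -o /Misc/GuestLibAllowHostInfo -i 1

# Per-VM switch: the corresponding .vmx entry
# (add with the VM powered off):
#   tools.guestlib.enableHostInfo = "TRUE"
```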

The following document outlines the effect of the changes as I have described above.

I believe the "might use the information to perform further attacks on the host" warning could only apply to other vulnerabilities that may exist for the particular hardware information the guest OS can gather from the ESXi host.
Other than that, I am not aware of any other concern to worry about.

In the last part of this small series, we discussed the theoretical background of the components and technologies involved in adding an ESXi host to a Windows AD environment. Now it is time to describe troubleshooting options and some real-life problems with solutions.

Let's start by dividing all ESXi/Likewise issues into categories:

Domain Join Failures

Here are the most common reasons an attempt to join a domain fails:

The user name or password of the account used to join the domain is incorrect.

The name of the domain is mistyped.

The name of the OU is mistyped.

The local hostname is invalid.

The domain controller is unreachable from the client because of a firewall or because the NTP service is not running on the domain controller.

Make sure the nameserver entry in /etc/resolv.conf contains the IP address of a DNS server that can resolve the name of the domain you are trying to join.

Make Sure nsswitch.conf Is Configured to Check DNS for Host Names

The /etc/nsswitch.conf file must contain the following line:

hosts: files dns
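A quick way to verify this is to grep the file. The sketch below runs against a sample file so it is self-contained; on a real host, point the grep at /etc/nsswitch.conf instead:

```shell
# Create a sample nsswitch.conf fragment for the check
cat > /tmp/nsswitch.sample <<'EOF'
hosts: files dns
EOF

# Verify that 'dns' is listed as a source for host lookups
if grep -qE '^hosts:.*\bdns\b' /tmp/nsswitch.sample; then
  echo "dns enabled for host lookups"
else
  echo "dns missing from hosts line"
fi
```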

Ensure that DNS Queries Are Not Using the Wrong Network Interface Card

If the ESX host is multi-homed, the DNS queries might be going out the wrong network interface card. Temporarily disable all the NICs except for the card on the same subnet as your domain controller or DNS server and then test DNS lookups to the AD domain. If this works, re-enable all the NICs and edit the local or network routing tables so that the AD domain controllers are accessible from the host.

Determine Whether the DNS Server Is Configured to Return SRV Records

Your DNS server must be set to return SRV records so the domain controller can be located. It is common for non-Windows (BIND) DNS servers not to be configured to return SRV records.

Diagnose by executing the following command:

nslookup -q=srv _ldap._tcp.ADdomainToJoin.com

Make Sure that the Global Catalog Is Accessible

The global catalog for Active Directory must be accessible. Diagnose by executing the following command:

nslookup -q=srv _ldap._tcp.gc._msdcs.ADrootDomain.com

From the list of IP addresses in the results, choose one or more addresses and test whether they are accessible on Port 3268 by using telnet.
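Where telnet is not available, the same TCP check can be done with bash's built-in /dev/tcp redirection. GC_HOST below is a placeholder for one of the addresses from the nslookup results:

```shell
# Placeholder – substitute an address returned by the SRV lookup above
GC_HOST=gc.example.com

# Try to open a TCP connection to the global catalog port (3268)
if timeout 3 bash -c "exec 3<>/dev/tcp/${GC_HOST}/3268" 2>/dev/null; then
  echo "port 3268 reachable"
else
  echo "port 3268 unreachable"
fi
```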

Verify that the Client Can Connect to the Domain on Port 123

The Windows Time service must be running on the domain controller.

On a Linux computer, run the following command as root:

ntpdate -d -u DC_hostname

Log-in/Authentication issues

Make Sure You Are Joined to the Domain

Check ‘lw-lsa get-status’

Clear the Cache

Clear the cache to ensure that the client computer recognizes the user’s ID.

# ad-cache --delete-all

Clear the Likewise Kerberos cache to make sure there is not an issue. Execute the following command at the shell prompt with the user account that you are troubleshooting:

~# kdestroy

Check the Status of the Likewise Authentication Daemon

#/etc/init.d/lsassd status

Check Communication between the Likewise Daemon and AD

Verify that you can ping the DC from the ESXi host.

Make Sure the AD Authentication Provider Is Running

# lw-lsa get-status

If the result does not include the AD authentication provider, or indicates that it is offline, restart the authentication daemon.

Check whether you can log on with SSH by executing the following command:

ssh DOMAIN\\username@localhost

Lsassd crashes (for various reasons, e.g. during trust enumeration)

Analyze the lsassd, netlogond, and lwiod logs to see exactly where the Likewise daemon is crashing.

Look into the hostd logs and a tcpdump capture to get more information.

Kerberos related issues

Start by looking into packet captures (on both sides, ESXi and AD) to see whether proper TGT and TGS tickets are being issued.

The issue can also be related to the Kerberos cache; in that case, empty the Kerberos cache using the 'kdestroy' command mentioned above.
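On the ESXi side, such a capture can be taken with the built-in tcpdump-uw. The interface name below is an assumption (check yours with esxcfg-vmknic -l); the filter covers Kerberos (port 88) and DNS (port 53):

```shell
# Capture Kerberos and DNS traffic on the management vmkernel interface
# for later analysis in Wireshark (vmk0 is an assumed interface name)
tcpdump-uw -i vmk0 -s 0 -w /tmp/krb.pcap port 88 or port 53
```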

Hostd crash in Likewise code

Gather full log bundle and engage VMware GSS

Windows AD server related issues

Gather guest OS logs and engage MS Support.

OK, so now we have all the troubleshooting options and methodology in one place; now it is time for a real-life story based on one of my recent service requests. A customer was unable to log in using Active Directory credentials. It showed invalid credentials even though "Authentication Services" showed that the host was joined to the correct domain. The issue was seen on most hosts in the environment; only 2 hosts did not suffer from the problem, and we could not find any difference in their configuration. The customer was running the latest 6.0 build: 4600944.

Some other symptoms observed while troubleshooting the issue step by step:

We found that on the problematic ESXi hosts IPv6 communication was disabled, but the DC was still using IPv6. After a couple of tests we confirmed that after enabling IPv6 on the ESXi hosts, or disabling it completely on the DC side, there was no longer any error when adding a host to the domain or during DC authentication.

To clear up this whole situation, we decided to perform additional investigation with VMware Support. GSS confirmed that they located the issue:

"…with the newer versions (vSphere 6) of ESXi, in case the host receives the KDC address in IPv6 format, it will try to connect over IPv6. If the host has IPv6 disabled, it will fail to join the domain."

Last week I had to troubleshoot a strange issue related to Active Directory integration with ESXi (version 6.0), which motivated me to prepare a small (two-article) series about ESXi/Likewise integration and troubleshooting, based on my latest experience.

VMware uses PowerBroker Identity Services (formerly known as Likewise) for adding ESXi hosts to a Windows AD environment. As usual, it is good to begin with some theoretical background about the related components and technologies, and we describe them all in this part.

Below are some of the basics of PAM and Kerberos:

PAM (Pluggable Authentication Modules) – a mechanism to integrate multiple low-level authentication schemes into a high-level API. Application programs like hostd, dcui, etc. use PAM for creating users and authenticating them. They refer to the system-auth file, which in turn refers to /etc/security/login.map. login.map maps the 'vpxa' user to system-auth-local, and all other users are mapped to system-auth-generic. PAM on its own cannot implement Kerberos; it is not possible for a PAM module to request a Kerberos service ticket (TGS) from a Kerberos key distribution center (KDC).
Kerberos – the Kerberos protocol is designed to provide reliable authentication over open and insecure networks, where communications between the hosts belonging to it may be intercepted. So we can say Kerberos is an authentication protocol for trusted hosts on untrusted networks.

If you are interested in deeper knowledge on this topic, take a look at this Kerberos tutorial: http://www.kerberos.org/software/tutorial.html

Before we describe the communication with ESXi, let's gather together all the Likewise components:
• Lsassd – The Likewise authentication daemon handles authentication, authorization, caching, and idmap lookups,
• Netlogond – Detects the optimal domain controller and global catalog and caches the data,
• Lwiod – The Likewise input-output service. It communicates over SMB with SMB servers,
• Caches – To maintain the current state and to improve performance, the Likewise agent caches information in several files, all of which are in /etc/likewise/db/
• lsass-adcache.filedb – Cache managed by the AD authentication provider,
• netlogon-cache.filedb – Domain controller affinity cache, managed by netlogond,
• pstore.filedb – Repository storing the join state and machine password.

OK, now it’s time to consider how Likewise extends Kerberos authentication to ESXi:

1. The user logs in to ESXi (C# or Web Client),
2. The username and password are sent to PAM,
3. The pam_lsass.so library communicates with Lsassd,
4. From the username and password, Lsassd generates a secret key,
5. Using the secret key, Lsassd requests a TGT from AD's KDC,
6. The KDC verifies the secret key and then grants the ESXi host a TGT,
7. The ESXi host and the KDC exchange messages to authenticate the client,
8. Lsassd can use the TGT to request service tickets for other services such as SSH.
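The client side of this exchange (steps 4–8) can be reproduced manually with the MIT Kerberos tools where they are available; the realm and service principal names below are placeholders:

```shell
kinit user@EXAMPLE.COM         # AS-REQ/AS-REP: obtain a TGT for the user
klist                          # inspect the cached TGT
kvno host/esxi01.example.com   # TGS-REQ/TGS-REP: obtain a service ticket
kdestroy                       # clear the credential cache when done
```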

To clarify further, let's discuss the important algorithms (netlogon) related to this process, to address some common questions:

1. How Netlogond finds the best DC (prioritization) ?

Likewise Netlogon obtains a list of candidate Domain Controllers using DNS. The algorithm for doing this is based on the algorithm used in Windows Netlogon. Each candidate Domain Controller which matches the site criteria is queried with a CLDAP request for the Netlogon attribute. The time to respond for each Domain Controller is stored as PingTime in the DomainControllerInfo output parameter. The Domain Controller with the lowest PingTime is returned to the caller.
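The selection rule ("lowest PingTime wins") can be illustrated with a tiny shell sketch; the hostnames and response times below are made up for the example:

```shell
# Given "dc-hostname ping-ms" pairs, pick the DC that answered the
# CLDAP ping fastest, mimicking netlogond's lowest-PingTime choice.
printf '%s\n' \
  "dc1.example.com 12" \
  "dc2.example.com 4" \
  "dc3.example.com 9" \
  | sort -k2 -n | head -n1 | awk '{print $1}'
# prints: dc2.example.com
```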

2. Preferred DC

Netlogon attempts to find the domain controller which responds the quickest to CLDAP pings with a preference for domain controllers in the same site. The algorithm is rather complex 😉
If the request includes a site, then the query order is:

a) Preferred domain controller plugin with the requested site
b) DNS with the requested site

The last part of this article discusses the important packets in a tcpdump capture – very useful when troubleshooting problems with joining ESXi to a domain:
1. CLDAP – Usage of CLDAP packets depends upon the attribute:

If attribute = netlogon -> these CLDAP pings are used by netlogond to verify the aliveness of the domain controller and to check whether the domain controller matches a specific set of requirements (netlogon version (NtVer), etc.).
If attribute = time -> these are used by netlogon for selecting the nearest DC during the domain controller discovery phase.
2. ARP – the Address Resolution Protocol is used to resolve a network-layer address into a link-layer address, i.e. an IP address into a MAC address. These are common packets, not specific to AD.

3. DNS – DNS queries related to srv records. Microsoft decided to use SRV records as a key part of the procedure whereby a client finds a domain controller. So where do these records come from? They are registered with DNS by the NetLogon service of a domain controller when it starts. There are actually quite a few of these records, but for right now let’s just look at two of them, the ones that have to do with domain controllers. They are in the following formats:
_ldap._tcp.dc._msdcs.dnsdomainname
_ldap._tcp.sitename._sites.dc._msdcs.dnsdomainname
4. KRB5 – for Kerberos v5 you will see four packets, viz. AS-REQ, AS-REP, TGS-REQ, TGS-REP:
• AS_REQ is the initial user authentication request (i.e. made with kinit) This message is directed to the KDC component known as Authentication Server (AS);
• AS_REP is the reply of the Authentication Server to the previous request. Basically it contains the TGT (encrypted using the TGS secret key) and the session key (encrypted using the secret key of the requesting user);
• TGS_REQ is the request from the client to the Ticket Granting Server (TGS) for a service ticket. This packet includes the TGT obtained from the previous message and an authenticator generated by the client and encrypted with the session key;
• TGS_REP is the reply of the Ticket Granting Server to the previous request. Located inside is the requested service ticket (encrypted with the secret key of the service) and a service session key generated by TGS and encrypted using the previous session key generated by the AS;

5. LDAP : A standards-based protocol that is used for communication between directory clients and a directory service. LDAP is the primary directory access protocol for Active Directory. LDAP searches are the most common LDAP operations that are performed against an Active Directory domain controller. An LDAP search retrieves information about all objects within a specific scope that have certain characteristics, for example, the telephone number of every person in a department.
6. NBNS – NBNS serves much the same purpose as DNS does: translating human-readable names to IP addresses.

7. TCP – network traffic that uses TCP as the transmission protocol, viz. Kerberos, SSH, HTTPS, LDAP, etc. Processes transmit data by calling on TCP and passing buffers of data as arguments. TCP packages the data from these buffers into segments and calls on the internet module (e.g. IP) to transmit each segment to the destination TCP.

8. SMB – Server Message Block, also known as the Common Internet File System (CIFS), operates as an application-layer network protocol mainly used to provide shared access to files, printers, serial ports, and miscellaneous communication between nodes on a network. It also provides an authenticated inter-process communication mechanism.

Now we are prepared for troubleshooting! Stay tuned for the next part 🙂

Recently I received quite an interesting question: what is the supported maximum number of tags in vCenter 6.0U2?

The mischievous author of the question is a good friend of mine and a VMware administrator in one person. He asked about the tags limit because he wants to use them to provide more information about each of his production VMs – roughly speaking, he needs to create about 20,000 tags.

I thought, OK, give me a couple of seconds to verify this, and looked quickly at the VMware configuration maximums… a couple of minutes later it was clear that this is not an easy question 😉

Furthermore, after some additional research (there is no clear statement in the official documentation), we decided to perform tests in a lab environment!

A couple of days ago I hit a strange issue with the vCenter 6.x appliance update process. Attempting to upgrade vCenter Server 6.0 U1 to 6.0 U2, I encountered the following error: Installation of component VMware Postgres 9.3.9.0 Installer failed with error code '3010'. The installation rolled back after clicking the OK button. I was able to find KB 2144389 describing a similar issue, but the workaround didn't work. After rolling back to a snapshot taken before the update, we also tried to update vCenter to 6.0 U1b, with the same result.

To determine whether this is an issue with installing the Postgres service only, or an issue with another component holding a Java process:
I. Stop ALL VMware services and set them to manual.
II. Restart the vCenter machine and ensure no services are restarted.
III. Open the installation media, navigate to the Postgres installation MSI, and attempt to install just this component.

Today my colleague (a VMware administrator) asked me for a small favour: help performing an RCA (root cause analysis) related to a production VM that recently had a problem. For some reason the VM was migrated (my colleague states that it happened without administrative intervention) to another ESXi host that did not have the proper network configuration for this VM, causing an outage for the whole system. I asked about the time of the issue and we went deeper into the logs to find out what exactly happened.
We started with the VM logs, in the file called vmware.log.

According to VMware KB 2097159, „VMX has left the building: 0" is an informational message caused by powering off a virtual machine or performing a vMotion of the virtual machine to a new host. It can be safely ignored, so we had our first clue related to the VM migration.
Next we moved to fdm.log (using timestamps from vmware.log) and, surprise surprise, the VM was for some reason powered off 1 hour before the issue occurred:

So we confirmed that shortly before the issue occurred, the VM was powered off for snapshot consolidation. At this point we assumed this might be related to the general issue, but we decided to verify the vobd log (less verbose than vmkernel, and it gives a good view of ESXi health in terms of storage and networking):