Category Archives: Data Center Design

Last week, I was asked to present a real-world deployment story at the South West UK VMUG, alongside Nutanix. I talked at a high level about how using Nutanix in a 6K seat VDI deployment not only made my life easier as one of the architects on the solution, but also helped my client meet their requirements easily.

I’ve spent the past 4 months on a fast paced VDI project built upon Nutanix infrastructure, hence the number of posts on this technology recently. The project is now drawing to a close and moving from ‘Project’ status to ‘BAU’. As this transition takes place, I’m tidying up notes and updating documentation. From this, you may see a few blog posts with some quick tips around Nutanix specifically with VMware vSphere architecture.

As you may or may not know, a Nutanix block ships with up to 4 nodes. The nodes are standalone in terms of components and share only the dual power supplies in each block. Each node comes with a total of 5 network ports, as shown in the picture below.

Image courtesy of Nutanix

The IPMI port is a 10/100 ethernet network port for lights out management.

There are 2 x 1GigE ports and 2 x 10GigE ports. Both the 1GigE and 10GigE ports can be added to Virtual Standard Switches or Virtual Distributed Switches in VMware. From what I have seen, people tend to add the 10GigE NICs to a vSwitch (of either flavour) and configure them in an Active/Active fashion, with the 2 x 1GigE ports remaining unused.

This seems to be resilient; however, I discovered (whilst reading documentation, not through hardware failure) that the 2 x 10GigE ports actually reside on the same physical card, so this could be considered a single point of failure. To work around this single point of failure, I would suggest incorporating the 2 x 1GigE network ports into your vSwitch and leaving them in Standby.

With this configuration, if the 10GigE card were to fail, the 1GigE ports would become active and you would not be impacted by VMware HA restarting machines on the remaining nodes in the cluster (Admission Control dependent).

Yes, performance may well be impacted, however I’d strongly suggest alarms and monitoring be configured to scream if this were to happen. I would rather manually place a host into maintenance mode and evict my workloads in a controlled manner rather than have them restarted.
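To make the failover behaviour concrete, here is a minimal Python sketch of the teaming order described above. It is a simplified model rather than anything vSphere runs: the vmnic names and card labels are hypothetical, I have assumed the 1GigE ports sit on a separate onboard controller, and I have assumed a standby uplink is promoted for each active uplink that is lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    name: str        # hypothetical vmnic label
    speed_gbps: int
    card: str        # physical card the port sits on (assumed layout)


def carrying_traffic(active, standby, failed_cards):
    """Return the uplinks in use after the given physical cards fail.

    Simplified model: each lost active uplink is replaced by the next
    healthy standby uplink, in the order the standby list is given.
    """
    healthy_active = [u for u in active if u.card not in failed_cards]
    healthy_standby = [u for u in standby if u.card not in failed_cards]
    lost = len(active) - len(healthy_active)
    return healthy_active + healthy_standby[:lost]


# Both 10GigE ports share one card, so they fail together.
ACTIVE = [Uplink("vmnic0", 10, "10g-card"), Uplink("vmnic1", 10, "10g-card")]
STANDBY = [Uplink("vmnic2", 1, "onboard-1g"), Uplink("vmnic3", 1, "onboard-1g")]

normal = carrying_traffic(ACTIVE, STANDBY, failed_cards=set())
degraded = carrying_traffic(ACTIVE, STANDBY, failed_cards={"10g-card"})
print([u.name for u in normal], [u.name for u in degraded])
```

In the degraded case the host stays on the network at 2 x 1GigE instead of dropping off entirely, which is the point at which the alarms should fire.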

You are working on a large virtual desktop deployment using Active/Active datacenters, with multiple use cases and multiple master images. With an Active/Active setup, your users have the possibility of being in DC1 one day, and DC2 the next.

So, what do you do when you have a requirement for the image to be available in case of a site failure? Nutanix make this easy for us, using protection domains and per-VM backups.

What is a protection domain?

A protection domain is a VM or group of VMs that can be backed up locally on a cluster or replicated on the same schedule to one or more clusters. Protection domains can then be associated with remote sites.

It is worth noting that protection domain names must be unique across sites and a VM can only reside in one protection domain.

A protection domain on a cluster will be in one of two modes:

Active – Manages live VMs, makes, replicates and expires snapshots

Inactive – Receives snapshots from a remote cluster

A Protection Domain manages replication via a Consistency Group.

What is a consistency group?

A Consistency Group is a subset of the VMs within the Protection Domain. All VMs within a Consistency Group will be snapshotted in a crash-consistent manner and have snapshots created at each replication interval.
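The relationship between protection domains, modes and consistency groups can be sketched as a small data model. This is only an illustration of the rules described above, not the Nutanix API: the class names, VM names and the `validate` helper are all hypothetical, and I have assumed the check runs over the protection domains you define yourself.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"      # manages live VMs; makes, replicates and expires snapshots
    INACTIVE = "inactive"  # receives snapshots from a remote cluster


@dataclass
class ProtectionDomain:
    name: str
    mode: Mode
    # consistency group name -> set of VM names snapshotted together
    consistency_groups: dict = field(default_factory=dict)

    def vms(self):
        """All VMs protected by this domain, across its consistency groups."""
        return set().union(*self.consistency_groups.values())


def validate(domains):
    """Return a list of violations of the two rules from the text."""
    problems = []
    seen_names = set()
    owner = {}  # VM name -> name of the protection domain that holds it
    for pd in domains:
        if pd.name in seen_names:
            problems.append(f"duplicate protection domain name: {pd.name}")
        seen_names.add(pd.name)
        for vm in sorted(pd.vms()):
            if vm in owner and owner[vm] != pd.name:
                problems.append(f"{vm} is in both {owner[vm]} and {pd.name}")
            owner.setdefault(vm, pd.name)
    return problems


images = ProtectionDomain("pd-master-images", Mode.ACTIVE,
                          {"cg-images": {"win7-master", "win8-master"}})
servers = ProtectionDomain("pd-servers", Mode.ACTIVE,
                           {"cg-servers": {"win7-master"}})
print(validate([images, servers]))
```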

What is a snapshot?

A snapshot is a read-only copy of the state and data of a VM at a point in time. Snapshots for a VM are crash consistent. This means that the on-disk VMDK images are consistent with a single point in time; the snapshot represents the on-disk data as if the VM had crashed. These snapshots are not, however, application consistent, meaning the application data is not quiesced at the time of the snapshot. With some server workloads this could cause us some issues for recovery, however for our VDI master image this is not an issue, as the master image is likely going to be powered off the majority of the time. Snapshots are copied asynchronously from one cluster to another.

What are per VM Backups?

A per VM backup gives the ability to designate certain VMs for backup to a different site, such as a group of desktop master images. Not all legacy storage vendors offer the ability to replicate at a VM level; normally an entire LUN or volume is replicated at a single time.

Where am I going with this?

There are many solutions to replicate data, however Nutanix provides this capability, albeit at a small cost, within its platform. No additional components are necessary and it even has an SRM plugin. The key feature is that Nutanix integrates with vSphere to make this a seamless process.

“Why do I need to bother backing up the config file of my vCNS Manager, can’t I just snapshot it?”

It’s a good question, and one that involved a little lab testing to play around with.

If you were to snapshot your vCNS manager, which does work from testing in my lab (albeit limited functional testing), then you would be able to restore the vCNS manager from snapshot fairly quickly and efficiently.

The questions I then thought of were:

When is the backup window? (if there is one)

How often would a vCNS snapshot be taken?

How busy is the vCNS manager?

Does a backup restore involve change control or other teams?

The reasons for these questions were simple.

If a vCNS manager was in a relatively busy vCloud environment deploying a number of Edge devices daily, then yes, those Edge devices would continue to run if the manager were to fail. However, if the vCNS manager were only scheduled to have a daily snapshot during a nightly backup window, the restored manager would know nothing about Edge devices deployed since that snapshot.

The officially supported method of backing up vCNS manager is to schedule a backup from the manager itself, to back up the configuration to an FTP/SFTP site.

If the vCNS manager were to fail, you would simply deploy a new vCNS manager (normally within minutes) then re-apply the last saved configuration and get back up and running fairly quickly. Yes, you could argue that if only a single backup was taken daily then we would be in the same boat as with a snapshot. However, it’s much easier and more manageable, in my opinion, to set perhaps an hourly backup (in busy environments) and perhaps only keep a day’s worth of backup files.

After some debate with my client, my recommendation was to ‘keep it simple’. This meant staying within the realms of vendor recommendation and support: configure an hourly backup and keep a single day’s worth of backups. In the case of a failed and unrecoverable vCNS manager, deploy a new appliance and restore the configuration.
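The trade-off between backup interval and retention is simple arithmetic, sketched below in Python. This is not vCNS functionality: the function names and timestamps are hypothetical, and the pruning would in practice happen on the FTP/SFTP target by whatever means you have there.

```python
from datetime import datetime, timedelta


def backups_retained(interval, retention):
    """How many backup files exist at any time for a given schedule."""
    return retention // interval  # timedelta // timedelta gives an int


def prune(backup_times, now, retention):
    """Keep only the backups taken inside the retention window."""
    cutoff = now - retention
    return sorted(t for t in backup_times if t > cutoff)


hourly = timedelta(hours=1)
one_day = timedelta(days=1)

# Worst case, a failure loses the changes made since the last backup,
# so the backup interval is also the maximum window of lost changes.
print("files kept:", backups_retained(hourly, one_day))
print("max window of lost changes:", hourly)

# Two days of hourly backups (hypothetical timestamps), pruned to one day.
now = datetime(2014, 1, 2, 12, 0)
history = [now - timedelta(hours=h) for h in range(48)]
kept = prune(history, now, one_day)
```

With a nightly snapshot the same calculation gives a window of up to 24 hours, which is where the unknown Edge devices come from.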

I’d be interested to hear any feedback from others as to what they do in their environments or in fact recommend to others.

Understand what logical performance services are provided by VMware solutions

VMware have a number of performance enhancers in vSphere. Some are available in all licence versions, while others require a certain licence level.

Memory

Transparent Page Sharing – Shares identical memory pages across multiple VMs. This is enabled by default. Consideration should be given to try and place similar workloads on the same hosts to gain maximum benefit.

Memory Ballooning – Controls a balloon driver which is running inside each VM. When the physical host runs out of memory, it instructs the driver to inflate by allocating inactive physical pages. The ESXi host can use these pages to fulfil the demand from other VMs.

Memory Compression – Prior to swapping memory pages out to physical disk, the ESXi server starts to compress pages. Compared to swapping, compression can improve the overall performance in a memory overcommitment scenario.

Swapping – As the last resort, ESXi will start to swap pages out to physical disk.

Storage

Storage I/O Control (SIOC) – Introduced in vSphere 4.1, SIOC allows for cluster wide control of disk resources. The primary aim is to prevent a single VM on a single ESX host from hogging all the I/O bandwidth to a shared datastore. An example could be a low priority VM running a data mining type application that impacts the performance of other, more important business VMs sharing the same datastore.

vSphere Storage APIs – Storage Awareness (VASA) – VASA is a set of APIs that permits storage arrays to integrate with vCenter for management functionality.

Networking

Network IO Control (NIOC) – When network I/O control is enabled, distributed switch traffic is divided into the following predefined network resource pools: Fault Tolerance traffic, iSCSI traffic, vMotion traffic, management traffic, vSphere Replication (VR) traffic, NFS traffic, and virtual machine traffic. You can control the bandwidth each network resource pool is given by setting the physical adapter shares and host limit for each network resource pool.
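As a rough illustration of how shares and limits interact, the Python sketch below divides a saturated uplink between resource pools in proportion to their shares and then applies any host limit. The share values, limit and pool names are made up, and the model is deliberately simplified: it assumes every listed pool is actively sending, and it does not hand bandwidth freed by a limit back to the other pools.

```python
def allocate_bandwidth(link_mbps, pools):
    """Split a contended uplink between network resource pools.

    pools maps a pool name to (shares, limit_mbps), where limit_mbps
    is None when no host limit is set. Returns pool name -> Mbps.
    """
    total_shares = sum(shares for shares, _ in pools.values())
    allocation = {}
    for name, (shares, limit) in pools.items():
        fair = link_mbps * shares / total_shares
        allocation[name] = fair if limit is None else min(fair, limit)
    return allocation


# Hypothetical share values on a single 10GigE uplink.
pools = {
    "virtual machine": (100, None),
    "vMotion": (50, 2000),   # capped at 2 Gbps by a host limit
    "NFS": (50, None),
}
result = allocate_bandwidth(10_000, pools)
for name, mbps in result.items():
    print(f"{name}: {mbps:.0f} Mbps")
```

Shares only matter when the adapter is under contention; when it is not, a pool can use whatever is free up to its limit.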

List the key performance indicators for resource utilisation

According to ITIL, a Key Performance Indicator (KPI) is used to assess if a defined service is running according to expectations. The exact definition of the KPIs differs depending on the area. This objective is about server performance which is typically assessed using the following KPIs: Processor, Memory, Disk, and Network.

On Friday morning I sat the VCAP5-DCD exam and I’m delighted to say I passed! If you are a regular visitor, you’ll notice that I have started a VCAP-DCD study guide section which hasn’t been updated in a while. I won’t bore you with why, however I do have all my study notes, which I will collate and continue posting alongside the relevant objectives.

Usual Disclaimer: I agreed to the NDA prior to sitting the exam so I will not divulge any exam specifics, so please don’t ask!

The exam is tough, as is the common theme with VCAP exams, and tests every area of a vSphere deployment. My biggest piece of advice would be to get to know the blueprint inside out. It should become your friend, and you should be comfortable with everything in it!

The multiple choice questions are more complex and tougher than those set out in the VCP exams, as you would expect of the advanced certification, however I believe these questions are very fair. The drag and drop style questions are tricky too and require some working out. Don’t whizz through these questions; take your time, as I would imagine these are some big hitters on the overall exam scoring (I don’t know this, I’m just assuming). The Visio style diagram questions are again tough (see a pattern emerging here?), however they contain all the information you need and more to successfully answer the question.

In no particular order, here is what I would recommend to any people planning to sit the exam:

Don’t panic about time, keep calm and work at a consistent pace and you will be fine

Take as many laminate sheets as permitted. I drew my designs on these before doing them on screen so I knew what I wanted to place where, as the tool can be quite clunky

Aside from official VMware documentation, there are a few other resources I would highly recommend to use for study material, they can be found on my VCAP-DCD study guide page.

My last piece of advice would be to draw out some practice designs. Take your client or internal designs, change them and draw them out. Don’t just concentrate on hosts and clusters; include storage and networks too. Use multiple tiers of storage, multiple protocols, and throw in some DR for good measure.

Originally, for my VCAP-DTD study I used some Magic Whiteboard from Amazon, however it’s quite expensive and I went through the roll quite quickly. I’ve since purchased a clear glass dry-erase board and put it on the wall in my home office, which is much more convenient. Drawing out designs is, in my opinion, an essential skill that needs to be sharp for the exam!

If you are sitting the exam soon, please keep checking back for updates as I continue to post my notes against each blueprint objective and good luck! What’s next? VCAP-DCA of course!

This was covered off in the previous Objective, however, as a reminder

Availability – The ability of a system or service to perform its required function when required. It is usually calculated as a percentage.

Manageability – The expense of running a system. If in a large enterprise the system is managed by a small team, the operation cost can therefore be low.

Performance – The measure of what is delivered by the system. This is usually measured against known standards.

Recoverability – The ability to return a system to a working state after a failure or repair.

Security – The process of ensuring the service is used in the appropriate manner.
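Since availability is usually expressed as a percentage, a quick worked example may help. The Python sketch below uses the common definition of availability as uptime divided by total time; the function names are my own, and the year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_pct(uptime_hours, downtime_hours):
    """Availability as a percentage of the total period."""
    return 100 * uptime_hours / (uptime_hours + downtime_hours)


def allowed_downtime_hours(target_pct, period_hours=HOURS_PER_YEAR):
    """Downtime budget for a given availability target over a period."""
    return period_hours * (1 - target_pct / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% allows about {hours:.2f} hours of downtime per year")
```

Turning a percentage into hours like this makes it much easier to discuss availability requirements with the business.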

Understand what logical availability services are provided by VMware solutions.

The two primary availability services in vSphere are High Availability (HA) and Fault Tolerance (FT). Studying for this exam, you should understand the differences between these features, however at a very high level:

HA – Can minimise downtime by restarting VMs in case of a hardware failure

FT – Provides continuous availability for a VM by maintaining a secondary copy of the VM on another physical host

To gain a better understanding of VMware’s HA (as well as DRS, Storage DRS and Stretched Clusters), the VMware vSphere 5.1 Clustering Deep Dive by Frank Denneman and Duncan Epping is a MUST! The VMware vSphere Availability Guide is also a MUST read.

Fault Tolerance, whilst no doubt a great technology, does have limitations, which are discussed in the Availability Guide. I rarely see a business case for FT; in most cases HA is good enough.

Describe the concept of redundancy and the risks associated with single points of failure.

A single point of failure is a system component that, if it fails, causes the entire system to fail. For example, in a vSphere world, if we have a virtual switch with a single physical NIC uplink and this uplink fails, the virtual switch will fail as a result. These components can be bolstered by adding redundancy. In the above example we could add a second physical uplink to the virtual switch, so that if one uplink fails, traffic can continue to pass on the second. This applies to many areas of a vSphere design: hosts in clusters, components in hosts, and the wider infrastructure, with multiple physical switches, load balancers and so on.
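The benefit of redundancy can be estimated with the standard formula for components in parallel. The Python sketch below assumes each component fails independently, and the 99% figure is purely illustrative. That independence assumption breaks down when the redundant components share something, such as two ports on the same physical card.

```python
def parallel_availability(component_availability, copies):
    """Availability of N redundant components, assuming independent failures.

    The group is only down when every copy is down at the same time,
    so availability = 1 - (1 - a) ** n.
    """
    return 1 - (1 - component_availability) ** copies


single = parallel_availability(0.99, 1)
pair = parallel_availability(0.99, 2)
print(f"one uplink: {single:.4%}, two uplinks: {pair:.4%}")
```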

Differentiate Business Continuity and Disaster Recovery concepts.

Business Continuity focuses on avoiding or mitigating the impact of risk, and is therefore a proactive approach.

Disaster Recovery focuses on the recovery of a system/service after an outage, and is therefore a reactive approach.

VMware offer a free DR/BC Fundamentals training course through MyLearn.

As I’ve already mentioned previously on this blog, and as you’ll probably have realised if you’ve started reading my VCAP-DCD study guides, I’m due to sit the VCAP5-DCD exam in the next few weeks. Due to work commitments, my study has taken a nose dive, however I’m still planning on posting all my study notes covering off the objectives listed in the blueprint over on my VCAP-DCD study guides page.

I’ve not used a cert guide to prepare for an exam before. Normally I will study the exam blueprint and work through the official vendor documentation whilst reading related books, so I wasn’t sure what to expect before reading this guide.

Firstly, let me start by saying that the author has obviously worked on a number of vSphere design projects and is able to back up the methodologies discussed in the book with real life scenarios. This for me was one of the highlights of the book.

I found the ‘Do I Know This?’ quizzes at the beginning of each chapter a good way to judge how well I knew the topics the chapter would cover before I started, and this helped give an indication of whether further reading may be required. Alongside this, at the end of each chapter the author lays out some design scenarios for you to complete. Having completed the VCAP-DTD recently, I know how important it is to practise the scenarios so you can quickly pick out requirements and translate them into a design.

Overall the book reads very well, flows easily, covering off the objectives on the exam blueprint. I’d recommend the book to anyone looking to sit the VCAP5-DCD exam.

Identify basic service dependencies for infrastructure and application services

Service dependencies come in many forms within a vSphere infrastructure design. Services rely on objects such as DNS, NTP, Active Directory etc. What devices are communicating together? What ports are they communicating on? Which processes make up these services?

VMware did have a product to assist in this, VMware vCenter Application Discovery Manager, however this has now gone EOL, and unless you have already purchased it, you won’t be able to get your hands on it. The current state analysis that should have already been completed at this point should help here, in particular in identifying the applications that will be migrated. It will then be a manual process to discover and document these dependencies.

I found a good wiki from ServiceNow which delves deeper into application dependency mapping. This article explains how relationships are defined using the following:

Runs on::Runs

Depends on::Used by

Hosted on::Hosts

Virtualised by::Virtualises

Contains::Contained by

IP Connection::IP Connection

They also delve deeper into upstream and downstream relationships. I’d highly recommend giving this page some attention.

Document and reference your findings to ensure every relationship and dependency is covered and accounted for in the design.
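One lightweight way to document these findings is as a list of relationships that can be walked programmatically. The Python sketch below is only an illustration: the item names are hypothetical, the relationship labels are borrowed from the list above, and this is not how ServiceNow itself stores or queries them.

```python
from collections import deque

# (source, relationship, target); every name here is hypothetical
RELATIONSHIPS = [
    ("web-app", "Depends on", "sql-db"),
    ("web-app", "Runs on", "vm-web01"),
    ("sql-db", "Runs on", "vm-sql01"),
    ("vm-web01", "Virtualised by", "esxi-01"),
    ("vm-sql01", "Virtualised by", "esxi-02"),
    ("vm-web01", "Depends on", "dns"),
]


def depends_on(item, relationships=RELATIONSHIPS):
    """Everything `item` relies on, directly or indirectly (breadth-first)."""
    graph = {}
    for source, _, target in relationships:
        graph.setdefault(source, []).append(target)
    found, queue = [], deque([item])
    while queue:
        for target in graph.get(queue.popleft(), []):
            if target not in found:
                found.append(target)
                queue.append(target)
    return found


def impacted_by(item, relationships=RELATIONSHIPS):
    """Everything that relies on `item`, found by walking the edges backwards."""
    reversed_edges = [(t, r, s) for s, r, t in relationships]
    return depends_on(item, reversed_edges)


print("web-app needs:", depends_on("web-app"))
print("esxi-01 failure affects:", impacted_by("esxi-01"))
```

Walking the list in both directions answers the two questions a design review tends to ask: what does this service need, and what breaks if this component fails.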