Homemade and just a little wrong

Many readers may know that Storage vMotion is an awesome feature that I love. It had one nasty side effect: it reset changed block tracking (CBT). This is a huge problem for any backup product that uses CBT (Veeam, or anything else that does an image-level backup), because after a Storage vMotion you have to do a full backup. It’s painful and wasteful. It means that most enterprise environments refuse to use automated Storage DRS moves (one of my favorite features) because of their impact on backups. Well, the pain is now over… and has been for a while; I just missed it :). If you look at the release notes for ESXi 5.5 U2 you will find the following note:

I wish it told us more about the change or why CBT is no longer reset, but I guess I’ll accept it as is. The previous workaround was to avoid Storage vMotion or to do a full backup afterward… which is not a workaround but a consequence. Either way, enjoy moving the world again.

I have started up the Lunch and Learn sessions again. We have sign-ups and a schedule. I ask that you sign up so I can notify you if a session is canceled. Your information will not be shared with anyone. You can sign up here:

Disclaimer: This is a rant about technology. I will return you to your normal technical posts soon.

Recently I attended the Columbus Symphony with my daughter. She has an interest in music and I want to encourage it. As I was sitting in the theater watching the performers a few things struck me:

Redundancy

Scale out

Unity

The unknown problem

The role of the conductor

Redundancy

As you watch the symphony play, there are many different instruments, each with an individual function. Each stand of sheet music is shared by two performers. When a page needs to be turned, one of the two stops playing and turns it; at the next page turn the other performer takes a turn. This automated and orderly way of performing offline duties reminds me of infrastructure. We are constantly looking to remove single points of failure. We want to create redundancy so that when maintenance or failure occurs (page turning or broken strings) the whole performance is not affected. Failure does not happen often because each performer does regular maintenance on their instrument. This redundancy is critical to a well-running infrastructure: we must be able to perform regular maintenance and still be redundant when a failure occurs. While the pages are being turned, our infrastructure may not be at full strength, and this is where scale out comes into play.

Scale out

It can be easy to see how two people playing an instrument is not enough to provide the required volume and power for the performance. This is where we see the principle of scale out in play. I can add as many two person violin groups as I need to produce the required volume. Adding more violins should be possible to meet the demands of the location or song. The challenge with scaling out is three-fold:

Expertise required

Management demands of scaling out

Balancing the needs as a whole

Expertise

In order to fill more chairs I need more highly trained performers. In infrastructure terms I need more specialized devices that are compatible. I cannot simply plug a kazoo player into a violin and expect beautiful music; compatibility and skills are required. This tenet applies to all aspects of infrastructure.

Management demands of scaling out

As I scale out I quickly find it hard to manage so many people. Simply put, unless I can manage 100 people exactly as I manage one, there is a cost associated. This is where scale-out computer solutions should have the advantage: assuming you buy a solution from a single vendor, we hope its pieces can be managed as one entity. I have found that vendors’ solutions don’t seem to have this level of intelligence. VMware has brought us vSphere, which does abstract and pool compute resources. It seems that a lot of storage and networking vendors have not figured out how to scale out without making it hard to manage.

Balancing the needs as a whole

Adding more violins increases the volume of my violins but may drown out all other instruments except the drums. This is not a desired effect. Adding more violins may in turn require more of the other instruments to match the new scale. This is a very hard thing to balance. In storage systems we need to balance IOPS, cache sizes, algorithms and spinning disks. In networking we see total throughput, hairpinning and redundant architectures all affecting our scale up. In compute we have the introduction of server-side flash and cache alongside the needs of the application as a whole. One cannot simply increase one metric without looking at its effect upon the whole.

Unity

In the symphony they all have a common goal. They know that goal from the start, and they have practiced and trained for it (think QA testing and programming logic). They require that all components do their job in unity to achieve the goal. If one section of instruments is a few seconds off from the rest, the performance is ruined (at least for those who can tell the difference). Their unity and timing are critical. Humans are prone to mistakes and they will happen. Performers will get out of sync and need to catch up. Infrastructure is the same way. If my network chooses to delay a message for a few seconds, everything else is affected. We need all the components to work perfectly every time. This is harder than it sounds. Computers are programmed and cannot account for anything that was not provided in their program.

The Unknown Problem

Here is the big problem. It’s what we don’t know that will kill us. In religion there is the concept of absolute truth and relative truth. Absolute truth is truth based upon all the facts. The concept is that if we understood everything we could always make the right choice. We would be able to be perfect and create without failures. Religion is largely based upon following a being that has absolute truth. Relative truth is truth based upon our current understanding until proven incorrect: think the world is flat… now it’s roundish. Relative truth is the world that we deal with each day.

In the performance, assumptions can be made about the required number of performers based on the size of the hall or past experience with the hall. Best practices around performance sizing can be made. The assumptions are just that: assumptions. They cannot take into account all possible eventualities. Disaster may strike, like a floor or roof caving in, or something simple, like an accident outside the theater that causes an ambulance’s siren to ring for half the performance. These factors are unknowns and are common.

When writing code in college I often had my wife test out the application. It normally took her about 15 seconds to do something totally unexpected (by me) and break everything. It was so frustrating. Users and applications will do the unexpected. There are a lot of unknowns, like the effect of a lightning-induced power failure on your storage system (it’s not good, trust me). The unknown requires that we keep an open mind and adjust as needed.

All IT is software defined. It does not matter if it’s on a chip or running in memory; it’s software defined. Firmware is software that runs on hardware. The critical concept for me from software-defined IT is the ability to have intelligence and agility. I love the story about the last Google outage: a bug was introduced into production networking, and it was detected and removed automatically by the software. Can anyone else say awesome, followed by I am afraid of Skynet? (For non-US readers, Skynet is an A.I. from the Terminator movie series that tried to kill all humans.) This is intelligent and agile. The latest movement to define everything in software should provide quicker, redundant fixes to the unknown problem.

The role of the conductor

The conductor’s role is to unify the performers, set a tempo, execute clear preparations and beats, and to listen critically and shape the sound of the ensemble (Wikipedia). So he is the big boss man who keeps the whole ship running perfectly. He could be called the architect, but that is simply untrue. The architect is a person who works with relative truth, old truth and observed truth. In order to explain my problem with the architect being called the conductor I have to illustrate another challenge: the music changes. The symphony plays a song, then changes to the next song. Their roles and goals change. Violins may have had a heavy role in the last song and a very minor role in this one, making that scale out of performers unnecessary. The game is constantly changing, and each song has its own challenges. Infrastructure has a much larger problem: there is no common goal. Take for example that I am running 200 virtual machines. Each virtual machine has a different role and different needs. They are like 200 garage bands playing at the same time. No amount of conducting can solve the lack of similar goals. It will sound really bad, or at least really loud. Each application really needs its own conductor and space. They need to be able to get access to resources in an intelligent way without affecting other applications.

Who is the conductor?

Like it or not, each of our applications is its own conductor. Treating them as a single entity with the same metrics is only asking for trouble. We have been given a number of tools in the compute arena to manage individual applications, like reservations, DRS, SDRS, NIOC etc. These allow the conductor of vSphere to understand some metrics around our little bands. This knowledge is even automated from time to time to make our life easier, for example with DRS. This understanding of our applications ends at the compute layer. Storage and networking treat everyone the same. There have been some inroads into this problem: QoS and IOPS allocations. At the end of the day storage systems want to deal with writes and reads, networks want to deal with transfer of data, and neither wants to be intelligent about the 200 applications running on those ESXi hosts. When I provisioned storage to a single server it was easy. Now I provision storage to potentially 32 servers running 4,000 little bands. I need a master conductor, I need agility, I need to scale, I need unity, I need something that allows my applications to be their own conductors, and most of all I need intelligence. I need all these things to work together in concert at my individual operating system layer. I need virtualized networking and storage. I need the same magic VMware brought with ESXi to the other realms. This post is not a slam on vendors; they do an awesome job and I geek out on their stuff every day. This is not easy or it would already be done. There are vendors out there doing parts of this today. We need to find them and support them to bring change.

I have been working with vSphere to get internally generated alarms to send an SNMP trap for ticket generation. This process seemed simple on the surface but proved quite challenging. The high level steps are as follows:

Configure the Alarm Action in SNMP Trap

I ran into a number of issues that generated this community post. The essential issue is that all SNMP events generated by vSphere come in as the same type of event, vpxaAlarmInfo. The details of the event contain an internal name, and this is where the problem begins. For any custom-created alarm, the internal name is simply the name you gave the alarm. For example, if I create an alarm called JoeTest then it’s called JoeTest. Sounds simple, right? Well… no, because the VMware built-in alarms don’t follow this naming convention. The Host connection and power state alarm (the easiest one for me to generate) is named alarm.HostConnectionStateAlarm. This makes my mappings for any VMware-generated events very hard. So I went on a quest to locate these names.
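To make the mapping problem concrete, here is a minimal Python sketch of how the ticketing side might translate the name carried in a trap into something readable. The function name and the lookup table are hypothetical, not part of vSphere; only the two example names come from this post, and a real table would need one entry per built-in alarm.

```python
# Hypothetical lookup from built-in internal names to display names.
# Only this one pair is taken from the post; the rest would have to be
# collected by hand.
BUILTIN_ALARM_NAMES = {
    "alarm.HostConnectionStateAlarm": "Host connection and power state",
}


def display_name(trap_alarm_name):
    """Return a readable alarm name for the name carried in an SNMP trap."""
    if trap_alarm_name in BUILTIN_ALARM_NAMES:
        return BUILTIN_ALARM_NAMES[trap_alarm_name]
    # Custom alarms arrive under the name they were created with.
    return trap_alarm_name


print(display_name("alarm.HostConnectionStateAlarm"))
print(display_name("JoeTest"))
```

The table is the painful part: without a list of the internal names there is nothing to put in it.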

The Quest for the names

My first stop was PowerCLI using the command:

$bob = Get-AlarmDefinition -Name "Host connection and power state"

$bob | fl

This fine PowerShell did not produce the alarm.HostConnectionStateAlarm name. It did produce an ID of Alarm-145 (unique to my vCenter). I tried lots of ways to work on this object, like Get-View etc…, without any luck.

My next stop was the MOB (Managed Object Browser), also known as my least favorite place. Using the following MOB URLs I was able to learn everything about the alarm except the name used for SNMP:

https://vcenter/mob/?moid=alarm-145

https://vcenter/mob/?moid=alarm-145&doPath=info

https://vcenter/mob/?moid=alarm-145&doPath=info.action.action

https://vcenter/mob/?moid=alarm-145&doPath=info.expression.expression

https://vcenter/mob/?moid=alarm-145&doPath=info.setting

This led me to my last stop: the vCenter database. Some finely crafted searches produced a number of tables with the alarm.xxx information. I was left with the VPX_EVENT_ARG table. It seems to be a table of all events in the system. Inside it I was able to locate names that seemed to fit. A few more minutes of searching did not produce any primary keys to link to the alarm tables. I was stuck, so I punted. The following is the SQL command I used to produce the alarm names:

select distinct OBJ_NAME from [vCenter].[dbo].[VPX_EVENT_ARG] where obj_name like '%alarm%'
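For anyone who wants to see the shape of that query without touching a production database, here is a small Python sketch that runs the same filter against an in-memory SQLite table. This is only a stand-in: the real vCenter table has many more columns, and the rows below are placeholders built from the two names mentioned in this post.

```python
import sqlite3

# In-memory stand-in for the vCenter database (assumption: one column only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE VPX_EVENT_ARG (OBJ_NAME TEXT)")
conn.executemany(
    "INSERT INTO VPX_EVENT_ARG (OBJ_NAME) VALUES (?)",
    [
        ("alarm.HostConnectionStateAlarm",),
        ("alarm.HostConnectionStateAlarm",),  # same alarm fired twice
        ("JoeTest",),                         # custom alarm name
    ],
)

# Same filter as the query above: distinct names containing "alarm".
rows = conn.execute(
    "SELECT DISTINCT OBJ_NAME FROM VPX_EVENT_ARG WHERE OBJ_NAME LIKE '%alarm%'"
).fetchall()
alarm_names = [name for (name,) in rows]
print(alarm_names)
conn.close()
```

Note that a custom alarm such as JoeTest does not match the filter. That is fine here, because custom alarms already show up in the trap under the name they were given.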

Yes, a big Auto Deploy post will be following soon. I can really see the benefit of Auto Deploy in larger environments. I’ll be posting the architectural recommendations and failure scenarios soon. Today I am posting about stateless cache and USB.

What is stateless cache and why do I care?

Stateless cache allows your auto deployed ESXi host (TFTP image running in memory) to be installed on a local hard drive. This enables you to boot the last running configuration without the presence of the TFTP server. It’s a really good protection method. It is enabled by editing the host profile and in 5.5 it can be enabled using the fat client:

Select the profile and right click on it

Select Edit

Expand System Image Cache Configuration

Click on System Image Cache Profile Settings

Select the drop down and choose the stateless caching mode you want.

This all sounds great, but we had a heck of a time trying to get it to stateless cache to SD cards on our UCS gear. A coworker discovered that SD cards are seen as USB devices. Once we selected “Enable stateless caching to a USB disk on the host” everything worked.

Design Constraints

Using stateless caching will protect you against a failure of TFTP and even vCenter but DHCP and DNS are both still required for the following reasons:

DHCP to get IP address information

DNS to get hostname of ESXi host

Stateless does not remove all dependencies but it does allow quick provisioning.
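The dependency picture above can be written down as a tiny Python check. This is only a sketch of the reasoning in this post, not anything that ships with Auto Deploy; the function name and the service labels are made up for illustration.

```python
def can_boot(available_services, stateless_cache_enabled):
    """Return True if an auto deployed host has what it needs to boot.

    Assumption taken from this post: DHCP and DNS are always needed,
    while TFTP and vCenter are only needed when there is no cached image.
    """
    required = {"DHCP", "DNS"}
    if not stateless_cache_enabled:
        required |= {"TFTP", "vCenter"}
    return required <= set(available_services)


# TFTP and vCenter are down, but the host has a cached image.
print(can_boot({"DHCP", "DNS"}, stateless_cache_enabled=True))
# Same outage without stateless caching.
print(can_boot({"DHCP", "DNS"}, stateless_cache_enabled=False))
```

Losing DHCP or DNS fails the check in both modes, which is the design constraint to plan around.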

This is huge for those willing to make the investment. This applies to VCP-Cloud or VCP-DT only. You can submit a VCDX design in those fields and potentially get the VCDX without passing the VCAP or new VCIX exams. As a current VCP-Cloud and VCDX-DCV I am super tempted to make the next three weeks really bad for the chance at a second VCDX. If you have one of these VCPs you should really go out, grab a VCDX mentor and make it happen. It’s roughly $300 to submit a design, worth every penny. Take it from me, the VCDX is huge. It’s my understanding this only applies to the April submission deadline of the 1st. Good luck.

This section deals with upgrading from older versions of vShield to NSX. The simple answer is that there is a specific order that must be followed. Upgrades from vShield require version 5.5. Most of it is done in the GUI via vCenter, except for the vShield Manager, which is replaced by NSX Manager. Most of these processes roughly follow the documented process in this document.

Product name translation:

Roughly here are the old names to new names or new service providing function:

vShield Manager -> NSX Manager

Virtual Wires -> NSX Logical Switch

vShield App -> NSX Firewall

vShield Edge -> NSX Edge

vShield Endpoint -> vShield Endpoint

Data security -> NSX Data Security

Practicing this process:

Unless you want to spend a few hours configuring all the vShield products, it’s hard to practice. You can, however, do the upgrade from vShield Manager to NSX Manager really quickly. Just download the vShield Manager and set it up with the following:

Deploy OVF

Power on

Console login as admin with password of default

type enable with password of default

type setup

Setup your IP settings

Wait 5 minutes

Login via IP with web browser and do upgrade

The rest of the upgrade requires that you understand the vShield products, which is not required for NV, so I vote you skip it and just be familiar with the process, order and requirements.

Objective 1.2 Denotes the following items:

Upgrade vShield Manager 5.5 to NSX Manager 6.x.

Upgrading vShield Manager to NSX Manager can only be done from version 5.5 of vShield. It also requires the following things:

Virtual wires must be upgraded to NSX logical switches to use NSX features. The process is required even if you don’t use virtual wires. In order for this to work you need to upgrade your vShield manager to NSX manager and make sure it’s connected to vSphere.

Process

Login to Web client

Networking and Security Tab click install

Click host prepare

Virtual wires will show as Legacy

Click update on each wire

Wait for them to show green and no longer legacy

Upgrade vShield App to NSX Firewall

You can only upgrade vShield App 5.5 to NSX. It requires that vShield manager be upgraded to NSX manager and virtual wires upgraded to NSX logical switches.

A pop up window should ask if you want to upgrade

Click update and wait

Done

Upgrade vShield 5.5 to NSX Edge 6.x

This upgrade requires the following:

vShield 5.5

NSX Manager

Virtual wires upgraded to NSX logical switches

Processes:

Login to web client

Networking & Security tab

NSX Edges button

Select upgrade version from actions menu on each edge

After completion, check the version number tab

Upgrade vShield Endpoint 5.x to vShield Endpoint 6.x

This upgrade requires the following:

vShield Manager upgraded to NSX Manager

Virtual wires upgraded to NSX Logical switches

Process:

Login to web client

Networking & Security tab

Click Installation

Click Service deployments tab

Click on upgrade available

Select datastore (must be shared) and network and ok

Upgrade to NSX Data Security

There is no clean upgrade path; you have to remove it before installing NSX Manager. Afterward you have to re-register the solution with NSX, if available.
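Since the whole objective comes down to order and prerequisites, here is a short Python sketch that writes the requirements from this section down as a dependency graph and lets the standard library work out a valid order. The step labels are my own shorthand, Data Security is left out because it has no upgrade path, and graphlib needs Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

MANAGER = "NSX Manager"
SWITCHES = "NSX Logical Switches"

# Each upgrade target maps to the targets that must be finished first,
# as listed in the requirements above.
PREREQUISITES = {
    MANAGER: set(),
    SWITCHES: {MANAGER},
    "NSX Firewall": {MANAGER, SWITCHES},
    "NSX Edge": {MANAGER, SWITCHES},
    "vShield Endpoint 6.x": {MANAGER, SWITCHES},
}

# static_order() yields every step after all of its prerequisites.
upgrade_order = list(TopologicalSorter(PREREQUISITES).static_order())
for position, step in enumerate(upgrade_order, start=1):
    print(position, step)
```

The manager always comes first and the logical switches second; the last three steps have no requirements on each other in this section, so any order among them satisfies the graph.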

About Me

Joseph Griffiths is a virtualization-focused systems architect who works in complex cloud-based solutions. He currently holds many IT certifications including VMware VCDX-DCV #143. This blog represents his random technical notes and thoughts. The thoughts expressed here do not reflect Joseph's current employer in any way. You can follow Joseph on Twitter @Gortees