Thursday, May 19, 2011

When Windows Azure performs a rolling upgrade of its compute clusters, it brings down only one fault domain at a time. If you have only one instance of your Windows Azure Web or Worker Role running, that instance goes down when its fault domain is taken down for the upgrade. To maintain the Windows Azure SLA, you need to have more than one instance running. Multiple instances are spread evenly across all the fault domains automatically when the role is deployed. If you are running more instances than there are fault domains, some of the instances will share a fault domain, but they are still spread as evenly as possible.
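To make the placement argument concrete, here is a minimal Python sketch. It assumes a simple round-robin placement and a hypothetical count of two fault domains. The real Windows Azure placement logic is not described in this post, so treat the function names and the algorithm as illustrative only.

```python
from collections import defaultdict


def assign_fault_domains(instance_count, fault_domain_count):
    """Hypothetical round-robin placement: instance i lands in domain i % fault_domain_count."""
    placement = defaultdict(list)
    for i in range(instance_count):
        placement[i % fault_domain_count].append(f"instance_{i}")
    return dict(placement)


def survivors(placement, domain_down):
    """Return the instances still running while one fault domain is offline."""
    return [
        name
        for domain, names in placement.items()
        if domain != domain_down
        for name in names
    ]


# One instance: taking its fault domain down leaves nothing running.
single = assign_fault_domains(1, 2)
print(survivors(single, 0))  # []

# Five instances over two domains: spread 3/2, and either domain can go down safely.
many = assign_fault_domains(5, 2)
print(many)
print(survivors(many, 0))
```

With a single instance, nothing survives the loss of its fault domain. With two or more instances, at least one keeps running whichever domain is being upgraded.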

During a rolling upgrade, Windows Azure does not bring a second instance online on the upgraded fault domain before shutting down your first (non-upgraded) instance. There are two reasons for not bringing another instance online:

There is no way to know whether your deployment supports having N+1 instances. Since you requested exactly ‘N’ instances, bringing up N+1 (even temporarily) could break your design. For example, since this might be an internal role, you may have your own load-balancing algorithm that depends on the number of instances. Or you may require a single instance of this specific role (e.g. a coordinator).
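As an illustration of a design that depends on the instance count, consider a role that partitions work by taking a work item ID modulo the number of instances. This is only a sketch of one possible scheme. The function names and the integer IDs are assumptions made for the example, not anything Windows Azure provides.

```python
def owner(work_item_id, instance_count):
    """Hypothetical partitioning rule: item ID modulo the number of instances."""
    return work_item_id % instance_count


def moved_items(work_item_ids, old_count, new_count):
    """Items whose owning instance changes when the instance count changes."""
    return [
        item
        for item in work_item_ids
        if owner(item, old_count) != owner(item, new_count)
    ]


items = list(range(12))

# Going from N=3 to N+1=4 instances reassigns 9 of these 12 items.
print(moved_items(items, 3, 4))  # [3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Under this scheme, a temporary fourth instance would change the owner of most items, which is the kind of disruption a role that requested exactly three instances may not be built to handle.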

If a new instance were brought online, it would not have the local storage of the previous instance, so it would behave like a re-image of the role. After the upgrade, it is more than likely that your old instance (and its local storage) will return, which is preferable to a new instance with a ‘stale’ version of the local storage.