We had a client out West who had been undergoing network infrastructure changes. After completing the majority of the project, they began to notice an error message intermittently displayed to users when they used MOC to call other remote users connected through the Edge server. The error message reads:

Some calls to and from people outside of your corporate network may not connect due to server connectivity problems. Try signing out and signing back in. If this problem continues, contact your system administrator with this information.

Since this appears to only affect calls from the internal network going out through the Edge server, I went straight to the front-end server and ran the following validation tests:

Front-End Server Validation:

Web Conferencing Server Validation:

A/V Conferencing Server Validation:

To eliminate a connectivity issue, I did a simple telnet from the front-end server to the Edge server over port 5062, and the connection established successfully. I also checked the static persistent routes on the Edge server to ensure the appropriate routes for the 172.x and 10.x subnets were in place.
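The telnet test above can also be scripted as a quick TCP probe. This is a minimal sketch, assuming nothing beyond the standard library; the host name in the commented example is the internal Edge FQDN used later in this post and is only a placeholder for your environment:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the Edge server's internal port (placeholder host name):
# port_open("xxx-edge-01.someDomain.local", 5062)
```

This only proves the port accepts connections, just like telnet; it says nothing about whether the TLS handshake that follows will succeed.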

All of these errors point to a mismatch between the certificate name the front-end server expects and the one presented. In this client’s case, the internal interface on the Edge server was using the public certificate with the name sip.companyName.com, while the internal front-end server initiates connections to the Edge server using the name “xxx-edge-01.someDomain.local”.

To confirm that the Edge server is indeed presenting the wrong certificate, we can look at which certificate is assigned, as shown here in the properties:

I went ahead and double-checked the sip.companyName.com certificates, and none of them had a SAN entry for the internal name. This means the internal interface should use an internally issued certificate with the proper internal Edge server name. The screenshot above shows that there’s actually such a certificate, but it’s not assigned.
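The failure mode can be illustrated with a deliberately simplified name check. Real TLS validation also handles wildcards and certificate chains, which this sketch ignores; the names are the ones from this post:

```python
def name_matches(expected: str, subject_cn: str, san_entries: list[str]) -> bool:
    """Simplified check: does the expected FQDN appear as the subject CN
    or in the Subject Alternative Name list? (Ignores wildcard matching.)"""
    expected = expected.lower()
    return expected == subject_cn.lower() or expected in (s.lower() for s in san_entries)

# The front-end expects the internal FQDN, but the public certificate only
# carries sip.companyName.com and has no SAN for the internal name -> mismatch:
name_matches("xxx-edge-01.someDomain.local", "sip.companyName.com", [])
```

An internally issued certificate whose subject (or SAN) is the internal Edge FQDN makes this check pass, which is exactly what assigning the internal certificate achieves.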

Resolution

The resolution is simple: all we need to do is assign the internal certificate. I verified the expiry date for the certificate, and it had not yet expired.

Once I changed the certificate, the error in the validation went away and users stopped receiving the error message.

Tuesday, September 28, 2010

I ran into an interesting issue a few months back when a client with OCS 2007 R2 called our support services because all of the translation rules in his location profile simply stopped working. The problem was eventually escalated to me, so I logged into the client’s network remotely and started looking around. It’s been more than 4 months, so I can’t remember everything I looked at before arriving at the solution. Long story short, I eventually opened up Enterprise Voice Route Helper:

…and imported the routing data:

What I was basically looking for was a normalization rule placed at the top of the rules list that somehow caught all cases. I wasn’t able to spot anything out of the ordinary while scrolling down the list, so I tried to perform an ad hoc test with my cell number, and this was where I got lucky, because I got the following message in the results:

The specified profile contains one or more rules that were invalid.

The profile cannot be used.

As shown in the screenshot above, there was a translation rule entered into the profile with improper syntax, which rendered the whole location profile unusable. This was an obvious typo the client made, and I found it very interesting that the management console actually accepted it. Everything started working once I fixed the missing bracket.
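Normalization rules are regular expressions, so a single malformed pattern (such as a missing bracket) can invalidate the entire profile. A sketch of a validator that compiles each rule and reports the broken ones — the rule strings below are made up for illustration and are not the client’s actual rules:

```python
import re

def find_invalid_rules(patterns: list[str]) -> list[tuple[str, str]]:
    """Compile each pattern; return (pattern, error message) pairs for any that fail."""
    bad = []
    for p in patterns:
        try:
            re.compile(p)
        except re.error as e:
            bad.append((p, str(e)))
    return bad

# Hypothetical rules: the second is missing its closing bracket, like the client's typo.
rules = [r"^(\d{7})$", r"^([2-9]\d{6}$"]
for pattern, error in find_invalid_rules(rules):
    print(f"invalid rule {pattern!r}: {error}")
```

Running something like this against an exported location profile would have pinpointed the bad rule immediately instead of requiring an ad hoc test call.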

Monday, September 27, 2010

We had to do some emergency maintenance on a new NetApp shelf a few days ago and found that, because we were working with an older version of the firmware, some commands did not exist on the filer, so we ended up using the GUI and CLI together to work around the problem.

Note: Some of the commands may not be necessary, but I wanted to list all the steps we had to take to get this to work, so I’ve highlighted the steps that are possibly not needed in RED. Please also note that we had a small window to work with, so the instructions below may not abide by best practices.

Task: Disks in the new shelf had been assigned to 2 separate controllers (6 each).

Problem: We could not find the command to unassign the disks in this version of the firmware, so we had to use a combination of the GUI and CLI to remove the disks from one controller and then reassign them to the other.

We tried the remove option as well as the replace option but found that, just as the description specifies, these commands expect you to be moving spares around. As we were pressed for time to get the disks reassigned, we went into the GUI to offline these disks by setting them to remove:

After removing these 3 disks from the GUI, a disk show now shows the following:

Since the disk reassign command can only be run in maintenance mode or during takeover in advanced mode, we used the disk remove_ownership command instead. Before we could execute that command, we needed to elevate our privileges to advanced:

FAC01> priv set advanced
Warning: These advanced commands are potentially dangerous; use them only when directed to do so by Network Appliance personnel.
FAC01*>

Then we executed:

FAC01*> disk remove_ownership 0b.17
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

FAC01*> disk remove_ownership 0b.19
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

FAC01*> disk remove_ownership 0b.21
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y
FAC01*>

The following is the output of a disk show after the above commands were completed:


Notice that the message indicates disk.auto_assign is turned on, so in order to keep these disks unassigned, we need to execute the following:

FAC01*> options disk.auto_assign off
You are changing option disk.auto_assign which applies to both members of
the cluster in takeover mode.
This value must be the same in both cluster members prior to any takeover
or giveback, or that next takeover/giveback may not work correctly.
Sun Sep 26 21:25:53 EST [PHMSFAC01: reg.options.cf.change:warning]: Option disk.auto_assign changed on one cluster node.
FAC01*>

FAC02*> options disk.auto_assign off
You are changing option disk.auto_assign which applies to both members of
the cluster in takeover mode.
This value must be the same in both cluster members prior to any takeover
or giveback, or that next takeover/giveback may not work correctly.
Sun Sep 26 21:25:53 EST [PHMSFAC01: reg.options.cf.change:warning]: Option disk.auto_assign changed on one cluster node.
FAC02*>

While performing some troubleshooting with a client a few weeks ago on their vCenter 4.1 server, I learned that vCenter 4.1 actually uses a 64-bit DSN. This means we no longer have to:

Install the 32-bit client via Microsoft’s downloads

Run 32-bit ODBC via the WoW64 folder

Create a 32-bit DSN

…which you had to do for a vCenter 4.0 install.

The following is the difference between the two versions:

vCenter 4.0

Database Options

Select an ODBC data source for vCenter Server.

vCenter Server requires a database.

Use an existing supported database

Data Source Name (DSN): (Please create a 32-bit system DSN)

vCenter 4.1

Database Options

Select an ODBC data source for vCenter Server.

vCenter Server requires a database.

Use an existing supported database

Data Source Name (DSN): (Please create a 64-bit system DSN)

Now I’m just waiting for VUM (Update Manager) 4.1 to support a 64-bit ODBC DSN, because it’s currently still required to be installed on a 64-bit operating system yet use a 32-bit DSN.

Update

The following might be helpful when configuring the 64-bit DSN:

Make sure you select SQL Server Native Client 10.0 for the System DSN and not the regular SQL Server driver, or the vCenter installation wizard won’t detect it (you can’t just hit the back button and then forward again to see the DSN).

Sunday, September 26, 2010

Warning: I’m not a SAN expert, but as I’ve gotten more opportunities to work on datacenter projects, I’m beginning to see more real-world SAN implementations. While this post doesn’t provide a complete breakdown of what to consider when calculating raw and usable storage, I hope it will at the very least provide some useful information to professionals out there looking for real-world numbers when provisioning SAN storage.

Configuration

Brand: NetApp

Model: FAS2020

Version: 7.2.6.1

RAID: RAID_DP (Double Parity)

RAIDSize: 16

NumberofDisks: 6

Disk Size: 300GB SAS

Actual usable disk size: 266GB

Total Aggregate Capacity: 908GB

As shown with the information listed above, configuring a FAS2020 with 6 x 300GB SAS drives realistically yields only 908GB for the aggregate. Working out the numbers we can see that:

Specifications on paper: 300GB x 6 disks = 1.8TB

Actual drive capacity: 266GB x 6 disks = 1.596TB

Actual useable *aggregate* capacity after RAID_DP: 908GB

If we divide the numbers to get the amount of storage space lost to overhead such as RAID, we’re actually losing approximately 51% of drive space. This 51% also does not include the spare disks you’ll need (you need one spare disk per controller, so an active/active setup needs two spare disks). Also don’t forget that the software for the controller sits in aggregate 0 on the NetApp, which takes up additional space. As of 2009, the NetApp technician told me that a minimum of 10GB is required for the root volume and 20GB is recommended for the FAS2020.

Lastly, as new volumes are created for LUNs, each volume needs more space than the actual LUN, because you will need extra space if you decide to use snapshots. Best practice, as told by the NetApp engineer, is to provision 2x + delta of space (x being the size of the LUN). This covers the worst case: if a snapshot is taken of a completely full LUN, all information on the LUN is deleted, and the LUN is filled back up with different information, a volume sized at 2x + delta can still hold both the snapshot of the original data and the new data. That said, since most companies don’t like to lose so much storage, another common practice is to use 1x + delta (x being the size of the LUN).
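The sizing rules above reduce to a one-line formula. A sketch of the arithmetic only — the 100GB LUN and 10GB delta below are hypothetical values, and "delta" is whatever change-rate allowance you choose:

```python
def volume_size_for_lun(lun_gb: float, delta_gb: float, factor: int = 2) -> float:
    """Volume size per the 'factor * x + delta' rule, where x is the LUN size.
    factor=2 is the conservative best practice; factor=1 is the leaner variant."""
    return factor * lun_gb + delta_gb

# Hypothetical 100GB LUN with a 10GB delta allowance:
print(volume_size_for_lun(100, 10))     # 2x + delta
print(volume_size_for_lun(100, 10, 1))  # 1x + delta
```

With 1x + delta you accept that a snapshot taken of a full LUN cannot survive a complete rewrite of the LUN’s contents; that is the trade-off against the extra capacity 2x + delta consumes.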

Thinking about all the factors that contribute to lost storage in exchange for redundancy sometimes scares me, so I find it all the more important to communicate all the variables to customers and set their expectations appropriately.

The following is another example similar to the configuration above but with 1TB drives:

Configuration

Brand: NetApp

Model: FAS2020

Version: 7.2.6.1

RAID: RAID_DP (Double Parity)

RAIDSize: 6

NumberofDisks: 6

Disk Size: 1TB SAS

Actual usable disk size: 828GB

Total Aggregate Capacity: 2.76TB

As shown with the information listed above, configuring a FAS2020 with 6 x 1TB SAS drives realistically yields only 2.76TB for the aggregate. Working out the numbers, we can see that:

Specifications on paper: 1TB x 6 disks = 6TB

Actual drive capacity: 828 x 6 disks = 4.968TB

Actual useable *aggregate* capacity after RAID_DP: 2.76TB

If we divide the numbers to get the amount of storage space lost to overhead such as RAID, we’re actually losing approximately 54% of drive space. As in the first example, this 54% also does not include the other variables that contribute to further loss of storage.

Here’s another example similar to the first one but with 4 disks instead:

Configuration

Brand: NetApp

Model: FAS2020

Version: 7.2.6.1

RAID: RAID_DP (Double Parity)

RAIDSize: 16

NumberofDisks: 4

Disk Size: 300GB SAS

Actual usable disk size: 266GB

Total Aggregate Capacity: 454GB

As shown with the information listed above, configuring a FAS2020 with 4 x 300GB SAS drives realistically yields only 454GB for the aggregate. Working out the numbers we can see that:

Specifications on paper: 300GB x 4 disks = 1.2TB

Actual drive capacity: 266GB x 4 disks = 1.064TB

Actual useable *aggregate* capacity after RAID_DP: 454GB

If we divide the numbers to get the amount of storage space lost to overhead such as RAID, we’re actually losing approximately 63% of drive space. Again, this does not include the other contributing factors that decrease the amount of usable storage even more.
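The loss percentages quoted in the three examples can be reproduced with a few lines. Note that this follows the post’s convention of comparing the usable aggregate (in GB) against the on-paper capacity converted at 1TB = 1024GB:

```python
def pct_lost(raw_paper_tb: float, usable_gb: float) -> int:
    """Percentage of on-paper capacity lost to right-sizing plus RAID_DP,
    converting the on-paper figure at 1TB = 1024GB."""
    raw_gb = raw_paper_tb * 1024
    return round(100 * (1 - usable_gb / raw_gb))

print(pct_lost(1.8, 908))      # 6 x 300GB disks -> 908GB aggregate
print(pct_lost(6.0, 2826.24))  # 6 x 1TB disks   -> 2.76TB aggregate
print(pct_lost(1.2, 454))      # 4 x 300GB disks -> 454GB aggregate
```

The third example loses the largest fraction because RAID_DP always dedicates two parity disks regardless of how few data disks are in the RAID group.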

Random notes I took while troubleshooting with the NetApp engineer: There are ways to reclaim space for your aggregates and volumes, such as deleting snapshots, reducing fractional reserve, and reducing the frequency of snapshot schedules, but they all reduce redundancy. Also, make sure a 1-LUN-per-volume mapping is followed so that if a volume ever goes down, not all of your LUNs go down with it. Lastly, make sure snapshot autodelete is turned on: if no space is left for snapshots, the NetApp will delete the oldest one and take the new snapshot. If this is not turned on, the LUN will go offline if the volume fills up without space reservation.

My thoughts: Being a consultant means we’re obligated to pass on the truth about our knowledge to customers, and while this may be hard to digest for many clients, it’s important not to overlook why companies purchase SANs in the first place: they want robust storage that provides redundancy, exceptional recovery time, and performance, and storage companies design their solutions with this as their number 1 priority. I’ve been fortunate enough to attend a training session delivered by Peter Henneberry from NetApp, and it was quite the eye opener when he gave us real-world statistics on how they can complete backups within seconds or minutes rather than hours. So while I can’t state all the benefits of a SAN here, there are plenty of reasons.

I’m not much of a storage consultant, even though I’d like to get into it a bit more, so please forgive any mistakes I have made in this post, whether in calculations or information I have missed.

Aside from having incorrect memory ordered and some other issues, I was unable to get ESXi 4.0.0 to see the HP StorageWorks 42B HBA.

As shown in the screenshot below, I was unable to have ESXi 4.0.0 build 261974 see the vmhba after a fresh install.

After doing a few searches on Google and finding various posts about people having problems installing 2 of the quad-port cards but managing to get them going after upgrading the drivers or ESXi, I decided to try upgrading ESXi from 4.0.0 to 4.1.0 (http://terenceluk.blogspot.com/2010/09/updating-vsphere-esxi-from-40-to-41.html), assuming that the newer ESXi might have the right drivers for the HBA.

The upgrade went without a hitch, but I was still unable to see the HBA after getting ESXi to version 4.1.0, so I went back to the internet to do some searches and found that there were ways to install additional drivers for ESXi and, coincidentally, you use the VMware vSphere CLI to do it (see my previous post about why this was coincidental). The next step was to try and find the HBA drivers. Reviewing the BOM, I found the description:

Based on what I found on the HCL, it does look like it’s supported so I went back to the VMware downloads section to try and find drivers offered directly from that site. What I found was the following:

Description
This driver CD release includes support for version 2.1.1.1 of the Brocade BFA driver on ESX/ESXi 4.0. This BFA driver supports products based on the Brocade 825, 815, 425, and 415 Fibre Channel host bus adapters (HBA).

Looking into the offline-bundle folder, I found the following package:

BRCD-bfa-2.1.1.1-00000-offline_bundle-285864.zip

Drilling into that .zip package showed that it contains the following:

metadata.zip

vmware-esx-drivers-scsi-bfa-400.2.1.1.1-1OEM.x86_64.vib

This was when I was sure that this package was the one I wanted, so I went ahead and extracted the BRCD-bfa-2.1.1.1-00000-offline_bundle-285864.zip package, then went back to vSphere CLI to try and update the drivers with it:

The update completed successfully, but the system needs to be rebooted for the changes to be effective.

C:\Program Files\VMware\VMware vSphere CLI\bin>

Success!

Once the update completed, I went and fired up VI Client, connected to the host, navigated to the storage adapters section and now I can see the vmhba listed!

Now that I got the HBA to show up on the host, I knew I needed to do one last step: ensure that I had used the right driver, because once this goes into production, we won’t be able to do any more testing with it.

Within the storage adapters section, ESXi lists the adapters as Brocade-425/825:

… so I went to the Brocade site to determine if a cross-reference guide was available, and indeed there was:

Thursday, September 23, 2010

We’re currently in the process of refreshing a client’s VI3 environment to vSphere 4.1 and procured a new server to add to the existing cluster. While performing the ESXi install on the new server, I did not have an ESXi 4.1 CD available, so I went ahead and installed 4.0, figuring I’d update it with the vSphere Host Update Utility. Those who have read the vSphere 4.1 release notes probably already know that you cannot update a host from 4.0 to 4.1 with that utility, so this post shows how you can update the host with the vSphere CLI instead.

I started off with using the VMware vSphere Host Update Utility I had installed on my laptop to try and update the ESXi 4.0.0 build-261974.

As shown in the following screenshot, scanning a fully patched ESXi 4.0.0 won’t give you an option to upgrade the host to version 4.1.0.

(The build I downloaded for this upgrade was: VMware-vSphere-CLI-4.1.0-254719.exe)

… and began installing it:

The following screen took extremely long to finish, and I remember not having this issue on my last deployment when I installed it on a server, so my guess is that some other application on my laptop caused the delay.

Once I completed the installation, I went ahead to download the upgrade package. Make sure you download the proper upgrade package in a ZIP package and not the regular installable ISO as the latter will not allow you to use vSphere CLI to upgrade the host.

While the package downloads, we can use the wait to put the host into maintenance mode:

Once you’ve downloaded the zip package, DO NOT uncompress it. Simply place it into a directory of your choice and then open up the VMware vSphere CLI.

C:\Program Files\VMware\VMware vSphere CLI>vihostupdate
'vihostupdate' is not recognized as an internal or external command,
operable program or batch file.

C:\Program Files\VMware\VMware vSphere CLI>dir
 Volume in drive C has no label.
 Volume Serial Number is 4802-7E84

 Directory of C:\Program Files\VMware\VMware vSphere CLI

09/23/2010 06:50 AM <DIR> .
09/23/2010 06:50 AM <DIR> ..
09/23/2010 06:50 AM <DIR> bin
09/23/2010 07:03 AM <DIR> Perl
09/23/2010 06:49 AM <DIR> PPM
0 File(s) 0 bytes
5 Dir(s) 7,619,219,456 bytes free

C:\Program Files\VMware\VMware vSphere CLI>cd bin

C:\Program Files\VMware\VMware vSphere CLI\bin>

As shown in the above screenshot, the vihostupdate.pl script is actually in the C:\program files\VMware\VMware vSphere CLI\bin directory.

In the screenshot above, I actually made 2 mistakes, the first one being running vihostupdate without the .pl extension.

The 2nd mistake is shown in the screenshot below:

I originally unzipped the package because I thought vihostupdate.pl was supposed to be run against a directory, when in fact it expects a zip package. The following is the output, and I’ve also highlighted the error you get if you specify a directory:

Thoughts: Coming from a Windows background, I personally don’t like to do upgrades, and I was told by my colleague that our practice lead recommends simply reinstalling ESXi on the host. The problem I have with that is that you lose all your settings, so if you have a lot of hosts, upgrading might be the better route to take.

I hope this has been beneficial to the other professionals out there and possibly even save them some time.