Archive for September, 2016

I came across the following interesting situation with an 8600 StorSimple device running software version 3.0 (17759).

Problem description

All iSCSI volumes from the StorSimple device in question are down at the Windows 2012 R2 host (about a dozen volumes in this case). If you create a new volume and present it to the Windows host, attempting to partition it (GPT) fails with the error message ‘disk not ready’.

Two additional observations were made:

All cloud snapshots failed about a month before this incident.

Software update 3.0 was applied to this device at roughly the same time the incident occurred. The accompanying firmware update was not applied.

The timeline is as follows:

Storage account was deleted prior to 6/28/2016 (not showing in Operation Logs which are kept for 90 days)

60+ days later (8/28/2016) all cloud snapshots started to fail. The error message suggests failure to access the storage account

82+ days later around 9/20/2016, users started to report volumes not available

9/24/2016, software update 3.0 was applied from the classic portal
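The gaps in the timeline can be checked with a few lines of Python. This is only a sketch of the arithmetic: the dates come from the timeline above, while the names `latest_possible_deletion`, `events`, and `days_since_deletion` are made up for illustration. Because the deletion is only known to have happened before 6/28/2016, each result is a lower bound, which is consistent with the "60+" and "82+" figures.

```python
from datetime import date

# The Storage Account was deleted some time before this date, so every gap
# computed below is a minimum, not an exact figure.
latest_possible_deletion = date(2016, 6, 28)

# Event dates taken from the timeline above.
events = {
    "cloud snapshots start failing": date(2016, 8, 28),
    "users report volumes unavailable": date(2016, 9, 20),
    "software update 3.0 applied": date(2016, 9, 24),
}


def days_since_deletion(event_date, deleted_by=latest_possible_deletion):
    """Return the minimum number of days between the deletion and an event."""
    return (event_date - deleted_by).days


for label, when in events.items():
    print(f"{label}: at least {days_since_deletion(when)} days after deletion")
```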

Initial testing

About a dozen volumes were provisioned from this device to one Windows 2012 R2 host. Volume Containers were associated with 3 Storage Accounts in the same subscription

One of the 3 Storage Accounts (the one at the top of the above image) was missing. Apparently it was inadvertently deleted.

Get-HCSSystem showed normal device condition

iSCSI connectivity, including the iSCSI initiator and MPIO configuration, was reviewed and tested, and showed no issues.

Ping (Test-Connection) and tracert.exe (Trace-HCSRoute) from each of the host iSCSI interfaces to each of the device iSCSI interfaces and back came back OK.

Test-HCSMConnection showed no problems.

Test-HcsStorageAccountCredential against the 2 existing Storage Accounts showed no problem.

Troubleshooting and Root Cause Analysis

I initially suspected that the Storage Account keys were changed without getting synchronized with the StorSimple Manager service, cutting off the device from its Storage Accounts. That would explain both the failure of all volumes and the cloud snapshot failure, since both need to read from and write to Storage Accounts.

However, Operation Logs showed no events or records related to change of Storage Account keys associated with a StorSimple volume.

Operation Logs showed no event/record of Storage Account deletion.

Synchronizing the Storage Account keys of the 2 existing Storage Accounts did not solve the problem.

After I opened a ticket with Microsoft, they obtained a device Support Package and recognized that the device was constantly trying and failing to reach the volume whose Storage Account had been deleted, which in turn caused it to fail to serve the remaining, unaffected volumes.

All 3 volumes will fail (become inaccessible) after some time (see the questions and answers section below for how much time)

Solution

Create a Storage Account with the same name as the one that was accidentally deleted, and synchronize the keys with the StorSimple Manager service

Although this solution frees the device to serve the volumes whose Storage Accounts have not been deleted, it does not restore the volume(s) whose data was lost when their Storage Account was deleted. The data of such volumes needs to be restored from a snapshot.

Questions and answers:

If the Storage Account had been deleted 60+ days before cloud snapshots started to fail, what triggered the cloud snapshot failure at that point, assuming it was caused by the Storage Account deletion?

If the Storage Account had been deleted 82+ days before volumes started to fail, what triggered the volume failure at that point, assuming it was caused by the Storage Account deletion?

[Microsoft]:

What happened here is that eventually, all the failed authentication attempts to the deleted storage account filled out the barrier queue (queue to the cloud). Once it is filled beyond a certain point, it becomes completely stuck, and anything in line behind it is unable to get through. It was once the barrier queue was completely overrun with all these connection issues to the deleted storage account that cause all other cloud traffic to be affected. This has the same effect as losing your cloud connection, and with this being a hybrid appliance when this happens it can cause many different issues such as we saw here with volumes being unavailable and backups unable to complete.
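The delayed failure described above can be illustrated with a toy model. This is a sketch under loud assumptions, not StorSimple's actual implementation: it assumes a single shared queue with a fixed number of slots, one request per account per tick, and that a request to a deleted account never completes and keeps its slot. The function name, account names, and capacity are all hypothetical.

```python
def first_blocked_tick(accounts, deleted, capacity, max_ticks):
    """Return the first tick at which a request to a healthy account finds
    the shared queue full, or None if that never happens within max_ticks."""
    stuck = []  # requests that can never complete; each one holds a slot
    for tick in range(1, max_ticks + 1):
        for account in accounts:
            queue_full = len(stuck) >= capacity
            if account in deleted:
                if not queue_full:
                    # The request fails, is retried, and keeps its slot.
                    stuck.append(account)
            elif queue_full:
                # Healthy traffic can no longer get into the queue.
                return tick
            # Otherwise the healthy request uses a free slot, completes,
            # and releases the slot again.
    return None


accounts = ["account-a", "account-b", "account-c"]  # hypothetical names
print(first_blocked_tick(accounts, deleted={"account-a"}, capacity=5, max_ticks=100))
print(first_blocked_tick(accounts, deleted=set(), capacity=5, max_ticks=100))
```

In this model nothing goes wrong for the healthy accounts until the stuck requests have consumed every slot, which mirrors the long gap between the deletion and the first visible symptoms.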

Recommendations to Microsoft

Log events of Storage Account key changes in Operation Logs

Currently 90 days’ worth of events show up in Operation Logs. It would be helpful if that retention period were configurable by the client on each subscription

Make the device Support Package available to the client without the need for a key from Microsoft. In this case, information available only in the Support Package held the key to the workaround/solution.

Update the device software so that loss of a Storage Account affects only its associated volumes, not all volumes (perhaps a separate queue per volume container instead of a queue per device)

Update the StorSimple Manager service and/or Storage Account so that a Storage Account cannot be deleted if there’s an associated StorSimple Volume Container
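To make the queue-per-container recommendation concrete, here is a rough sketch that compares one device-wide queue with one queue per Storage Account. It is a toy model only: the slot-based queue, the `shared` switch, and all names are assumptions made for illustration, and it says nothing about how the device software is really structured.

```python
def blocked_accounts(accounts, deleted, capacity, ticks, shared):
    """Return the set of accounts that had at least one request rejected.

    shared=True models a single queue for the whole device;
    shared=False models a separate queue per account (volume container).
    """
    stuck = {}       # queue key -> slots held by requests that never complete
    blocked = set()
    for _ in range(ticks):
        for account in accounts:
            key = "device" if shared else account
            if stuck.get(key, 0) >= capacity:
                blocked.add(account)               # queue full: request rejected
            elif account in deleted:
                stuck[key] = stuck.get(key, 0) + 1  # fails and keeps its slot
            # Healthy requests with a free slot complete and release it.
    return blocked


accounts = ["account-a", "account-b", "account-c"]  # hypothetical names
print(sorted(blocked_accounts(accounts, {"account-a"}, 5, 20, shared=True)))
print(sorted(blocked_accounts(accounts, {"account-a"}, 5, 20, shared=False)))
```

With the shared queue every account ends up blocked; with per-account queues only the account whose Storage Account was deleted is affected.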

This example checks for connections where the local host is listening on TCP port 5985 (used by the VMM agent, which communicates over WBEM WS-Management HTTP) and returns the IP address of the remote host (the VMM server). VMM is System Center Virtual Machine Manager. If it returns nothing, this machine is not listening on port 5985 (the VMM agent is not running).
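The listing that paragraph refers to is not included in this text. As a stand-in, here is a minimal Python sketch of the same idea, not the original script. It assumes the Windows `netstat -an` column layout (Proto, Local Address, Foreign Address, State); the function names and the sample addresses are made up for illustration.

```python
import subprocess


def remote_hosts_on_port(netstat_output, port=5985):
    """Return the remote IPs of established TCP connections whose local
    port is `port`, given text in the Windows `netstat -an` layout."""
    remotes = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue  # skip headers, UDP rows, and blank lines
        local, remote, state = fields[1], fields[2], fields[3]
        if state.upper() != "ESTABLISHED":
            continue
        # rpartition splits on the last colon, so "[::1]:5985" also works.
        if local.rpartition(":")[2] == str(port):
            remotes.add(remote.rpartition(":")[0])
    return sorted(remotes)


def current_remote_hosts(port=5985):
    """Run `netstat -an` on this machine and parse its output."""
    result = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return remote_hosts_on_port(result.stdout, port)


# Made-up sample output, used here instead of a live netstat call.
sample = """
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:5985           0.0.0.0:0              LISTENING
  TCP    10.0.0.5:5985          10.0.0.20:51234        ESTABLISHED
  TCP    10.0.0.5:3389          10.0.0.30:49800        ESTABLISHED
"""
print(remote_hosts_on_port(sample))
```

On a live host, `current_remote_hosts()` returns an empty list when nothing is connected on port 5985.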

You may need to move your StorSimple 8k iSCSI SAN from one physical location to another. Assuming that the new location is not on another continent or thousands of miles away, the following process is what I recommend for the move:

On the file servers that receive iSCSI volumes from this StorSimple device, open Disk Management, and offline all volumes from this StorSimple device

(Optional) In the classic portal, under the device/maintenance page, install the latest Software and Firmware update. This otherwise unrelated step is included here to take advantage of the down time window to perform the device update. This may take 1-12 hours, and may require access to the device serial interface.

Ensure that you have the Device Administrator password. You’ll need that to change the device IP configuration for the new site. If you don’t have it, you can reset it by going into the classic portal, under the device/configuration page.

Power down the device: in the classic portal, under device/maintenance, click Manage Controllers at the bottom and shut down Controller0, then repeat to shut down Controller1

After the device is powered down, toggle the power buttons on the back of the PCMs to the off position. Do the same for the EBOD enclosure if this is an 8600 model device.

Move the device to the new location

Rack, cable, and power on the device by toggling the power buttons on the back of the PCM modules.

In the serial console,

Type 1 to login with full access, enter the device Administrator password.

Type in Invoke-HCSSetupWizard, enter the new information for data0 interface: IP, mask, gateway, DNS server, NTP server, Proxy information if that’s needed for Internet access in the new site (Proxy URL as http://my.proxy.domain.com:8888, authentication is typically T for NTLM, Proxy username and password if needed by your Proxy – Proxy must be v1.1 compliant)

Back in the classic portal, you should see your device back online. Go to the device/configuration page and update any settings as needed, such as the controller0 and controller1 fixed IPs, and the iSCSI interface configuration if that has changed.

This post lists StorSimple software versions, their release dates, and major new features for reference. Microsoft does not publish release dates for StorSimple updates. The release dates below are from published documentation and/or first hand experience. They may be off by up to 15 days.

Major new features: (Azure-side) Migration from legacy 5k/7k devices to 8k devices, support for Azure US GOV, support for cloud storage from other public clouds such as AWS/HP/OpenStack, update to latest API (this should allow us to manage the device in the new portal, yet this has not happened as of 9/9/2016)

This post describes one experience of updating a StorSimple 8100 series device from version 0.2 (17361) to the current (8 September 2016) version 3.0 (17759). It’s worth noting that:

StorSimple 8k series devices that shipped in mid 2015 came with software version 0.2

Typically, the device checks periodically for updates and when updates are found a note similar to this image is shown in the device/maintenance page:

The device admin then picks the time to deploy the updates by clicking the INSTALL UPDATES link. This kicks off an update job, which may take several hours.

This update method is known as updating StorSimple device using the classic Azure portal, as opposed to updating the StorSimple device using the serial interface by deploying the update as a hotfix.

Released updates may not show up, in spite of scanning for updates manually several times:
The image above was taken on 9 September 2016 (update 3.0 is the latest at this time). It shows that no updates are available even after scanning for updates several times. The reason is that Microsoft deploys updates in a ‘phased rollout’, so they’re not available in all regions at all times.

Updates are cumulative. This means that a device running version 0.2, for example, can be upgraded directly to 3.0 without the need to manually update to any intermediate version first.

An update may include one or both of the following 2 types:

Software updates: This is an update of the core 2012 R2 server OS that’s running on the device. Microsoft identifies this type as a non-intrusive update. It can be deployed while the device is in production, and should not affect mounted iSCSI volumes. Under the covers, the device’s controller0 and controller1 are 2 nodes in a traditional Microsoft failover cluster. The device uses the traditional Cluster Aware Update to update the 2 controllers: it updates and reboots the passive controller first, fails over the device (iSCSI target and other clustered roles) from one controller to the other, then updates and reboots the second controller. Again, this should be a no-down-time process.

Maintenance mode updates:

These are updates to shared components in the device that require down time. Typically we see LSI SAS controller updates and disk firmware updates in this category. Maintenance mode updates must be done from the serial interface console (not Azure web interface or PowerShell interface). The typical down time for a maintenance mode update is about 30 minutes, although I would schedule a 2 hour window to be safe. The maintenance mode update steps are:

On the file servers, offline all iSCSI volumes provisioned from this device.

Log in to the device serial interface with full access

Put the device in Maintenance mode: Enter-HcsMaintenanceMode, wait for the device to reboot

Identify available updates: Get-HcsUpdateAvailability, this should show available Maintenance mode updates (TRUE)

Start the update: Start-HcsUpdate

Monitor the update: Get-HcsUpdateStatus

When finished, exit maintenance mode: Exit-HcsMaintenanceMode, and wait for the device to reboot.