New Year's Resolution: Post-Checks for a Nimble Install

I thought I would write a quick blog to bring in the New Year! It's the time of year when we make New Year's resolutions, maybe to challenge ourselves to accomplish something: drink/eat less, exercise more, save more, or set a goal to complete a marathon, visit a place or achieve a qualification! I thought I would dedicate this post to all the things you should check on your running Nimble array (whether it be an All-Flash or an Adaptive Array) once it has been installed. These are my best practices, which should be checked after every install (whether it's been completed by yourself, a Partner Engineer or even a Nimble Engineer).

So here goes:

Check and Test AutoSupport is enabled

What to check?

Is AutoSupport enabled within NimbleOS? To check this, simply go to the management GUI and open Administration > Alerts and Monitoring > Autosupport & HTTP Proxy. Next, check that the Send AutoSupport Data to Nimble Storage Support checkbox is ticked and then click Test AutoSupport Settings. The test should come back all green, like the picture below:

Top Tip: If the check comes back red, it means one of the tests has failed. Click the triangle icon to expand all the tests and see which one failed. If the failure is Ping from controller A | B IP failed, this probably means that ICMP is being blocked by your firewall. It doesn't actually cause too much of an issue if this fails, but if you want the test to pass green, set your firewall rules to allow outbound ICMP from the controllers' management IP address and the two diagnostic IP addresses, then retry the check. The full list of firewall ports to be enabled is here.

Why is this important?

This check is really important to ensure alerts and AutoSupport telemetry are getting back to Nimble Support. If this checkbox is not ticked, or there is a failure on the verification, then Nimble Support will not be receiving any telemetry from your controllers and therefore InfoSight will not work either.

Check and Test Email Alerts are enabled

What to check?

This check is very similar to AutoSupport above; it ensures email alerts are configured. To check this, simply go to the management GUI and open Administration > Alerts and Monitoring > Email Alerts. Check that Send Event Data to Nimble Storage is ticked and then click the Test button at the bottom.

After clicking Test the mailbox that is listed in the 'Send To' address should receive an email similar to the one below:

Why is this important?

This check is really important because it's a safety mechanism. If AutoSupport were to fail for some reason (e.g., the firewall ports were to become blocked), it provides an alternative mechanism to get the alerts to Nimble and ensure any events on the array are known by Nimble Support.

Top Tip: If the tests succeed and no mail arrives, double check your junk folder. If there is still no mail, check that your Exchange relay has been configured correctly. The Knowledge Base article for the check is here.

Check the Physical Health of the Array (Interfaces, Disks, Fans etc)

What to check?

This is an obvious check, but one well worth making. Click on Manage > Arrays, select your array's serial number at the bottom, and check that everything is green. Any red means there is a problem (this will also be shown on the events page).

Your view should look similar to this AFA:

Ideally all of your onboard ports (eth1/eth2) and data ports (tg1-tg4) should be lit green. Any red here means those ports are down or not connected.

The array above is a lab system, so it's not fully cabled. In a production environment every port should be green. If it's not, please talk to your partner and/or Nimble SE to understand why and what the implications are.

Top Tip: Mousing over the ports/disks/fans will show more information and physically identify the port, including its MAC address and negotiated speed:

Why is this important?

It ensures that your controllers are fully resilient and that there have been no failures in shipping. Any failures can be rectified before live services are migrated, and of course a fully resilient system will provide peace of mind as you start to migrate your applications to their new home.

Asset Address is correct in InfoSight

What to check?

Again, this can be a fairly obvious check but is often overlooked. Log in to InfoSight and click Administration > Asset Registry.

You should see a row for each controller that is registered to your company (similar to below):

Check that the install address is correct and most importantly check that the RMA Part Delivery address is correct.

Why is this important?

The Install address is where we will send the Engineer to site (assuming you have purchased 4Hr OnSite Engineering Assistance). The RMA Part Delivery address is the location to which we will send any replacement parts. There are many customers that have a central IT function but have their arrays installed in different locations, and they may want their parts all delivered to Central IT rather than the install address. This makes sure we get it right first time and there is no confusion should any replacement parts be needed in the future.

Top Tip: You can verify the RMA Part Delivery address by clicking confirm. Also, if the address is incorrect or you physically move the array, you can update the array's location here and provide instructions on when each site is attended and parts can be delivered.

InfoSight is receiving data and VMVision is configured (currently VMware customers only)

What to check?

If you have set up AutoSupport correctly (as described in the first step) then the array should be sending its telemetry to Nimble Support.

You can check whether AutoSupport heartbeats are being received by using the Asset Dashboard in InfoSight. Go to InfoSight and click on Reports > Asset Report.

Each array will be shown as per below:

Notice the icons in the bottom left-hand corner. Each will tell you whether AutoSupports are being received (these can take up to 24 hours), whether Heartbeats are being received, and the Support Contract for the array (and when it expires).

Quite often we see installs where VMVision hasn't been configured. If you are not sure what VMVision is, please check out my blog here. It basically provides per-VM monitoring to VMware customers; despite the value it provides, it costs nothing and requires no software to be installed. There are two steps to ensuring VMVision is configured:

1. Ensure the VMPlugin has been installed / registered within VMware. This is achieved by selecting Administration > VMware Integration from the Array GUI and providing your VMware vCenter credentials:

2. Once the plugin has been registered, enable VMVision to give Nimble permission to collect and display this information. This is achieved by heading to InfoSight and clicking Manage > VMVision. If you see a screen similar to the one below, then VMVision hasn't been enabled:

Enabling InfoSight and ensuring it is streaming data allows you to use the Predictive and Proactive monitoring that InfoSight supports. VMVision allows you to manage the end-to-end infrastructure and spot cross-stack root-cause problems quickly and efficiently. There is also no license or cost associated with either InfoSight or VMVision. Hopefully the data and foresight they provide will make you look like a hero to your boss and peers alike!

Ensure the Support number is to hand

What to check?

In 25 years of being in this industry, one best practice I have picked up is to have the Support number handy for all mission-critical products. You'll find the Nimble Storage global numbers here (they are manned 24 x 7 x 365).

Why is this important?

It is a common best practice to call Support once as a dry run, so that you have already been through the process before you really need it (rather than the first call into Support being at a time of need and stress).

Top Tip: A good check is to enable the support tunnel (the checkbox underneath enabling AutoSupport) and then call into Support to check they can contact the array okay (this also checks that Support can remotely connect to the array to provide world-class support when it's needed).

Ensure InfoSight Events are being reviewed

What to check?

It's important to frequently review your infrastructure to ensure performance, capacity and availability aren't likely to be compromised in the near future (a watched pot doesn't boil over). Fortunately this is relatively easy and straightforward with InfoSight.

Set aside a time to check the following:

Events Page (InfoSight > Wellness Tab)

Each row is a discrete event and details what requires your attention. Expanding the event shows more details on the resolution and what is required. The events should be reviewed, acknowledged and actioned. There is no need to contact Nimble Support unless you need assistance with the resolution.

Over 90% of our cases are identified automatically using InfoSight and over 80% of cases are resolved simply by following the actions below.

Capacity (InfoSight > Reports > Capacity Report)

This view allows you to monitor the capacity of each controller and to predict when capacity will be breached (based on historic growth trends). InfoSight will alert when capacity reaches 90%, because at 95% the lack of space will start to impact performance.
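To get a feel for how much runway those two thresholds leave, the projection can be approximated by hand. This is only a minimal Python sketch, not how InfoSight actually forecasts: it assumes simple linear growth, and the function name `days_until` and the sample figures are made up for illustration. Only the 90% and 95% thresholds come from the paragraph above.

```python
from typing import Optional

ALERT_THRESHOLD = 0.90   # InfoSight alerts at 90% used (from the text above)
IMPACT_THRESHOLD = 0.95  # lack of space starts to impact performance at 95%


def days_until(threshold: float, used_tib: float, total_tib: float,
               growth_tib_per_day: float) -> Optional[float]:
    """Estimate days until usage reaches `threshold` (a fraction of total).

    Assumes constant linear growth, which is a simplification.
    Returns 0.0 if the threshold is already reached, and None if
    usage is flat or shrinking (the threshold is never reached).
    """
    target_tib = threshold * total_tib
    if used_tib >= target_tib:
        return 0.0
    if growth_tib_per_day <= 0:
        return None
    return (target_tib - used_tib) / growth_tib_per_day


if __name__ == "__main__":
    # Hypothetical array: 100 TiB usable, 60 TiB used, growing 0.5 TiB/day.
    for name, threshold in (("alert", ALERT_THRESHOLD),
                            ("impact", IMPACT_THRESHOLD)):
        days = days_until(threshold, used_tib=60.0, total_tib=100.0,
                          growth_tib_per_day=0.5)
        print(f"{name} threshold ({threshold:.0%}): about {days:.0f} days away")
```

The gap between the two results is the time you would have to act after the 90% alert fires, if growth carried on at the same rate.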

Performance (InfoSight > Reports > Performance Report)

This view allows you to monitor the performance of each controller and to predict when performance will be breached (based on historic growth trends). In an Adaptive array there are two trends to monitor: CPU (which equates to available IOPS) and Cache (which equates to predictable read latency). In an All-Flash array the only metric is CPU (available IOPS).

Should any of the graphs be sustaining red regions then please contact your Partner SE and/or Nimble SE to discuss the performance characteristics in more detail.

Top Tip: Don't be alarmed if, in the first couple of weeks, the array alerts that it's running out of space or performance. Migrating data to the array is commonly an intensive data-processing load, and the sudden increase in stored data can fool the heuristics into predicting that the array is running out of capacity, performance (CPU) and cache. Give InfoSight a week or two to normalise.

Running the latest GA Code

What to check?

By running the latest General Availability (GA) code you will automatically be running the Nimble Support recommended mission-critical code. Nimble OS is only awarded General Availability status when tough and rigorous criteria are met with regard to uptime, critical bugs and run time in the field. It is the code we recommend for all mission-critical systems. Running the latest code also ensures you have the latest features and the latest patch fixes.

You can check whether your array is running the latest code by clicking the Nimble OS version in any of the InfoSight graphs or by running Administration > Software Update > Check Software in the Nimble GUI. InfoSight will also give you the valid install paths to get to the latest code (if several steps are available):

As all Nimble code upgrades are non-disruptive, you should be able to upgrade to the latest GA release with zero disruption by following the Software Update process in the Nimble GUI.

If, when checking the software upgrade, there is a red mark next to the code release, that is an indication that Support have black-listed your array to prevent a potential problem. If you see this, we would recommend placing a call to Nimble Support to understand why the array has been black-listed.

Top Tip: Before migrating applications to the array, I would also recommend performing a code upgrade while a workload generation tool is running, to satisfy yourself that code updates are truly non-disruptive. This should be part of any customer's commissioning test plan.

Why is this important?

It ensures that your array is running the most stable, mission-critical code available, which is a mainstay of how Nimble achieves 99.9999% availability. You also get the benefit of all the latest code features and optimisations!
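To put the 99.9999% figure in perspective, the downtime it permits works out to roughly 31.5 seconds per year. The arithmetic is sketched below in Python; the function name is made up for illustration, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the downtime per year permitted by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return SECONDS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for percent in (99.9, 99.99, 99.999, 99.9999):
        seconds = allowed_downtime_seconds(percent)
        print(f"{percent}% availability allows about {seconds:,.1f} s of downtime per year")
```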

Nimble Connection Manager is a host-side utility that handles path management and connection management (in Windows, Linux and VMware environments). It is an essential piece of software when scale-out groups are deployed, but even with a single array it allows volume provisioning and path management to be accomplished in the most efficient and stable manner, because it ensures a number of best practices are set automatically rather than relying on users to remember to set options manually. There are much more in-depth deep dives into the functions of Nimble Connection Manager at these links: Linux, Windows and VMware.

Why is this important?

This ensures that your hosts are fully optimised to use and integrate with Nimble Storage, which in turn leads to a much more stable and easier environment to manage.

Finally, are you Thrilled to be a Reference Customer?

What to check?

At Nimble, we take huge pride in making every customer thrilled to be a reference. If you're not happy with your setup then please let your Nimble SE know. We aim to make every customer exceptionally happy!

Top Tip: Did you know that if you recommend Nimble to a friend or a colleague (for an opportunity that was previously unknown to Nimble) and they become a Nimble customer too, we will reward your referral with a free gift? Please consider introducing us; the referral page is here.

Why is this important?

We want our customers to be successful; ultimately, looking after our customers will ensure our business thrives. This philosophy has served us well since Nimble's inception in 2008!

Finally, I have attached a quick checklist here, which covers each of these items. Please ensure it's checked after every install!

If you have further items that you feel should be added to this list then please let me know and I will add them!

Thanks, Happy New Year and I hope you manage to stick to all your New Year's Resolutions!!!

rfenton4

Here are a couple of items I like to check during my deployments. They're more focused on validating Host connectivity, and come after provisioning Volumes, connecting Hosts and adding data to the Volumes:

1. Check for MPIO functionality: From the Nimble Group's administration page, Monitor->Interfaces. Look at the data interfaces; they should show pretty close to identical usage if Nimble's PSP (vSphere) or Round Robin is being used. I'm pretty sure Least Queue Depth for Windows hosts will show similar usage.

2. Check for multiple connections from Hosts: From the Nimble Group's administration page, Monitor->Connections. A quick scroll through the 'Number of Connections' column is an easy way to quickly identify misconfigured Hosts, networking, or maybe bad connectivity. One thing to note, any recently connected Hosts may not populate all of their connections until some I/O has been generated so this might be a good follow up task.
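If you note down the figures from those two pages, both checks can be turned into a quick sanity script. This is only a rough Python sketch: the values are typed in by hand rather than pulled from the array, and the function names, the 20% tolerance, the host names and the expected connection count are all assumptions for illustration, not Nimble recommendations.

```python
from typing import Dict, List


def is_balanced(usage_by_interface: Dict[str, float],
                tolerance: float = 0.20) -> bool:
    """Return True if every interface is within `tolerance` of the mean usage.

    The 20% default is an arbitrary illustrative value.
    """
    values = list(usage_by_interface.values())
    if not values:
        return True
    mean = sum(values) / len(values)
    if mean == 0:
        return True  # no traffic at all, so nothing to compare
    return all(abs(value - mean) <= tolerance * mean for value in values)


def hosts_below_expected(connections_by_host: Dict[str, int],
                         expected: int) -> List[str]:
    """Return (sorted) the hosts showing fewer connections than expected."""
    return sorted(host for host, count in connections_by_host.items()
                  if count < expected)


if __name__ == "__main__":
    # Hypothetical readings copied from Monitor->Interfaces (MB/s).
    interfaces = {"tg1": 410.0, "tg2": 395.0, "tg3": 402.0, "tg4": 388.0}
    print("Data interfaces balanced:", is_balanced(interfaces))

    # Hypothetical 'Number of Connections' values from Monitor->Connections.
    connections = {"esx01": 4, "esx02": 4, "win01": 1}
    print("Hosts to investigate:", hosts_below_expected(connections, expected=4))
```

As noted above, recently connected Hosts may show fewer connections until some I/O has been generated, so a host flagged here is worth re-checking later rather than treating as a definite fault.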