InsightIQ – Isilon Monitoring Extravaganza

Once your Isilon cluster is up and running, you’ll want to keep an eye on it. InsightIQ is an extremely useful piece of software for monitoring both performance and capacity usage. It’s very easy to set up and powerful in both pro-active and reactive monitoring scenarios. Either sit back and watch the scheduled reports land in your mailbox, or take a more active approach and drill down to find the source of a performance problem. Let’s explore further!

InsightIQ

InsightIQ is a performance monitoring and reporting tool for EMC Isilon systems. It extends the basic monitoring features built into the Isilon clusters with additional, more detailed metrics. It is a licensed product from EMC, and you will need a license for every cluster that you want to monitor. Do not despair: it’s totally worth it!

The installation of InsightIQ is straightforward: you download a virtual appliance from EMC, upload it to your VMware farm and configure it (give it a name, IP address, etc.). This appliance will build a repository of monitoring data, which can be stored either in the appliance itself or on an Isilon cluster using an NFS share. Recommendation: put it on an Isilon! Next, enable the insightIQ user on the Isilon clusters you want to monitor, click the add cluster button in InsightIQ and you’re (pretty much) done! It will start polling the cluster(s) periodically and collect the data for whatever future analysis you have in mind.

Performance monitoring with InsightIQ

So let’s start with the most exciting monitoring: performance. Everybody likes performance! It’s our job to make sure the system is performing properly and will continue to do so in the future. We’ll need two types of monitoring: pro-active monitoring to make sure we’re not hitting any limitations in the future, and also reactive, on-demand monitoring to find out why something is going or went wrong. InsightIQ can do both and is fast at it.

You’ve already seen the opening screen in the screenshot at the start of this post. When you open the InsightIQ dashboard, it immediately shows you the most important facets of the Isilon: capacity, number of connections, CPU utilization and various types of throughput. If you want more information, move over to the Performance Reporting tab.

The screen above shows a very small subset of the statistics that are available under performance reporting. It’s possible to adjust the zoom level of the graphs (anything between 30 minutes and a year) or to select a different time period. What’s pretty apparent is the speed with which these graphs are displayed. Change a setting and the graphs are updated to reflect the new data in no time at all: less than a second in most cases.

If you want to drill down, that’s easy. For example, looking at the protocol operations graph, we can see some peaks. Let’s find out what those peaks are!

First of all, let’s break out the protocol operations per client. This shows us three active clients: the top one is the InsightIQ machine itself, storing monitoring data on this Isilon, the second is a PACS archiving system and the last one is an entirely different machine. A fun detail: initially, InsightIQ will display the IP addresses of the clients, but it will quickly replace them with the fully qualified domain name (FQDN) after looking them up in DNS. This should make your life a bit easier, since the name of a machine usually tells you a bit more than an IP address…

As we can see, the InsightIQ machine generates the majority of protocol operations and the PACS machine is the one generating the peaks. This makes sense since InsightIQ is constantly polling and the PACS system is only generating data when an image is requested or stored. But let’s dig deeper. I want to know what the PACS client is doing so I click on the little green plus-sign next to it to display only the protocol operations relevant to that client.
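The reasoning above (one client carries the steady base load, another causes the peaks) can also be expressed in a few lines of Python. This is only a rough sketch: the client names and the operations-per-second samples are made up for illustration and do not come from InsightIQ, and "burstiness" is simply defined here as peak divided by mean.

```python
from statistics import mean

# Hypothetical protocol operations per second, sampled per client.
# In practice you would read these off the per-client breakout.
samples = {
    "insightiq-vm": [310, 305, 298, 312, 307, 301],
    "pacs-archive": [20, 15, 640, 25, 18, 580],
    "other-host": [5, 8, 4, 6, 7, 5],
}

def summarize(series):
    """Return mean, peak and a simple burstiness figure (peak / mean)."""
    avg = mean(series)
    peak = max(series)
    return {"mean": avg, "peak": peak, "burstiness": peak / avg if avg else 0.0}

summary = {client: summarize(series) for client, series in samples.items()}

# The client with the highest average generates the majority of operations.
busiest = max(summary, key=lambda c: summary[c]["mean"])
# The client with the highest peak-to-mean ratio is the one causing the peaks.
burstiest = max(summary, key=lambda c: summary[c]["burstiness"])

print("Steady base load:", busiest)
print("Source of the peaks:", burstiest)
```

With this sample data the constantly polling machine comes out as the base load and the PACS-like client as the source of the peaks, which mirrors what the graph shows.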

Data is now filtered, the graph is updated and I can break out the data again. In this case I’m interested in the various Op Classes. I can see that the peaks during the day are mostly namespace_reads and reads. At the end of the business day there’s a burst of writes to the system. This makes perfect sense, since during the day medical images are viewed by the radiology staff and after 6 pm the images created during the day are flushed to the archive (= Isilon).

Why would I want to view this data? Namespace operations are queries or changes in metadata. These are the type of operations that benefit greatly from SSDs in the Isilon. So if I were running into performance issues and seeing a bizarre amount of namespace operations, I might consider adding nodes that have SSDs installed so I could enable the Global Namespace Acceleration feature.
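To put a number on "a bizarre amount of namespace operations", you could calculate the share of namespace operations in the total. The sketch below is just an illustration: the counts per Op Class are invented, the class names other than the ones mentioned above are assumptions, and the 50% threshold is an arbitrary example, not an EMC guideline.

```python
# Hypothetical operation counts per Op Class for one client.
ops_by_class = {
    "namespace_read": 41000,
    "read": 23000,
    "write": 9000,
    "namespace_write": 2000,
    "other": 5000,
}

# Arbitrary example threshold, not an official recommendation.
NAMESPACE_HEAVY = 0.5

def namespace_share(ops):
    """Fraction of all operations whose class name starts with 'namespace'."""
    total = sum(ops.values())
    if total == 0:
        return 0.0
    namespace_ops = sum(
        count for name, count in ops.items() if name.startswith("namespace")
    )
    return namespace_ops / total

share = namespace_share(ops_by_class)
print(f"Namespace operations: {share:.1%} of all protocol operations")
if share > NAMESPACE_HEAVY:
    print("Metadata-heavy workload: SSDs for metadata may be worth a look")
```

For the sample counts above, namespace operations make up a little over half of the total, so this workload would be flagged as metadata-heavy.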

At this point I can continue to break out until I run out of metrics. In this case I chose the node option, which shows me which nodes service the namespace_reads requested by the PACS host. Again: graphs are updated in a second or so, so I can troubleshoot quickly and efficiently. This is extremely useful: graphs that take tens of seconds to update are a pain in the behind and slow down troubleshooting considerably. The ability to clickety-click through all the available information allows you to do some big data analytics yourself and maybe find the source of your problems a bit sooner. Or at the very least it saves you the frustration of waiting for a page to load!

File System Analysis

Apart from performance troubleshooting, there’s the option to analyze what the Isilon is actually storing: how many files are stored in which folder, how large the files are, when they were last modified, etc. Data for these statistics comes from the FSAnalyze job on the Isilon clusters, so make sure it’s scheduled and running/completing!

I can generate graphs showing current or historical statistics of the filesystem or I can generate a comparison between two points in time. Again: anything that’s underlined and blue can be clicked to drill down!

For example, this is a comparison of today against a week ago. I can see that during the last week 7 directories were created under the /ifs/data folder, with a total of 8460 files and 64.2 GB of logical data. Add the necessary protection overhead and this results in 97.5 GB of physical data. Moving to the next graph, the majority of files are 1-10 MB in size, with files of 10-100 MB being the runner-up. And of course I can break out all these graphs again.
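The numbers in this comparison lend themselves to a quick back-of-the-envelope check. The sketch below uses only the figures from the report above; the one assumption is that 1 GB equals 1024 MB, which may not be exactly how InsightIQ rounds its values.

```python
# Figures taken from the week-over-week comparison described above.
logical_gb = 64.2
physical_gb = 97.5
file_count = 8460

def protection_overhead(logical, physical):
    """Return (overhead in the same unit, overhead as a fraction of logical)."""
    overhead = physical - logical
    return overhead, overhead / logical

overhead_gb, overhead_ratio = protection_overhead(logical_gb, physical_gb)

# Assumption: 1 GB = 1024 MB.
avg_file_mb = logical_gb * 1024 / file_count

print(f"Protection overhead: {overhead_gb:.1f} GB ({overhead_ratio:.0%} on top of logical)")
print(f"Average logical file size: {avg_file_mb:.1f} MB")
```

This works out to roughly 33 GB of protection overhead, about half again on top of the logical data, and an average file size just under 8 MB, which fits the observation that most files fall in the 1-10 MB bucket.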

Of course you can also schedule these kinds of reports, resulting in a nice PDF report in your mailbox showing you just the information you want. Use it to perform your daily performance check, your monthly capacity analysis, etc. Let us know how you use InsightIQ! Which scheduled reports are you using, and which metrics do you find important in them?