Monday, December 17, 2007

Although I have been arguing that performance monitoring and capacity planning require a decent server monitoring environment, they also require more. The extra part comes from the fact that services often depend on each other. A web service connects to a database (hosted on a different server) and fetches data from a file server (local, or SAN/NAS). Often, one part in the chain is a bottleneck for the whole process. This is a shame and can be avoided by careful analysis of the correlations between performance data.

Again, this is an argument in favor of what I called a 'load profile' earlier. By modeling a server by means of a load profile, we get a representation of that server in terms of measurable quantities. Statistics and mathematics in general can then help us analyze the correlations between those load profiles.
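As a minimal sketch of what "analyzing the correlations" could look like, the snippet below computes the Pearson correlation coefficient between the CPU series of two servers. The server names and hourly values are made up for illustration, and Pearson correlation is only one possible choice of measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical hourly CPU percentages (illustrative data, not measurements).
web_cpu = [10, 15, 40, 70, 65, 30, 12, 8]
db_cpu = [5, 9, 35, 80, 75, 25, 10, 6]
backup_cpu = [90, 85, 10, 5, 5, 8, 12, 88]

print("web vs. database:", round(pearson(web_cpu, db_cpu), 2))
print("web vs. backup:  ", round(pearson(web_cpu, backup_cpu), 2))
```

A coefficient close to 1 means the two servers are busy at the same moments (a hint that they belong to the same chain); a negative value means one tends to be busy while the other is idle.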

In a previous post, we talked about averages (types of averages) and peaks, and how peaks can tell you something about the spread (variance, standard deviation) of the data.

Information about peaks is required (especially in capacity planning situations) to understand the sizing of the platform you're running on. On the other hand, having a peak utilization of (say) 80% and an average of 20% still does not tell you that much: how long was the system running at high CPU levels? Maybe only for 10 seconds during the day (a scheduled database operation, backup procedures, etc.)? Is it crucial for our service that this high level of CPU can be guaranteed at that moment, or is it affordable to let the application/server wait a little longer for CPU requests? Think of a mail server, for instance, where it wouldn't be a big deal if the server forwarded your mail a few milliseconds later or earlier (would it?).
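The "how long" question can be answered directly from the raw samples. The sketch below assumes samples taken at a fixed interval; the 80% threshold, the 10-second interval and the sample values are all hypothetical.

```python
def time_above(samples, threshold, interval_seconds):
    """Total seconds spent at or above the threshold (fixed sample interval)."""
    return sum(1 for s in samples if s >= threshold) * interval_seconds


def longest_run_above(samples, threshold, interval_seconds):
    """Longest uninterrupted stretch (in seconds) at or above the threshold."""
    longest = current = 0
    for s in samples:
        current = current + 1 if s >= threshold else 0
        longest = max(longest, current)
    return longest * interval_seconds


# Hypothetical CPU percentages, one sample every 10 seconds.
cpu = [20, 18, 85, 90, 22, 15, 82, 19]

print("seconds at or above 80%:", time_above(cpu, 80, 10))
print("longest stretch at or above 80%:", longest_run_above(cpu, 80, 10))
```

The two numbers answer different questions: the first says how much of the day was spent at a high level, the second whether that time came in one long burst or in several short ones.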

Basically, what we need is a load profile for a server. A load profile contains information like:

Load during hour, day, week, month (or any other relevant period for this server)

Expected response times instead of observed response times (basically, a cutoff on the resources)

Current hardware inventory

Current 'scaled' hardware inventory (20% CPU usage means something different on a quad core than on a single core; a scaled inventory takes that into account and enables easy comparison of systems)
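To make the idea of a 'scaled' inventory a bit more concrete, here is a minimal sketch that converts a CPU percentage into an absolute number. Scaling by cores times clock speed (MHz) is my assumption for the illustration; the class, field names and server values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    cores: int
    core_mhz: float      # clock speed per core (assumed identical cores)
    cpu_percent: float   # average utilization across all cores

    def capacity_mhz(self):
        """Total CPU capacity: cores times clock speed (a simplification)."""
        return self.cores * self.core_mhz

    def used_mhz(self):
        """Utilization expressed in MHz, so different systems can be compared."""
        return self.capacity_mhz() * self.cpu_percent / 100.0


# Two hypothetical servers, both reporting 20% CPU usage.
quad = Server("quad-core", cores=4, core_mhz=2000, cpu_percent=20)
single = Server("single-core", cores=1, core_mhz=2000, cpu_percent=20)

for server in (quad, single):
    print(server.name, server.used_mhz(), "MHz in use")
```

Both servers show the same percentage, but the quad core is doing four times the work in absolute terms. A real scaled inventory would also have to account for differences between CPU generations, which this sketch ignores.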

Friday, December 14, 2007

Now that we are into the topic of performance (or capacity) monitoring and planning, let us continue with something that has kept me busy the last couple of days: averages of performance data (and other statistical information) versus peaks.

This goes back to a classic textbook example in statistics, where the mean value of a series of data points is completely irrelevant as a representation of the data points themselves. Let us consider the following example, with the series of data in the table below:

X    Y
A    1
B    2
C    4
D    7
E    100
F    4
G    9
H    7
I    3
J    5

These may, for instance, represent scores (0-100) given to students (by a very strange teacher). In a graph, this is presented below:

The red line represents the average value. It is clear that everyone (except the teacher's little friend with 100 points) is below the average. As a consequence, the average is not a good representation of the data as a whole. Some say it is influenced too much by the extreme values. In fact, this average (the sum of the data values divided by the number of data points, 14.2 here) is called the arithmetic mean. There is another notion of 'an average', called the 'geometric mean'. In the example above, its value would be 5.4, which is much more representative. The median would define the data set even better, but that would lead us too far.

Basically, the fact that the arithmetic mean does not give a good indication of the data set is caused by the large spread of the data. In statistics, there is another indicator for this: the variance, or its square root, the standard deviation. It is a measure of how close together or far apart the values are. In our example above, the standard deviation is 30.2. If the value of E were 10 instead of 100, the standard deviation would become 3. In other words, the lower the standard deviation, the closer together the data points are.
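The numbers quoted for this example can be checked with Python's standard `statistics` module. This is a small verification sketch; it uses the sample standard deviation (`statistics.stdev`), which is the variant that reproduces the 30.2 above, and `geometric_mean` requires Python 3.8 or later.

```python
import statistics

# The scores from the table.
scores = {"A": 1, "B": 2, "C": 4, "D": 7, "E": 100,
          "F": 4, "G": 9, "H": 7, "I": 3, "J": 5}
values = list(scores.values())

arithmetic = statistics.mean(values)
geometric = statistics.geometric_mean(values)   # Python 3.8+
median = statistics.median(values)
spread = statistics.stdev(values)               # sample standard deviation

# Same data, but with E scoring 10 instead of 100.
adjusted = [10 if name == "E" else score for name, score in scores.items()]
adjusted_spread = statistics.stdev(adjusted)

print("arithmetic mean:   ", round(arithmetic, 1))
print("geometric mean:    ", round(geometric, 1))
print("median:            ", median)
print("standard deviation:", round(spread, 1))
print("with E = 10:       ", round(adjusted_spread, 1))
```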

The above brings me back to performance monitoring. Somehow, I want to summarize the performance data of a server by a small set of indicators (averages over time, etc.) that give a reasonable picture of the actual performance, or in other words, that are representative of the system's actual performance. Typically, when looking at the percentage a CPU is used over time, we see fluctuations that are similar to the figure above (don't believe me? Below is an actual example of a system with some high peaks that otherwise sits idle most of the time - this server has 4 cores, by the way). We conclude that it does not make much sense to look at simple averages of the performance counters in order to get an idea of the behavior of the system.

Coming back to VMware Capacity Planner: the tool keeps track of the average value (yes, the simple average that does not say much) and internally also uses the geometric mean, but not the variance (as far as I can tell). From this perspective, the whole performance gathering using this tool would be worthless. Luckily, it also keeps track of peak values (and calculates averages over these peaks, but that is another story). Comparing the peak values with the average tells us a lot about the spread of the data points. The system behind the graph above has an average CPU value of less than 20% while the peak CPU utilization is higher than 90%! This tells us that the variance (spread) of the data points is large.

In a virtualization/consolidation assessment, these cases have to be taken into account, as we do not want our systems to become unresponsive because they have their peaks at the same time. More about this and other topics later...
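Whether peaks coincide is easy to check once the load series are expressed in comparable (scaled) units. The sketch below sums the series of candidate servers point by point and looks at the peak of the total; the server names and numbers are hypothetical, and this is only an illustration of the idea, not how any of the tools mentioned here do it.

```python
def combined_peak(*series):
    """Peak of the summed load if the given servers shared one host.

    All series must cover the same time slots and use the same scaled unit.
    """
    return max(sum(point) for point in zip(*series))


# Hypothetical load per time slot, in the same scaled unit for every server.
mail = [10, 80, 15, 10]
backup = [5, 10, 12, 85]
report = [12, 75, 10, 8]

print("mail + backup:", combined_peak(mail, backup))
print("mail + report:", combined_peak(mail, report))
print("sum of individual peaks (mail, backup):", max(mail) + max(backup))
```

Mail and backup peak at different moments, so their combined peak stays far below the sum of their individual peaks. Mail and report peak in the same slot, which makes them a poor pair to consolidate.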

Everybody who is involved in the monitoring of systems will acknowledge that the most difficult aspects in monitoring a server (or set of servers) are:

Finding the proper indicators for the performance of the system (CPU usage, CPU cycles, memory usage, paging, etc.)

Making sure they are queried regularly, but not so often that the monitoring itself impacts the performance of the system.

Storing the resulting data

Summarizing, creating views, averaging, etc. (this also depends on what you want to know about the system)

Analyzing, interpreting, etc.
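The summarizing step in the list above can be sketched in a few lines. The example groups raw (timestamp, value) samples per hour and keeps both the average and the peak of each hour; the function name, the sample data and the choice of hourly buckets are assumptions made for the illustration.

```python
from collections import defaultdict
from datetime import datetime


def summarize_by_hour(samples):
    """Group (timestamp, value) samples per hour; keep average and peak."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {
        hour: {"average": sum(values) / len(values), "peak": max(values)}
        for hour, values in sorted(buckets.items())
    }


# Hypothetical CPU percentages with their collection times.
samples = [
    (datetime(2007, 12, 14, 9, 5), 10),
    (datetime(2007, 12, 14, 9, 25), 20),
    (datetime(2007, 12, 14, 9, 45), 90),
    (datetime(2007, 12, 14, 10, 5), 30),
    (datetime(2007, 12, 14, 10, 35), 50),
]

for hour, summary in summarize_by_hour(samples).items():
    print(hour, summary)
```

Both hours have the same average here, yet very different peaks, which is why the summary stores the two side by side.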

Did I say the most difficult aspects? Are there any other aspects? Well, not really... capacity monitoring (and planning as a further step) is not an easy task:

Are you aware of the utilization of your systems? Even of your workstations?

Would you have any idea how many of your servers could be placed on a virtualization platform with a specific set of hardware characteristics?

Would you know when your mail server had the hardest time managing mailboxes over the last couple of weeks?

Probably the answer is 'no'. Maybe the answer is 'I don't care'?

Most companies do care, for several reasons: cost, manageability, flexibility, scalability, environment, space, etc.

There are already some players on the market: VMware Capacity Planner (see earlier posts), PlateSpin PowerRecon, Veeam Monitor, etc. I'm mostly used to VMware Capacity Planner (VMCP) but recently, I have also evaluated both PowerRecon and Veeam Monitor. More about this later.

First, a general remark about the update to version 2.6 of the tool: it was not always pleasant to use the web interface last week because of downtime (scheduled maintenance), errors in the interface, latency, etc. It all feels normal again, and some of the new features were definitely worth it.

Back to business: What interests me most when it comes to downloading data from the website is the pure performance data. VMware averages this information over a week, so data can be downloaded on a weekly basis.

I'm not going into the full details of everything that can be tweaked, configured or queried, but most of it will be clear when looking at the actual command line used to get a CSV file. The most important thing to understand is that some of the choices you make are stored in the session that is open on the server, and are not selected using POST or GET. Some of these options can be configured on the HTML view, but not on the export view (although the syntax is available), so we have to apply a trick: first get the HTML view for the type of data we want, and then export the data.

Below, you find a typical command-line to download the core VMware statistics for week 45 of a company with ID 1234: