Thursday, March 26, 2009

Abstract - As described by Walter Heck, MySQL database size can be visualized using Google Charts. With a minor code improvement the URL for the chart can be obtained twice as fast. With some more modification, the number of lines can be cut down resulting in a function that is half as long.

A month ago or so I read a couple of posts about using the Google Chart API to visualize database size. Although I personally would not consider using Google Charts (in its current form) for serious monitoring applications, I am quite charmed by its ease of use and availability. Time to give it a try myself.

Although I applaud both Walter and Ruturaj's efforts in porting this code, I think their code can be improved still. In this short article I'd like to illustrate how a minor modification can double performance.

Code Analysis

Let's take a brief moment to analyze Walter's original code. I will cite a few fragments of his code. (I have made some simplifications and removed some distractions to make it as easy as possible to understand the logic of the code. If you find an error in my citation, please consider the possibility that it is my doing, not Walter's).

His program takes the form of a MySQL stored function. It accepts a few parameters to configure the chart (chart type and size) and outputs a fully functional URL which can be used to retrieve a .png image that shows the size (in MB) of all databases managed by the MySQL server. This is the function signature:
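Walter's original listing is not reproduced here, but going by the description above, the signature would look something along these lines (the exact names and parameters are my guess, not his literal code):

```sql
CREATE FUNCTION db_size_chart_url(
  p_chart_type   CHAR(1)  -- e.g. 'p' for pie, 'b' for bar
, p_chart_width  INT
, p_chart_height INT
) RETURNS VARCHAR(3000)
```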

The stored function is implemented by retrieving size metadata from the information_schema.TABLES table. The size of each table's data and indexes are then added and summed per database. The results are traversed using a cursor loop.
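A cursor declaration along these lines would implement that logic (again, a sketch to illustrate the approach rather than Walter's literal code):

```sql
DECLARE csr_db_size CURSOR FOR
SELECT table_schema
,      SUM(data_length + index_length)/1024/1024 AS data_length_sum
FROM   information_schema.TABLES
GROUP  BY table_schema;
```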

Note that the figure is immediately represented as megabytes by dividing twice by 1024. (In other words, divide by 1024 to make kilobytes of the raw bytes, then divide by 1024 again to make megabytes out of kilobytes.)

During iteration, a list of database names and a list of database sizes are built through string concatenation. These are required to supply the data series and labels to the chart. The sizes are not appended as-is: each one is first expressed as a percentage of the total size of all databases.

About Google Chart Data Formats

We just mentioned that the data points are not simply concatenated - rather, data size per database is expressed as a percentage of the total size of all databases. This is not just some arbitrary choice - it is in fact required by the Google Chart API. Here's a quote from the Overview of data formats:

Before you can create a chart you must encode your data into a form that is understood by the Chart API. Use one of the following formats:

Text encoding uses a string of positive floating point numbers from zero to one hundred.

...

For this reason, the code divides everything by v_data_length_total and multiplies by 100. In our code analysis, we did not discuss v_data_length_total because it was not really essential for the logic of the program. This is how it is computed:
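In sketch form (the variable name follows the article; the literal statement may differ):

```sql
SELECT SUM(data_length + index_length)/1024/1024
INTO   v_data_length_total
FROM   information_schema.TABLES;
```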

Note that this query is very similar to the query used in the cursor declaration. In both cases, the information_schema.TABLES table is queried to calculate the combined size of table data and indexes. The main difference is grouping: in the cursor declaration a GROUP BY table_schema was used to calculate the size per database. In this case, grouping is absent and the size is calculated across databases, i.e. the combined size of all tables and indexes in the entire server.

Text encoding with Data Scaling

Now, as it turns out, the text encoding used by this code comes in two flavours:

Text encoding with data scaling uses a string of positive and negative floating point numbers in combination with a scaling parameter.

So, as an alternative to 'pre-scaling' the values in the data series to a 100-point scale, we can also opt to let Google do the work, provided we pass this scale factor.

Because the code to calculate the scale factor is already present, we can simply remove the scaling computation inside the loop, and add the scale factor to the URL like so:
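With the Google Chart API, the scale is passed through the chds parameter, so the generated URL would gain a fragment like this (the numbers here are made up for illustration):

```
&chd=t:120.5,88.2,310.7&chds=0,519.4
```

Here chds=0,519.4 tells Google that the raw values run from zero up to the total size, so no pre-scaling to a 100-point range is needed on our side.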

An alternative method to calculate the scaling factor

The problem with the current solution is that it has to do a separate query to obtain the total size. We already noticed that the two queries are quite similar, save for the grouping. Because we are already looping through all results, we can just as well calculate the total ourselves. This would allow us to avoid the SELECT...INTO statement altogether.

We would need to add proper initialization for the v_data_length_total variable and add the following line inside the loop:

SET v_data_length_total := v_data_length_total + v_data_length_sum;

When comparing performance before and after this change, we can observe a dramatic difference. On my laptop, the original function takes anywhere between 45 and 50 seconds. After incorporating these changes, the time is cut in half and the function takes 'only' 22 to 24 seconds.

The fact that the second solution is twice as fast (~23 seconds instead of ~46) is not a coincidence: it's because we're doing half the number of queries (1 instead of 2). Don't get me wrong - this is still very slow. But this all comes down to poor information_schema performance. In this particular case, there is not much we can do to improve the code further to gain performance.

More performance improvements?

I am very much convinced the bottleneck in the database size chart function is the query on the information_schema.TABLES table. To be more exact, it's the fact that we're accessing the DATA_LENGTH and INDEX_LENGTH columns. Just watch this:
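Compare these two queries on a server with a reasonable number of tables (exact timings will vary, but the pattern should hold):

```sql
-- fast: no size columns touched
SELECT COUNT(*)
FROM   information_schema.TABLES;

-- slow: forces the server to compute DATA_LENGTH and INDEX_LENGTH
SELECT SUM(data_length + index_length)
FROM   information_schema.TABLES;
```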

As you can see, simply accessing DATA_LENGTH and/or INDEX_LENGTH causes the query to be slow.

Notice also that the time spent to execute this stand-alone query is about the same as it takes for the improved function to complete. Basically, this tells us the function spends all its time performing the query - timewise, the contribution of the remainder of the function, such as cursor traversal, creating the value lists and building the URL is simply negligible.

This means that we simply can't improve the performance of the database size chart function unless we can improve the performance of this query. Because we have no other general way of obtaining index and data size, this is the end of the line.

Other Improvements

Even though we probably can't improve the performance of Walter's function anymore, I still think it's possible to improve the code.

MySQL Cursors are also pretty slow. This is something you really start to notice when traversing tens of thousands of rows. Not that that really matters for building Google Chart URLs: Most likely, you will hit a limitation in either chart size or URL length, so you can't really gain a lot of performance by eliminating cursors for this particular purpose.

Still, I think that eliminating the cursor will help to make the code less complex, so let's try anyway.

Building Lists: GROUP_CONCAT()

We just explained how the cursor was used to build various lists of values. As it happens, MySQL supports the GROUP_CONCAT() function, which was designed especially for that purpose. GROUP_CONCAT() is an aggregate function, just like COUNT(), MIN() and MAX().

Like other aggregate functions, GROUP_CONCAT() can produce a single summary result on a group of rows. It does so by first concatenating its arguments for each row in the group, and then concatenating the per-row result for the entire group of rows, optionally separating row results using some separator.
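For example, this builds a pipe-separated list of table names per database:

```sql
SELECT table_schema
,      GROUP_CONCAT(table_name SEPARATOR '|') AS table_list
FROM   information_schema.TABLES
GROUP  BY table_schema;
```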

Now, there is an important limitation with this function that causes many people to avoid GROUP_CONCAT altogether. Here's the relevant text from the manual:

The result is truncated to the maximum length that is given by the group_concat_max_len system variable, which has a default value of 1024.

Basically, this says that if you don't take proper precautions, the lists you generate with GROUP_CONCAT may be truncated. This is obviously bad news! However, there is a very simple workaround, which is also hinted at in the documentation:

The value can be set higher, although the effective maximum length of the return value is constrained by the value of max_allowed_packet.

So, the workaround is simple: assigning the value of max_allowed_packet to group_concat_max_len allows the longest possible list of values. You may argue that this might still not be long enough. However, that is a moot point. Just read up on max_allowed_packet:

The maximum size of one packet or any generated/intermediate string.

In other words, no MySQL string can ever exceed max_allowed_packet in length - the same limit holds for any other method of concatenating strings within MySQL, including cursor loops.
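So the precaution boils down to a pair of statements around the GROUP_CONCAT query:

```sql
-- remember the old value so it can be restored afterwards
SET @old_group_concat_max_len := @@group_concat_max_len;
SET @@group_concat_max_len    := @@max_allowed_packet;

-- ...run the GROUP_CONCAT query here...

SET @@group_concat_max_len := @old_group_concat_max_len;
```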

A cursor-less database size chart function

Without further ado, this is how I would write the database size chart function:
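The original listing is not reproduced here; the following sketch matches the description below (the identifier names are my own):

```sql
CREATE FUNCTION db_size_chart_url(
  p_chart_type   ENUM('bhs','p')  -- horizontal stacked bar, or pie
, p_chart_width  INT
, p_chart_height INT
) RETURNS LONGTEXT
READS SQL DATA
BEGIN
  DECLARE v_old_len BIGINT UNSIGNED DEFAULT @@group_concat_max_len;
  DECLARE v_sizes   LONGTEXT;
  DECLARE v_labels  LONGTEXT;
  DECLARE v_max     DOUBLE;

  -- work around GROUP_CONCAT truncation
  SET @@group_concat_max_len := @@max_allowed_packet;

  SELECT GROUP_CONCAT(data_length_sum SEPARATOR ',')
  ,      GROUP_CONCAT(table_schema    SEPARATOR '|')
  ,      MAX(data_length_sum)
  INTO   v_sizes, v_labels, v_max
  FROM   (
           SELECT table_schema
           ,      SUM(data_length + index_length)/1024/1024 AS data_length_sum
           FROM   information_schema.TABLES
           WHERE  table_type = 'BASE TABLE'  -- only real tables
           GROUP  BY table_schema
         ) AS db_sizes;

  -- restore the original setting
  SET @@group_concat_max_len := v_old_len;

  RETURN CONCAT(
    'http://chart.apis.google.com/chart'
  , '?cht=',    p_chart_type
  , '&chs=',    p_chart_width, 'x', p_chart_height
  , '&chd=t:',  v_sizes
  , '&chds=0,', v_max
  , '&chl=',    v_labels
  );
END;
```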

Here's a quick summary that points out some differences with regard to the original code:

Instead of accepting CHAR(1) for the chart type, an ENUM('bhs','p') is used to restrict the value to a listed type.

Instead of returning a VARCHAR(3000), this function returns LONGTEXT, effectively leaving it to the Google Chart API to report a URL length limitation.

Instead of a cursor, we use a single SELECT...INTO statement.

The query in the SELECT...INTO statement uses a subquery in the FROM clause which is functionally equivalent to the actual cursor in the original code. A small but important improvement is the addition of a condition that restricts the result to base tables only. This automatically excludes the information_schema and databases that contain only views.

The outer query is functionally equivalent to the cursor loop and uses one GROUP_CONCAT() expression to obtain a list of values, and one GROUP_CONCAT() expression to obtain a list of labels. Here, we also calculate the scaling factor by taking the MAX of the database size.

To work around the truncation problem with GROUP_CONCAT, we set group_concat_max_len to the maximum practical value. We don't just set group_concat_max_len and leave it at that: we restore its original value at the end of the stored function.


About Me

I'm Roland Bouman (@rolandbouman on twitter). I'm a software (web) application developer and I work on both the front end as well as the back end. I do data modeling, database design, ETL, Analytics, and Business Intelligence.