The changelog includes this nugget, which I’ll expand upon here:
The Performance Schema now maintains statement digest information. This normalizes and groups statements with the same “signature” and permits questions to be answered about the types of statements the server is executing and how often they occur.

This is essentially the Query Analysis view from MySQL Enterprise Monitor (MEM). Query Analysis was first released at the end of 2008, and was conceived (in 2006 actually, presented and much discussed on a snowy day at a MEM team meeting in Amsterdam) from the need to easily monitor which statements are running within a database instance.

Yes, there were the Slow Query Log and General Query Log, but both had their drawbacks when we evaluated this: the one is not fine grained enough, the other gives too little information with too much overhead, and both require OS-level access to read the log files – not always a problem, but certainly a problem in some cases. At the time, the only tool that did any kind of analysis and statistics aggregation on the queries MySQL had been executing was mysqldumpslow.

And so MySQL Proxy was born, and hooked up with MEM to collect statistics on the queries funneled through it. It was revolutionary (for the MySQL world, at least). Hot on its heels we saw new GUI tools from the community trying to do the same thing. We also saw extensions to the old ways of doing things, with extended slow query logging, and command line scripts such as mk-query-digest being born (and re-born). We then extended Query Analysis into the Connectors as well, to share the load of statistics gathering across application servers rather than funnel everything through MySQL Proxy (among other benefits).

These are all still great solutions today. Yet they all have one major downfall: they work around the database instance itself … actually generating this data and making it available with a SQL interface. That’s what databases are supposed to be good at … right?

Strikingly similar to MEM’s Query Analysis view, no? And it’s being returned by the database directly. It gets better.

One of the problems with trying to gather statistics on statement traffic from within the protocol stream (as done by Proxy, the Connectors, any tcpdump solution, etc.) is that certain statistics are simply not exposed within the MySQL protocol. Things like the number of rows examined, temporary table usage, select type and so on, which can be seen in SHOW SESSION STATUS for example, are practically impossible to correlate (we would have to inject that into each connection to wrap each “real” statement).

The Slow Query Log does not have this problem – and the community extensions to it have proved that it is useful to be able to aggregate more at the statement level too.

Statement Digests most certainly do not have this problem. Here’s the current structure of the new events_statements_summary_by_digest table:
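For reference, here is a sketch of the definition as it stood in the early 5.6 line (approximate; later releases add columns such as SCHEMA_NAME, so run SHOW CREATE TABLE against your own version to see exactly what you have):

CREATE TABLE events_statements_summary_by_digest (
  DIGEST varchar(32) DEFAULT NULL,
  DIGEST_TEXT longtext,
  COUNT_STAR bigint unsigned NOT NULL,
  SUM_TIMER_WAIT bigint unsigned NOT NULL,
  MIN_TIMER_WAIT bigint unsigned NOT NULL,
  AVG_TIMER_WAIT bigint unsigned NOT NULL,
  MAX_TIMER_WAIT bigint unsigned NOT NULL,
  SUM_LOCK_TIME bigint unsigned NOT NULL,
  SUM_ERRORS bigint unsigned NOT NULL,
  SUM_WARNINGS bigint unsigned NOT NULL,
  SUM_ROWS_AFFECTED bigint unsigned NOT NULL,
  SUM_ROWS_SENT bigint unsigned NOT NULL,
  SUM_ROWS_EXAMINED bigint unsigned NOT NULL,
  SUM_CREATED_TMP_DISK_TABLES bigint unsigned NOT NULL,
  SUM_CREATED_TMP_TABLES bigint unsigned NOT NULL,
  SUM_SELECT_FULL_JOIN bigint unsigned NOT NULL,
  SUM_SELECT_FULL_RANGE_JOIN bigint unsigned NOT NULL,
  SUM_SELECT_RANGE bigint unsigned NOT NULL,
  SUM_SELECT_RANGE_CHECK bigint unsigned NOT NULL,
  SUM_SELECT_SCAN bigint unsigned NOT NULL,
  SUM_SORT_MERGE_PASSES bigint unsigned NOT NULL,
  SUM_SORT_RANGE bigint unsigned NOT NULL,
  SUM_SORT_ROWS bigint unsigned NOT NULL,
  SUM_SORT_SCAN bigint unsigned NOT NULL,
  SUM_NO_INDEX_USED bigint unsigned NOT NULL,
  SUM_NO_GOOD_INDEX_USED bigint unsigned NOT NULL,
  FIRST_SEEN timestamp NOT NULL,
  LAST_SEEN timestamp NOT NULL
) ENGINE=PERFORMANCE_SCHEMA;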

Before diving into some of the use cases, it is first worth describing exactly what a digest is.

As with Query Analysis in MEM, we track and aggregate statistics by normalizing the statements. Performance Schema does this slightly differently from the MEM components, but both follow roughly the same process, working on a tokenized representation of the statement.

Performance Schema does this by tying into the lexer within the MySQL server, which emits the tokenized statement for the parser; it records that token output stream and performs some normalization of its own on top.

(“Long” here is currently defined as 1024 bytes, set by PSI_MAX_DIGEST_STORAGE_SIZE in psi.h; statements whose normalized form exceeds this have their digests truncated. Unfortunately this is not configurable yet. It should also be noted that the limit applies to the normalized statement, so large string values etc. are not a concern here.)

MEM does not fold lists of values down; i.e. “INSERT INTO foobar VALUES (1,2,3,4,5)” is normalized to “INSERT INTO foobar VALUES (?,?,?,?,?)” within MEM. Performance Schema, however, does fold them, to save memory: it discards the value tokens along the way and reduces such a list to a single placeholder, so the statement above digests to something like “INSERT INTO foobar VALUES (...)”.

With that description out of the way, let’s now take a look at some of the value that you can get from the data!

First, let’s get the obvious out of the way:

A high level overview of the statements, Query Analysis style, sorted by the queries with the highest latency.
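A minimal sketch of such a query (the column aliases, the full scan indicator and the picosecond-to-time conversions here are my own choices):

SELECT DIGEST_TEXT AS query,
       IF(SUM_NO_INDEX_USED > 0, '*', '') AS full_scan,
       COUNT_STAR AS exec_count,
       SEC_TO_TIME(SUM_TIMER_WAIT/1000000000000) AS total_latency,
       SEC_TO_TIME(AVG_TIMER_WAIT/1000000000000) AS avg_latency,
       SUM_ROWS_SENT AS rows_sent,
       SUM_ROWS_EXAMINED AS rows_examined
  FROM performance_schema.events_statements_summary_by_digest
 ORDER BY SUM_TIMER_WAIT DESC
 LIMIT 10;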

My test data is a little bland (this is a MEM schema monitoring a single MySQL instance), but you get the idea. You can instantly get a picture of your top statements, ordered in this case by latency, along with some extra statistics such as whether they caused a full scan, their average latency, and how many rows were both scanned and returned.

But you could slice and dice this data any way you want – it’s just SQL.

Here are a few more use cases:

List all normalized statements that use temporary tables, ordered first by the number of on-disk temporary tables descending, then by the number of in-memory temporary tables.
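A sketch of one way to write that (the aliases are my own):

SELECT DIGEST_TEXT AS query,
       COUNT_STAR AS exec_count,
       SUM_CREATED_TMP_TABLES AS tmp_tables,
       SUM_CREATED_TMP_DISK_TABLES AS tmp_disk_tables
  FROM performance_schema.events_statements_summary_by_digest
 WHERE SUM_CREATED_TMP_TABLES > 0
 ORDER BY SUM_CREATED_TMP_DISK_TABLES DESC, SUM_CREATED_TMP_TABLES DESC
 LIMIT 10;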

All of these show off the flexibility of having this instrumentation directly within the server, accessible via the server itself. If you haven’t done so already, now is the time to take a little breath, and then shout Hurrah.

Now, it’s worth noting that all of this is stored in memory. By default, the maximum number of rows in this table in 5.6.5 is 200, which effectively lets you monitor 199 normalized statements, as one row is reserved for a NULL digest. That row is a catch-all: once this many normalized statements have been generated, Performance Schema aggregates the statistics for all other statement traffic, whose digests do not match those already in the table, within the NULL row.
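To see how much traffic has fallen into that catch-all row, a quick sketch of a check (the aliases and time conversion are my own):

SELECT COUNT_STAR AS exec_count,
       SEC_TO_TIME(SUM_TIMER_WAIT/1000000000000) AS total_latency
  FROM performance_schema.events_statements_summary_by_digest
 WHERE DIGEST IS NULL;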

In the MEM team we’ve found in practice that people often have many more than this – often 500+ different kinds of statements being executed. You can increase the number of digests that are tracked with the performance_schema_digests_size system variable. You can track how much memory this uses with the SHOW ENGINE PERFORMANCE_SCHEMA STATUS command (here I have the number of digests to track set to 400, which takes roughly 490KB of memory):
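The command itself, with an illustrative excerpt of the relevant rows (the row names are as I recall the 5.6 output; the exact values shown are illustrative and will vary with your settings):

SHOW ENGINE PERFORMANCE_SCHEMA STATUS;
-- Among the output, look for rows such as (illustrative):
--   events_statements_summary_by_digest.row_size    1232
--   events_statements_summary_by_digest.row_count   400
--   events_statements_summary_by_digest.memory      492800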

Any scripts or tools that are used to analyze this table will need to take this into account.

We’re just scratching the surface of the possibilities with this table at the moment. There are a number of things I’d like to see added (such as a summary of each of the major classes of lower-level wait events, if enabled – “SUM_FILE_IO_WAIT”, “SUM_SOCKET_IO_WAIT”, “SUM_MUTEX_WAIT” etc.), and I’m sure many of you out there in the community will have ideas of your own as well.

Welcome to the future of statement diagnostics within MySQL – we hope you enjoy it! Tell us what you’d like to see next!

9 thoughts on “MySQL Performance Schema Statement Digests”

This (“I have the number of digests to track set to 400, which takes roughly 490KB of memory”) is actually less important, IMO, than the question “How will setting a high number of digests impact overall server performance?” (We can all agree, of course, that a few MBs of memory usage is nothing to worry about today – we buy memory because we want to use it for good purposes!)

I realize that answers to the question about performance can be ambiguous. It depends on what the server is doing, how busy it is, the configuration and so on, and it is probably possible to define a test that gets whatever results you want. But some benchmarks with a ‘de facto standard’ performance benchmark tool (sysbench?) would still be useful.

Increasing the number of digests to track should not really affect throughput on the instance itself. Performance Schema is non-locking as far as client access to the statistics is concerned (we don’t block other application threads, etc.).

That’s not to say this instrumentation is without overhead – *all* instrumentation comes at a cost. But it should be insignificant compared to the overall latency of a statement itself (for this particular set of instrumentation alone).

For other general overhead benchmarks, my colleague Dimitri Kravtchuk has been doing some benchmarking work as well.

I was just experimenting with the `digest` summary table, and I loved the output 🙂

It would be great if we could get the information in these columns in a human readable form: `SUM_TIMER_WAIT`, `MIN_TIMER_WAIT`, `AVG_TIMER_WAIT`, `MAX_TIMER_WAIT`, `SUM_LOCK_TIME`.
Actually, I have not understood what those values indicate. What form are they in?

If it could tell me, in simple terms for a query which has been executed 3 times, that the SUM of query execution time is 30 secs, Min=7 secs, Avg=11 secs etc., rather than a huge raw number, that would be great.

All times in Performance Schema are displayed in picosecond granularity. The statements above show how you can convert these to readable values, such as:

SEC_TO_TIME(SUM_TIMER_WAIT/1000000000000)

The thing is, in a lot of cases you want to format each of those differently… The SUM of execution time may be large, so you’ll want to display it as above. But an AVG execution time may be only milliseconds, in which case you perhaps want to display just the milliseconds it took.

The beauty of Performance Schema is that it doesn’t make any assumptions about how you want to read the data; it just gives it to you in the rawest, most granular form, so that you can decide how you would like to view it.
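For example, here is a sketch that formats the total as a time value but shows the average in raw milliseconds (the conversion factors are the only fixed part; pick whatever units suit you):

SELECT DIGEST_TEXT AS query,
       SEC_TO_TIME(SUM_TIMER_WAIT/1000000000000) AS total_latency,
       ROUND(AVG_TIMER_WAIT/1000000000, 2) AS avg_latency_ms
  FROM performance_schema.events_statements_summary_by_digest
 ORDER BY SUM_TIMER_WAIT DESC
 LIMIT 5;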

Excellent page, bookmarked it.
I ran the query below and got the SQLs which are not using indexes, but these SQLs are incomplete – I want the FULL TEXT of the SQLs.
How do I get it?
SELECT IF(LENGTH(DIGEST_TEXT) > 64, CONCAT(LEFT(DIGEST_TEXT, 30), ' ... ', RIGHT(DIGEST_TEXT, 30)), DIGEST_TEXT) AS query,
COUNT_STAR AS exec_count,
SUM_NO_INDEX_USED AS no_index_used_count,
SUM_NO_GOOD_INDEX_USED AS no_good_index_used_count,
ROUND((SUM_NO_INDEX_USED / COUNT_STAR) * 100) no_index_used_pct,
DIGEST AS digest
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_NO_INDEX_USED > 0
OR SUM_NO_GOOD_INDEX_USED > 0
ORDER BY no_index_used_pct DESC, exec_count DESC LIMIT 5;

Indeed, this table only holds the normalized forms of the statements. If you want the raw statements, you need to try and capture/profile them from the events_statements_* tables, which hold raw, undigested statement statistics (they all carry the MD5 hash DIGEST column to join with).
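A sketch of that approach, using the 5.6 events_statements_history_long table (the digest value itself is whatever you copy from the summary table):

-- The long history consumer is disabled by default, so enable it first
UPDATE performance_schema.setup_consumers
   SET ENABLED = 'YES'
 WHERE NAME = 'events_statements_history_long';

-- Then pull raw statement text for a given digest as traffic flows through
SELECT THREAD_ID, SQL_TEXT
  FROM performance_schema.events_statements_history_long
 WHERE DIGEST = '<digest value from the summary table>'
 LIMIT 10;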