The Glass is twice as large as it needs to be

MariaDB 5.3/5.5 has introduced a new join type “Hash Joins” which is an implementation of a Classic Block-based Hash Join Algorithm. In this post we will see what the Hash Join is, how it works and for what types of queries would it be the right choice. I will show the results of executing benchmarks for different queries and explain the results so that you have a better understanding of when using the Hash Join will be best and when not. Although Hash Joins are available since MariaDB 5.3, but I will be running my benchmarks on the newer MariaDB 5.5.

Overview

Hash Join is a new algorithm introduced in MariaDB 5.3/5.5 that can be used for joining tables that have a equijoin conditions of the form tbl1.col1 = tbl2.col1, etc. As I mentioned above that what is actually implemented is the Classic Hash Join. But its known as Block Nested Loop Hash (BNLH) Join in MariaDB.
The Classic Hash Join Algorithm consists of two phases, a build phase and a probe phase. Let’s consider the case of joining two tables on a equijoin condition. So the first thing would be to designate the smallest of the two tables as the left operand and the other table which is bigger, to be the right operand. Now when the algorithm begins, the first phase is the build phase, in which a hash table is created over the join attributes and rows of the left operand. Next comes the probe phase, which is where the matching rows from the right operand are found, by scanning the right operand and for each row scanned performing a lookup in the hash table by using values of the columns participating in the equijoin condition. The hash table is accessed by using a hash function on the values of the join condition, and hence is quite efficient. But what about the restriction on the size of the hash table. The size of the hash table is restricted by the value of join_buffer_size, and so if the left operand is big such that the size of the hash table built on it is greater than the join_buffer_size, then multiple hash tables would be created. For example if the left operand has “n” rows, and its size is three times the value of join_buffer_size, then 3 hash tables would need to be created each containing a hash table on n/3 rows. And so both the build and probe phase would be done three times, which means that the right operand will be scanned thrice.

Do a final scan of the probe input S and add the resulting join tuples to the output relation

Now after the explanation of the hash join lets see how it performs for different test cases.

Benchmarks

For the purpose of the benchmarks I used the DBT3 dataset of scale factor 2, which means the total dataset size is 4.8G. Let me show the breakdown of dataset size by the tables that I have used in the benchmarks:
Table ‘lineitem’: 3.8G
Table ‘supplier’: 11M
Table ‘orders’: 468M

I have benchmarked two different kinds of workloads, IO bound and in-memory. Benchmark on IO bound workload was performed with a buffer pool size of 1G, while benchmark on in-memory workload was performed with a buffer pool size of 6G. The benchmarks compare Block Nested Loop (BNL) Join of MySQL 5.5.24, Batched Key Access (BKA) Join of MySQL 5.6.5 and Block Nested Loop Hash (BNLH) Join of MariaDB 5.5.20. The configuration used with the three variants of MySQL are listed below.

Configuration

Let’s first take a look at the configuration used with different MySQL flavors.

Note that MariaDB includes a new variable ‘join_cache_level‘, this variable controls which Join Algorithms are allowed to be used, a value of 4 here means that Nested Loop Join and Hash Join algorithms are allowed. Now as well know that ‘join_buffer_size‘ controls the size of the join buffer allocated for each join in a query, MariaDB introduces another variable to control the size of the buffer ‘join_buffer_space_limit‘. This variable controls the maximum allowed size of the buffer for the whole query. By default it has a value of 1024*128*10, which means that your effective join_buffer_size is not bigger than this value. Hence, the reason I have set join_buffer_space_limit=32M.

Benchmark Machine Specs

The machine that I used for the benchmarks, is a dual core machine with the following CPU configuration: 2xIntel(R) Core(TM)2 CPU 6600 @ 2.40GHz. The amount of memory installed is 8G and the MySQL datadir is on a 4-disk Software RAID5 volume, the disks are 5.4K RPM disks. The filesystem used is XFS, and the OS installed is Centos 6.2

Table Structure

Before moving on, let’s take a look at the structure of the tables involved in the benchmark tests.

First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on IO bound workload so that it could fit well in the chart, in actuality the query took 32077 seconds to finish in the IO bound workload. Anyhow we can clearly see from the above chart that Hash join comprehensively beats BKA and BNL, hash join is perfect in these cases where you are joining a small table with a very large table with no ‘indexed where’ conditions on the big table. BNLH takes half the time to complete the query for in-memory workload and 6.6x less times as compared to BKA MySQL 5.6, and 965x less time as compared to BNL MySQL 5.5. So hash join gives us an improvement by a very large factor both for IO bound workload and in-memory workload.

Test Case B – Join a small table that fits in memory to a large table with a selective WHERE clause on an indexed column

The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:

First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on IO bound workload so that it could fit well in the chart, in actuality the query took 1280 seconds to finish in the IO bound workload. In this test Hash join is not ideal because you have a highly selective where clause that reduces the size of the joining data set. And hash join performs even badly and takes 7x more time for in-memory workload in this test case. While for IO bound workload, hash join takes 53x less time to execute the query as compared to MySQL 5.5 but takes slightly more time as compared to BKA algorithm of MySQL 5.6.

Test Case C – Join a small table with a large table with a WHERE clause on a non-indexed column

The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:

First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on IO bound workload so that it could fit well in the chart, in actuality the query took 31654 seconds to finish in the IO bound workload. Again here hash join beats BKA and BNL comprehensively. Hash join outperforms the other join types when you are joining a small table with a very large table with a where clause on a ‘non-indexed’ column In this test we can clearly see that Hash Join gives a lot of reduction in query time. The reduction in query time for IO bound workload is 1266x times when compared to MySQL 5.5 and 9x times when compared to MySQL 5.6. While for in-memory workload the reduction is query time is 3.5x when compared to both MySQL 5.5 and MySQL 5.6.

Another interesting thing to note is that for both Test B and Test C, Hash Join takes similar amount of time both for IO bound workload and in-memory workload. Why, because Hash Join implies scanning the table lineitem (right operand) in both test cases. Since in Test B we have a limited set of rows in the supplier table (left operand) to join to the lineitem table (right operand) so scanning the lineitem table (BNLH) proves to be costly as compared to doing batched index lookups (BKA). However, in Test C the cost of hash join remains the same but the cost of BKA increases, as there are going to be a lot more random index lookups needed to be performed because of the increase in the number of rows needed to be joined from supplier table (left operand).

Test Case D – Join a large data set (>1M rows) from one table with a large table

The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:

Here we can clearly see that MySQL 5.5 beats both BKA of MySQL 5.6 and Hash Join of MariaDB 5.5. In IO bound test MySQL 5.5 takes 2.5x less time to complete the query as compared to MySQL 5.6 BKA algorithm, and takes 6.6x less time as compared to MariaDB 5.5 Hash Join, Hash Join also performs worse as compared to BKA and takes 2.5x more time. While for in-memory workload test, MySQL 5.5 takes 5x less time as compared to MySQL 5.6 and 13x less time as compared to MariaDB 5.5 Hash Join. First thing to note is that the above query would be reading 1/3 the number of rows in the table orders (left operand), and so MySQL 5.5 prefers to do a PRIMARY index scan of the table orders resulting in sequential IO, while both MySQL 5.6 and MariaDB 5.5 prefer to do an index range scan on the secondary key o_orderdate which results in random scans of the PK to fetch the columns that are not part of the secondary key. Even though MySQL 5.6 uses MRR to offset the effect of random access of PK, even then it proves to be costly. Also note that the table lineitem, is joined by the column l_orderkey which is the left-most PK column, so reading the table orders in PK order has another benefit that it implies reading the table lineitem in PK order. Hence, these benefits mean MySQL 5.5 wins. But why does Hash Join take so much more time. The reason is that the rows needed to be read from the left operand which is the table orders are far greater than the size of the join buffer. The size of the join buffer is 32M, while the size of the left operand is 186M which means roughly 6 scans of the right operand which is the table lineitem. Hence the reason why hash join is slow in this case, because we have to refill the join buffer with rows from orders table many times, and hash join is not that good if you need many scans of the right operand (in this case the table lineitem).

Another difference with the query in this test case is that, while with the queries in previous test cases, the joining key from the table supplier would match approximately 600 rows from the table lineitem for each distinct key value, in this test case D, the joining key from the table orders would match approximately 5 rows from the table lineitem for each distinct key value. Also the joining key in this test case D is PK in one table and left-most part of the PK in the second table.

How does optimizer work with the different Join Algorithms available?

Currently, the part of the optimizer that is responsible for choosing the join algorithm for a particular query and QEP is not advanced enough and there is work to be done yet. As I understand it MariaDB folks are working on the cost-based choice for any joins. It’s not easy because the current costing model is primitive and must be enhanced to support the possibility of existence of different join algorithms. So what does that mean to MariaDB/MySQL users right now with the state of the current optimizer. Right now you would have to manually enable and disable the join algorithms for the optimizer to choose from.
In MariaDB, every algorithm has a number given to it:1 – flat BNL
2 – incremental BNL
3 – flat BNLH
4 – incremental BNLH
5 – flat BKA
6 – incremental BKA
7 – flat BKAH
8 – incremental BKAH

The variable join_cache_level controls which algorithms are enabled. If join_cache_level=4 all algorithms numbered 1 to 4 are enabled, if join_cache_level=8, all algorithms numbered 1 to 8 are enabled. Optimizer is naive in the sense that it always uses the max values join algorithm. If join_cache_level=4 it always uses BNLH (hash join), if join_cache_level=8 it always uses BKAH (a variant of BKA). Optimizer does not try to check which algorithm is the best one to use, it just assumes that the algorithm with the highest numeric value is the best one.
So we can force the join algorithm used by setting appropriate values of “join_cache_level”. For example in my test I forced the optimizer to use hash join by setting join_cache_level=4. We can set certain rules for which certain join algorithms are best and then use that algorithm by making use of the variable “join_cache_level”.

Conclusion

Based on the above information and the benchmark results for different test cases, we can see where Hash Joins work best and where they don’t. First of all Hash joins only work for equijoins. Hash join work best when you are joining very big tables with no WHERE clause, or a WHERE clause on a non-indexed column. They also provide big improvement in query response time when you are joining tables with no indexes on the join condition (Full Join). The best performance with Hash Join can be achieved when the left table can fit completely in the join buffer, or when the least amount of buffer refills are needed, as each buffer refill means a scan of the right-side table. However, Hash joins do not outperform BNL or BKA when you are joining a really small subset of rows, as then scanning the right-side table becomes costly in comparison. Block Nested Loop Join would perform better than Hash Join when you are joining two tables on a PK column such that both tables are read in PK order. One use case that I can think of for hash joins is data warehouse applications that need to run reporting queries that need to join on lookup tables which tend to be small mostly. What use cases can you think of?

Have you wanted to compress only certain types of columns in a table while leaving other columns uncompressed? While working on a customer case this week I saw an interesting problem where a table had many heavily utilized TEXT fields with some read queries exceeding 500MB (!!), and stored in a 100GB table. In this case we were not allowed to make any query or application logic changes so we chose to implement the Barracuda file format and utilize compressed rows as this appealed to me for this mostly-read application. One quick way you can see if your rows will benefit from compression would be to read Peter Zaitsev’s blog post and execute:

In our case we saw about a 75% reduction when the TEXT field was compressed which we felt indicated there would be a benefit derived from table compression.

With the original InnoDB Antelope file format you have the choice of ROW_FORMAT=COMPACT and ROW_FORMAT=REDUNDANT where InnoDB stored the first 768 bytes of variable length columns (BLOB, VARCHAR, TEXT) in the index record, and the remainder stored in overflow pages. COMPACT became the default after MySQL 5.0.3 and has a more compact representation for nulls and variable-length fields than REDUNDANT.

Using InnoDB’s new Barracuda file format (available since InnoDB plugin 1.1 or MySQL 5.5) you can now leverage table compression by specifying ROW_FORMAT=COMPRESSED. In our case we only wanted MySQL to try to move the larger (greater than 16KB) TEXT fields off-page so we utilized the KEY_BLOCK_SIZE=16 directive. This means that each TEXT / BLOB field that exceeds 16KB it would be stored in it’s own page (less the 20 byte pointer stored in the index page). Based on our analysis 75% of the blobs stored in the table were over 8KB, which were responsible for 90% of space usage hence compressing only externally stored blobs provided substaintial advantages. Why did we choose a KEY_BLOCK_SIZE that is the same value of the InnoDB page size of 16KB? As the fine MySQL manual states:

This setting may still be useful for tables with many long BLOB, VARCHAR or TEXT columns, because such values often do compress well, and might therefore require fewer “overflow” pages.

I did not test with a smaller KEY_BLOCK_SIZE as we had minimal time to effect the table compression modification (given the long run-time of the ALTER TABLE), you may find your application benefits from a different KEY_BLOCK_SIZE value. Also note that you need to enable the dynamic variable innodb_file_format=BARRACUDA (don’t forget to set it in my.cnf!):

SET GLOBAL innodb_file_format=BARRACUDA;

One caveat: you must be running with innodb_file_per_table=1 as the InnoDB system tablespace cannot be compressed, see this page for further details on how to enable compression for a table.

To utilize Barracuda format tables you will need to create them new and migrate data, or affect existing tables with an ALTER TABLE statement. As table compression is table specific, ROW_FORMAT and KEY_BLOCK_SIZE directives are passed via CREATE TABLE or ALTER TABLE statements. In our case, we chose to re-build the table using ALTER TABLE via a null-operation like this:

In our case even though the customer had a 3GHz 24-core machine the ALTER TABLE was progressing slowly as it was bound to a single CPU while compressing the data. Just have patience. Keep in mind too that if you started with a 100GB table and assuming you know your approximate compression rate, you will be left with a considerably smaller on-disk footprint so ideally you will be able to postpone that purchase of additional disk capacity.

So what was the real-world outcome of this exercise? We were able to show a 70% improvement in queries against this table when the TEXT fields were not part of the query request due to Barracuda not storing 768 bytes of the blob on field, and reduce the table down to 30GB. Happy customer

There is a great conference – SELF (South East Linux Fest) taking place in June 8-9 in Charlotte,NC
This conference has a great lineup for MySQL with 2 tracks on Friday dedicated to it. There are many great speakers both from Oracle and community on the list.
From Percona I will speak about Optimizing MySQL Configuration and Baron will talk about Forecasting MySQL Scalability from TCP Traffic.
See you there !

If you follow the general advices to create secure password the following ones seem to be secure, right?

s11P$||!sh&2

pr0&!!ke0

3kj39|!381

The answer to the question is, “it depends on how you use them”

Notice that these passwords all contain multiple exclamation points and ampersands which are normally special characters for your shell. The people tend to copy and paste them directly to the terminal but that can lead to some non-predictable behavior and therefore cause big problems depending on the character combination.

In this last example, !381 asks Bash to run the command with number 381 in the history, that is: “/etc/init.d/mysql restart”

These are only a small number of examples. Of course, I’m not telling you to use simple passwords. The lesson that we can learn here is “don’t paste your password on the shell“. First because they’re going to be logged to Bash’s history file and second because some combinations can cause big problems.

In order to avoid these problems:

Don’t run commands from the root account. There are no small mistakes when you make them from root.

Use .my.cnf in your home directory to avoid typing a password. Protect the file with permission mode 0600.

Use the -p option with mysql and its utilities. This makes the programs prompt you for a password from the tty.

Based on MySQL 5.1.63, including all the bug fixes in it, Percona Server 5.1.63-13.4 is now the current stable release in the 5.1 series. All of Percona‘s software is open-source and free, all the details of the release can be found in the 5.1.63-13.4 milestone at Launchpad.

Bugs fixed:

Building Percona Server with the Clang compiler resulted in a compiler error. Bug fixed#997496 (Alexey Kopytov).

During the last few weeks, it has become a little silent here at Use The Index, Luke! Today, I’d like to tell you the news that caused the silence and give a short outlook about the future of Use The Index, Luke!

The most important news is that I became father. It’s very important to me, at least. My son, Marcel, was born on 25th March 2012. So, we already had two month to settle down. Mommy is quite good at the night shift—I can sleep almost normally and have enough time and energy to work during daytime.

Most of that time I’ve spent to complete the book. Yes, it is “done”. As done as an online IT book can get. That means, the originally planned content is—after two years—finalized. The German translation was completed at about the same time. Use The Index, Luke! is now fully available in two languages.

During these two years of writing, I have always strived for an exciting text, because I know database tuning is not a popular topic for developers. I designed the structure so that one topic leads to the next in the hope it is read from cover to cover in one shot. The problem with the website is, however, that most people just see one or two pages. Those pages are often are often out of context and don’t make sense on their own. To cut a long story short: I believe the text only works when read entirely. So, I decided to publish the book as a real book.

At the moment, there is only a German edition available (to my own surprise). That is because I found some confusingly written paragraphs during translation. Definitely not good enough for printing. Those parts were not just translated, but rewritten from scratch. In two languages—at least at the beginning. Due to time constraints, I had to stop editing the English text after a while. Just writing in my native language was, of course, faster.

You can buy the German edition directly via http://sql-performance-explained.de/ or—private consumers in Germany—also in my amazon.de-Shop. Delivery time is the same in both cases because I’m always using amazon’s delivery network for Germany—no matter which way you order.

Finally, I’ll give you a small outlook what’s happening next. I’m currently completing the English text, of course. I’m looking for a lecturer, btw; A native speaker, preferably with German and IT knowledge (contact). Once the text is finalized, printing is no big issue. After that, I need to take care to earn some money again. From business perspective, this year was a disaster until now. Not only because I spent quite some time on the book, but also to prepare the arrival of my son. I finally took drivers education, for example. One would not believe how much time that takes.

Earning money is for me mostly about doing workshops and Instant-Coaching. I’m constantly getting inquires for the workshop. The problem is, however, the travel time and expenses. That makes it too expensive very often. As workaround, I try to get three (or more) clients in the same area—in that case, the travel time and expenses is no problem anymore. So, if you would like to hire me for a workshop, just leave me a short note. For the planning, I just need a postal code. The book is part of the course material, of course. At the moment, I’m only doing on-site workshops in the D-A-CH region. I’m, however, getting inquires from all over the world (thank you!). But that is even harder to manage. For those of you living outside Austria, Germany or Switzerland, please consider using the Instant-Coaching model. It is a virtual workshop, using your system. I can explain the principles based on examples from your own application, if you like.

If I find some time in between, I’d like to process my blog backlog. I’m having topics for about 20 articles in the pipeline—some of them for the miserably neglected myth directory. I can also imagine writing some opinion articles. I’ve always aimed to stay objective when writing for the book. Now, I might write possibly some…well…more emotional articles.

I did not yet make up my mind regarding long-term planning for Use The Index, Luke! I will, of course, maintain and update the material. Let me know if you have any topic you would like to see on Use The Index, Luke! Your wish might come true.

Whenever I learn about a new technology, I typically want to know the major administrative touch points more than I want to know exhaustive detail about every configuration option. Give me the gist, show me enough to get started, and give me a link to the manual.

XtraDB cluster (and Galera, the technology on which it is based) has been attracting a lot of interest in the community and we want to start presenting information about both what we know is essential, but also what we suspect will become essential as more and more production experience is logged with these important new technologies.

I used to think one should never look at max_binlog_size, however last year I had a couple of interesting cases which showed that sometimes it may be very important variable to tune properly. I meant to write about it earlier but never really had a chance to do it. I have it now!

One of our customers was complaining that the database would lock up at random times and then it would go back to normal in just a few seconds. This was MySQL 5.0 running MyISAM/InnoDB mix, not heavily loaded. We used pt-stalk (at that time it was aspersa stalk) trying to figure out what is happening, however all we found was a spike in writes, many queries piled up and looking at the system process list it was quite obvious that page flush daemon was acting out. I/O Pattern was rather awkward – here is an output from pt-diskstats (again, aspersa diskstats at that time):

You can see that (a) reads are affected – response time dropped significantly and (b) the reason appears to be enormous amount of writes. For a while I was really puzzled, looking for answers in MySQL status counters, innodb status, memory info and what not. It seemed that the problem was gone before pt-stalk would kick-off and all I could see was symptoms. Indeed the case it was. I was not allowed to run pmp or oprofile on that system due to potential performance impact. However, somehow I have noticed that pt-stalk kick-off time matched that of binary logs modification time:

Apparently, every time current binary log gets close to max_binlog_size in size, new binary log is created and, if expire_logs_days is not zeroish, log files older than expire_logs_days are removed. &LOCK_log mutex is held for the duration of the binary log removal and if your file system is slow at deleting files (both customers systems ran on ext3, which can be very slow deleting large files), then write operations in MySQL will suffer since that very same mutex is used inside MYSQL_LOG::write() call. Even though it only lasts 1-2s, it still gives users unstable response time – here’s how that looks on one of the customer system – this one was doing ~10k qps at a time:

A small blip, yet over 100 users are backlogged for a couple of seconds – not great user experience, considering this happens every 10-15 minutes.

Solution?

Here’s what we did: we have reduced the size of binary log file from default 1GB (some systems have it set to 100MB in their default my.cnf) down to 50MB and we never saw this problem ever again. Now the files were removed much more often, they were smaller and didn’t take that long to remove.

Note that on ext4 and xfs you should not have this problem as they would remove such files much faster.

Every time I have a conversation on SSD, someone mentions btrfs filesystem. And usually it is colored as a solution that will solve all our problems, improve overall performance and SSD in particular, and it is a saviour. Of course it caught my curiosity and I decided to perform a benchmark similar to what I did on ext4 filesystem over Intel 520 SSD.
I was prepared for surprises, as even on formatting stage, mkfs.btrfs says that filesystem is EXPERIMENTAL. In case with filesystems I kind of agree with Stewart, so question #1, what you should ask deciding on what filesystem to use, is “Was this filesystem used in a production more than 5 years?”, so from this point, btrfs has a long way ahead.
How you can get btrfs? Actually it is quite easy if you are on CentOS/RedHat/Oracle Linux 6.2.
Oracle provides Unbreakable Enterprise Kernel, which includes btrfs, so you can get it with this kernel. And installation is quite easy and straightforward, just follow instructions.

So, to numbers. Workload and benchmark are exactly the same as in my previous benchmark, and I perform runs only for 10 and 20GB buffer pool, as it is enough to understand picture. The previous run was done on ext4, so if we repeat the same on btrfs, it will allow us to compare the results.

I format btrfs with default options, and mount it with -o ssd,nobarrier options.

Throughput results:

We can see that btrfs not only provides worse throughput (5x!), but it is also less stable.

Response time:
The same happens with response time. Actually 95% response time is about 10x worse with btrfs.

And response time, timeline:

We can see that btrfs is very far from providing a stable response time.

I guess the conclusion is obvious, and I think it is fine for a filesystem that is in the EXPERIMENTAL state.
Most likely it is some bug or misconfiguration that does not allow btrfs to show all its potential.
I just will consider all talks of btrfs characteristic as premature and will wait until it is more stable
before running any more experiments with it.