Month: March 2016

Recently, I decided to upgrade a database server from RHEL 6 (CentOS 6) to RHEL 7 (CentOS 7), which involves switching from MySQL 5.5 to MariaDB 5.5. Our server hosts about 100 databases, when I was testing them individually, I didn’t see any problem. However, when I ran the back up all databases one by one using mysqldump (i.e., running mysqldump command for each database, one after one, 100 times), something funny happened. Here is the error message:

This error message means the MySQL cannot access the file. If you google the message, you will notice that there are tons of solutions, and almost every of them suggests you to increase the open_files_limit variable in my.cnf.

Therefore, I checked my configurations (/etc/my.cnf), and I noticed that the value was already set to 30000. I also checked the lsof command and I found something very interesting. Notice that I have 100 database, each of them contains about 60 tables. Each table has about 3 files. Depending on the timeout settings, if all database and tables are opened, the total number of opened file will be 100x60x3 = 18,000

sudo lsof -u mysql | wc
1045 25811 239248

This result suggests that at the time of crashing, the mysql user (the system user that run the MariaDB service) was accessing 1045 files at the same time.

So I was scratching my head. Why I already set the open_files_limit value to 30000 already, and the system crashed at 1045th files? I also verified the memory (command: free) and current process (command: top), and I didn’t find anything unusual. One last thing, I checked the open_files_limit value using MySQL terminal, and this is what I found:

It seems that MariaDB didn’t honor the open_files_limit I set in config file, instead it uses the default one, which isn’t right. So after some investigations, I’ve noticed that RHEL 7 set up some security stuffs, such that you will need to set the open_file_limit variable at the system level rather than the application level. In the other words, whatever you put in the /etc/my.cnf, it won’t go through the security check at RHEL.

Here is how to set the equivalent open_files_limit at the system level:

One of the biggest selling points of RHEL is the stability. When we upgraded from RHEL 6 to RHEL 7 (clean install), we expected that everything should work fine without too much modifications. Unfortunately, what I saw is a broken system. I really don’t expect that this happens in an enterprise class product.

After I upgraded the CentOS / RHEL system to the latest kernel, the ZFS failed to start. The system was unable to load the ZFS module, i.e., I could not access my data. Here are some error messages I found on the system:

So what does these messages mean? Before I explain the details, let me explain how ZFS works on Linux. For legal reasons, unlike *BSD, Linux kernel does not support ZFS. In order to make Linux talks to ZFS, some people came up a very smart way: They inject the ZFS library at the kernel level, such that when Linux boots, it knows how to handle the ZFS. It sounds pretty ideal, isn’t it?

And now, we have a problem.

Many system administrators like to let the system upgrade automatically (such as running yum update -y in the cron job etc). Unlike *BSD, Linux bundles the kernel and application update together. In the other words, when you run the yum update, it will update both kernel and applications together, and there is no way for you to pick one and skip the other.

When the system upgrades the kernel, it refreshes everything, i.e., the new kernel will not know what is ZFS, because the process of injecting the ZFS happens when we install the ZFS on Linux. If there is no new version available, this process will not happen. So what happen after you reboot the computer, which by default, load the latest kernel? You got it, the ZFS won’t be loaded and your data is not accessible.

There are few ways to handle this. First, if you really want to keep your system up to dated (which I don’t recommend), exclude the kernel from the system update.

sudo nano /etc/yum.conf

[main]
.....
exclude=kernel*

It doesn’t mean your system is 100% safe from now on. You may still get some chances to break your ZFS. Here is some funny messages after I turn on the exclusion and run the yum update:

Loading new zfs-0.6.5.4 DKMS files...
Building for 2.6.32-504.23.4.el6.x86_64
Building initial module for 2.6.32-504.23.4.el6.x86_64
Done.
Adding any weak-modules
ERROR: modinfo: could not open /lib/modules/2.6.32-358.el6.x86_64/weak-updates/: Is a directory
ERROR: modinfo: could not open /lib/modules/2.6.32-504.23.4.el6.x86_64/zavl.ko: No such file or directory
FATAL: /lib/modules/2.6.32-504.23.4.el6.x86_64/zavl.ko: No such file or directory
Warning: Module zavl.ko from kernel has no modversions, so it cannot be reused for kernel 2.6.32-358.el6.x86_64
ERROR: modinfo: could not open /lib/modules/2.6.32-358.el6.x86_64/weak-updates/: Is a directory
ERROR: modinfo: could not open /lib/modules/2.6.32-504.23.4.el6.x86_64/znvpair.ko: No such file or directory
FATAL: /lib/modules/2.6.32-504.23.4.el6.x86_64/znvpair.ko: No such file or directory
Warning: Module znvpair.ko from kernel has no modversions, so it cannot be reused for kernel 2.6.32-358.el6.x86_64
ERROR: modinfo: could not open /lib/modules/2.6.32-358.el6.x86_64/weak-updates/: Is a directory
ERROR: modinfo: could not open /lib/modules/2.6.32-504.23.4.el6.x86_64/zunicode.ko: No such file or directory
FATAL: /lib/modules/2.6.32-504.23.4.el6.x86_64/zunicode.ko: No such file or directory
Warning: Module zunicode.ko from kernel has no modversions, so it cannot be reused for kernel 2.6.32-358.el6.x86_64
ERROR: modinfo: could not open /lib/modules/2.6.32-358.el6.x86_64/weak-updates/: Is a directory
ERROR: modinfo: could not open /lib/modules/2.6.32-504.23.4.el6.x86_64/zcommon.ko: No such file or directory
FATAL: /lib/modules/2.6.32-504.23.4.el6.x86_64/zcommon.ko: No such file or directory
Warning: Module zcommon.ko from kernel has no modversions, so it cannot be reused for kernel 2.6.32-358.el6.x86_64
ERROR: modinfo: could not open /lib/modules/2.6.32-358.el6.x86_64/weak-updates/: Is a directory
ERROR: modinfo: could not open /lib/modules/2.6.32-504.23.4.el6.x86_64/zpios.ko: No such file or directory
FATAL: /lib/modules/2.6.32-504.23.4.el6.x86_64/zpios.ko: No such file or directory
Warning: Module zpios.ko from kernel has no modversions, so it cannot be reused for kernel 2.6.32-358.el6.x86_64
depmod...
DKMS: install completed.

The second thing you will need to do is to increase the /boot partition from the default 200MB to at least 2GB. By default, RHEL will create a 200MB /boot for storing the kernel files. Kernels are small and they rarely go beyond 40MB. However, RHEL will only keep up to 5 recent kernels (40MB x 5 = 200MB), and it will remove the rest. So what happen if it removes the one that works with ZFS? The only thing you can do is to reinstall the system and import your ZFS again.

sudo zpool import

Here is how to modify the number:

sudo nano /etc/yum.conf

#Tell the system to keep the most 20 recent kernels
installonly_limit=20

Another thing you may want to do is to select the working kernel (instead of the latest) one when boot. Here is how to change it: