Enthusiasm never stops

The Amazon Elastic File System (EFS) is a very intriguing storage product. It provides simple, scalable, elastic file storage for use on an EC2 virtual machine. The file system can be mounted over NFS at one or more EC2 machines simultaneously, and it also supports file locking.

Here are some important facts which I found out while doing my tests:

I/O operations per second (IOPS) are not the same metric that we’re used to measure when dealing with block devices like HDD or SSD disks. When working with EFS, we measure the NFS I/O operations per second. These correspond 1:1 to the read() or write() system calls that your applications make.

The size of the issued I/O requests are another very important metric for EFS. This is the real bytes transfer between your EC2 instance and the NFS server.

Therefore, we’re limited by both the NFS I/O requests per second, and the total transferred bytes for those NFS I/O per second.

NFS I/O requests smaller than 4096 bytes are accounted as 4096 bytes. Regardless if you request 1 bytes, 1000 bytes, or 4096 bytes, you will get 4096 bytes accounted. Once you request more than 4096 bytes, they are accounted correctly.

You need more than one reader/writer thread or program, in order to achieve the full IOPS potential. One writer thread in my tests did 130 op/s, while 20 writer threads did 1500 op/s, for example.

The documentation says: “In General Purpose mode, there is a limit of 7000 file system operations per second. This operations limit is calculated for all clients connected to a single file system”. Our tests confirm this — we could do 3500 reading or 3000 writing operations per second.

CloudWatch has different aggregation functions for the *IOBytes metrics: min/max/average; sum; count. They represent different aspects of your EFS metrics, namely: the min/max/average IO operation size in bytes; the total transferred bytes in a minute (you need to divide to 60 to get the “per second” value); the total operations in a minute (you need to divide to 60 to get the “per second” value).

The CloudWatch EFS metrics “DataReadIOBytes” and “DataWriteIOBytes” reflect exactly what we see on the Linux system for “kB/s” and “ops/s” by the nfsiostat program. The transferred bytes reflect exactly the used bandwidth on the Linux network interfaces.

The “Metered size” in the AWS Console which is the same value as what you see by the “df” command is not updated in real-time. It could take more than an hour to reflect the real disk usage.

There is plenty of initial burst credit balance which lets you do some heavy I/O on your freshly created EFS file system. Our benchmark tests ran for hours with block sizes between 1 byte and 10k bytes, and we still had some positive burst credit balance left at the end.

I’m using the default NFS settings by the NFS mount helper provided in the “Amazon Linux 2” OS:

I created 40 different files, so that I can run 40 different single benchmark programs on an EC2 instance – one for each file. This increases concurrency and lets the total throughput scale better.

Sequential writing and reading

Sequential writing and reading performed as expected – up to the “PermittedThroughput” limit shown in the CloudWatch metrics. In my case, for such a small EFS file system, the limit was 105 MB/s.

Writing: NFS I/O operations per second

Here are the results:

Writing from one EC2 instance using 1 byte, 1k bytes, or 10k bytes: regardless of the request size, we get up to 2000 IOPS. Typically the IOPS are between 1400 and 1700.

Writing from two EC2 instances using 1 byte, 1k bytes, or 10k bytes: regardless of the request size, we get up to 3000 IOPS in total which are equally spread across the two EC2 instances.

The “PercentIOLimit” CloudWatch metric shows 84% when we do 2880 ops/s, for example. Therefore, the total IOPS limit for writing is about 3500 ops/s.

When doing only write() system calls with 1 byte data, only “DataWriteIOBytes” is accounted by EFS which is an advantage for us. A real block file system needs to read the block (usually 4k bytes), update 1 byte in it, and then write it back on disk. I feel like this needs additional testing with more random data, so test for yourself, too. Note that the minimum accounted request size in EFS is 4kB.

Reading: NFS I/O operations per second

Here are the results:

Reading from one EC2 instance using 1 byte or 10k bytes: regardless of the request size, we get up to 3500 IOPS. One EC2 instance is enough to saturate the EFS limit.

Reading from two EC2 instances using 1 byte or 10k bytes: regardless of the request size, we get up to 3500 IOPS in total which are equally spread across the two EC2 instances.

The “PercentIOLimit” CloudWatch metric shows 100% when we do 3500 ops/s. Therefore, the total IOPS limit for reading is 3500 ops/s.

It turns out that there is no standard “apt” command which lists where a package was installed from. You may need this information if you have added additional APT repositories to your Debian/Ubuntu installation. I see a lot of questions at the forums (1, 2, 3, 4) and the proper solution tends to be “parse apt-cache output yourself”. Here is my solution which is very similar to this one:

Virtual servers like EC2 usually get a random external IP address which is not suitable for outgoing SMTP. That’s because these “pool” IP addresses lack reverse DNS resolving, and their spam reputation is unknown because somebody before you may have used them to send out spam.

Still you need to be able to get email notifications from these machines because many vital services like the crontab, for example, send diagnostic emails to “root” or other local mailboxes, depending on the user that a cron job is being executed with.

One possible solution is to catch all mail sent to any email address (local or remote), forward it to Amazon Simple Email Service (SES), and let SES do the actual SMTP delivery for you.

Open the file “/etc/postfix/main.cf” and add the following two statements there:

The first directive ensures that the “From” address is being rewritten to your single external destination email (read the docs), while the second directive forwards all locally delivered mail to the same single external email address (SF article). Note that if “alias_maps” directive already exists in the “main.cf” file, you need to comment it out.

You can configure the single external email address to forward to by creating the file “/etc/postfix/email_rewrites” and then putting the following in it:

/.+/ mailbox@example.com

Finally, execute the following commands, so that Postfix picks up the new configuration:

postmap /etc/postfix/email_rewrites
/etc/init.d/postfix restart

If you decided to use Amazon SES for email delivery, there are a few additional steps to do:

Project Key: leave empty or enter a project name to limit the scope of this notification only for it

Events: Issue Created

Condition: !issue.isSubTask()

Event: Issue Created (only parents)

Finally, navigate to “Project settings -> Your project -> Notifications” and then “Actions -> Edit”. At the bottom there is the newly created custom event “Issue Created (only parents)”. Add one or more recipients and enjoy.

A quick note when testing the custom event. If you added only yourself as the recipient of the new event, JIRA won’t send you an email when the event fired. JIRA correctly thinks that it doesn’t make sense to notify the creator of the issue, as he or she already knows what they did. Therefore, in order to test this properly, you need to add someone else’s account or group, and then create a new test issue. Or do it vice versa by adding your account and let someone else create a new test issue.

Many years ago I wrote the library popen_noshell which improves the speed of the popen() call significantly. It seems that now there is a standard and very efficient way to achieve the same. Use the posix_spawn() call. Its interface is a bit grumpy and complicated, but it can’t be very simple after all, because posix_spawn() provides both great efficiency and lots of flexibility.

Let us first examine the different ways of spawning a process on Linux 4.10. Here are the different implementations of the following functions:

In the latest versions of the GNU libc, posix_spawn() uses a clone() call which is equivalent to the vfork() arguments of clone(). Therefore, a logical question pops up – why not use vfork() directly. “The problem are the atfork handlers which can be registered. In the child process they can modify the address space.”

Of course, it would be best if posix_spawn() was implemented as a system call in the Linux kernel. Then we wouldn’t need to depend on the GNU libc implementations, which by the way differ with the different versions of glibc. Additionally, the Linux kernel could spawn processes even faster.

The current implementation of posix_spawn() in the GNU libc is basically a vfork() with a limited, safe set of functions which can be executed inside the vfork()’ed child. When using vfork(), the child shares the memory and the stack of the parent process, so we need to be extra careful indeed. There are plenty of warnings in the man pages about the usage of vfork().

I am glad that my implementation and this of the GNU libc guys is very similar. They did a better job though, because they handle a few corner cases like custom signal handlers in the parent, etc. It’s worth to review the comments and the source code of the patch which introduces the new, very efficient posix_spawn() implementation in the GNU libc.

In order to achieve a desired fault tolerance, we must examine and understand how the cluster responds to node failures. This Percona blog article gives some good insight, but some more clarifications are needed.

There are two different kinds of failures that may occur while the cluster is operating:

Simultaneous failure of two or more nodes.

One-by-one failure. This is the case when a node fails, the cluster notices this and reacts before another node fails, too. The usual time for reaction is 5 seconds which is controlled by the “suspect timeout” setting (evs.suspect_timeout).

When one or more nodes fail, the cluster reacts in a very clever way:

The remaining alive nodes run a quorum vote, in order to determine if their count is >50% of the last cluster size.

If the cluster can continue to operate with a quorum, it re-adjusts its size! This is a crucial feature which lets the cluster lose nodes one-by-one until only two active nodes are online.

❓ Now the question is, if we have a cluster with 3 data nodes, is it possible to lose 2 of them, and still continue to operate? The answer is “yes”, but only if you lose them one-by-one, and only if you run an additional arbitrator node in your cluster. Note that the arbitrator does not store any data but still participates in the whole network replication traffic, in order to be able to vote.

Three data nodes and an arbitrator node make a cluster with a size of four nodes. When one of the nodes fails, 3/4 of the nodes are alive which is >50% quorum and the cluster continues to operate by reducing its size to three. When a second node fails, 2/3 of the nodes are alive which is >50% quorum and the cluster continues to operate with a size of two. You cannot lose a third data node since you have only three initially anyway. 🙂

If you lose two data nodes simultaneously, or if you lose one data node and the arbitrator (total of two nodes) simultaneously, you are out of luck. Only 2/4 of the nodes are alive which is not >50% quorum. The cluster stops to operate, in order to prevent a possible split-brain situation.

Here is a summary on how many nodes you can lose with the two different cluster configurations. Note that the arbitrator counts as a regular node when it comes to losing it:

So running an arbitrator in a three-node MySQL Galera Cluster makes total sense, if you can allocate one more separate machine with the same network capabilities.

✩ Note that regular MySQL node shutdowns are handled differently by the cluster. When a node leaves the cluster via a normal shutdown, it informs all members of the cluster about this. Therefore, it should be safe to shut down even 2 out of your 3 data nodes simultaneously.