Plumbr now detects slow filesystem operations

May 17, 2016 by
Ivo Mägi
Filed under:
I/OProduct Updates

Plumbr is all about reducing the time spent while troubleshooting performance issues. Our automated root cause detection means that whenever the user experience starts to degrade, Plumbr will capture the required information from within the application.

As a result, Plumbr users are exposed to the root cause, down to the single line in source code. Our aim has always been to detect all the possible reasons for the poor performance.

We are already doing a good job for that matter and as of today we have improved even more. Whenever the application at hand is suffering from heavy filesystem usage, Plumbr now exposes the culprits.

The problem

Java applications carry out the input/output operations on a physical disk via streams. The performance issues occur during the read & write operations from & to a file on a disk, as the application code needs to wait for the file to be read or written before continuing. The problem with this is the access time. The access time is the time required for a computer to process a data request from the processor and fetch the data from the storage device.

With the introduction of SSDs, there has been some relief introduced. Instead of the mechanical disk where you need to wait for the disk to rotate to the required disk sector, the SSDs use integrated circuits to store the data. Without limitations of the mechanical components, the SSDs perform much better.

Instead of the spinning disk, where access time can reach 10s of milliseconds in the bad cases, the added latency can now be measured in tens of microseconds. Comparing to RAM, where the latency is measured in tens of nanoseconds, it is however still orders of magnitude slower.

To illustrate the difference, consider the following example:

If fetching data from memory would be equivalent to a jet fighter capable of doing 1,200 mph,

then fetching data from an SSD drive would be comparable to a toddler wandering around at a pace of 1.2 mph

and the spinning disks would be more like snails, crawling just around six feet per hour.

So, even though there are a lot of optimizations like buffering and prefetching involved on all the levels, we have to pay a hefty price for each request to the file system. It is still very easy to end up with a huge latency increase, especially when reading or writing in many small chunks.

Example

In case when the poor use experience is traced down to filesystem, Plumbr exposes the root cause similar to the example below:

As seen from the above, the particular problem was the end user being forced to wait for seven seconds due to the file system access. What makes the case interesting are the details exposed by Plumbr:

The file at /home/prod/fs1 folder is first read one byte at the time to calculate its size. As the file is 23MB in size, this results in 24,295,767 byte-size read operations being carried out. This takes almost all of the seven seconds.

As the next step, when the size is calculated, the file is read again, now all in one go, and carrying out the whole operation in just 14 milliseconds.

Getting exposure to programming errors like this allows you to patch your code quickly. The example above also includes the location in source code where the operations are launched, so all you are left with is to proceed with the optimization. So in case you are already eager to see how your application is performing in regards of filesystem usage, go and grab the free trial to start monitoring.

How we implemented the support

The Plumbr Agent monitors file reading and writing operations performed by using FileInputStream and FileOutputStream classes. We have detected several problems impacting reading from or writing to a file streams:

Lack of buffering: each read or write operation incurs overhead. The exact overhead depends on the operating system, file system and hardware in use. To reduce this overhead, the common solution is to carry out the read or write operations in bulk, instead of reading or writing one byte at a time. A simple solution to the buffering problem would be to introduce BufferedInputStream or BufferedOutputStream around the unbuffered streams.

Storage issues: like we said above, the performance of file operation depends on the operating system, the file system and the hardware. It is sometimes the case that one of these becomes the bottleneck, and even a single file stream operation could take tens of seconds.

Second part of the operations being monitored by Plumbr revolves around filesystem metadata operations. The Plumbr Agent monitors file attribute querying performed by using methods such as File.exists(), File.isDir(), File.canWrite() and so on. While individual operations like that are usually handled in just a few microseconds, having a large number of them may result in a poor user experience. One of the most common cases we have detected is recursively walking a large directory that contains millions of files.