Performance Analysis Tools for Linux Developers: Part 2

By Mark Gray and Julien Carreno, October 23, 2009

Setting performance profiling and analysis goals

Mark Gray is a software development engineer working at Intel on real-time embedded systems for telephony. Julien Carreno is a software architect and senior software developer specializing in embedded real-time applications on Linux.

In Part 1 of this article, we summarized some of the performance tools available to Linux developers on Intel architecture. In Part 2, we cover a set of standard performance profiling and analysis goals, along with scenarios that demonstrate which tool or combination of tools to select for each one. In some scenarios, the depth of analysis is also a determining factor in selecting the required tool. As the investigation deepens, we change tools to get more detail and a narrower focus from them. This is similar to using a microscope with different magnification lenses: we start with the smallest magnification and gradually increase it as we home in on a specific area.

Methodologies

In any performance analysis or profiling exercise, it is the authors' experience that there are two critical pieces of information that need to be present from the start:

What is my expected system behavior? In other words, how do I expect the system to behave under normal conditions? In a structured project environment, this translates to a very clearly-defined set of requirements at a system level as well as, possibly, at an individual component or application level.

What is my problem statement? Simplistically, this can be one of two possibilities:

My system is not behaving according to expectations.

My system is behaving as expected, but I want to know what "makes it tick". I want to be able to answer questions such as: "Where are my CPU cycles being spent?", "How much memory am I really using?" This information can be used to understand any inefficiencies in my algorithm or problem areas. This information may also be used to accurately predict how the system will scale to support higher workloads.

When items 1 and 2 above are clear, you have effectively determined "where you are" and "where you want to be". For the purposes of this article, we focus on scenarios in which the system is not behaving according to specifications rather than measurement on a working system.

From experience, it is critical to apply a structured method at the start of any performance analysis, since any activity with an inappropriate tool can be a complete waste of time. Performance can be broadly affected by issues in three distinct areas: CPU occupancy, memory usage, and IO. As a first step, it is absolutely essential to determine which area your problem comes from, since the tools mainly focus on one of these three areas to provide any kind of detailed data. Hence, the first step is always to use general tools that provide a high-level view of all three areas simultaneously. Once this has been done, the developer can delve deeper into a specific area using tools with an increasing level of detail and, potentially, increasing invasiveness. Avoid assuming which category the investigated problem falls under and skipping the first high-level analysis; such assumptions have proven counter-productive on numerous occasions.
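As a sketch of that first high-level pass, the commands below take one quick snapshot of each of the three areas using common procps tools. The tool choice and the one-second sampling interval are our illustrative assumptions, not a prescription:

```shell
#!/bin/sh
# First-pass triage: one quick look at each of the three areas before
# committing to a deeper, more invasive tool.

# CPU: one batch-mode iteration of top; the summary lines include the
# user, system, idle and IO-wait percentages.
top -b -n 1 | head -5

# Memory: current snapshot of total/used/free, in megabytes.
free -m

# IO: vmstat's "wa" column is CPU time spent waiting on IO (guarded in
# case vmstat is not installed); three one-second samples give a first
# impression.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3
fi
```

If any one of the three readings stands out, that area is where the deeper tools should be aimed.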

When doing performance analysis on a working system to understand what makes it tick, a number of things must be taken into account. Avoid overkill: for example, if only a simple CPU performance measurement of a working system is required, it may be sufficient to use a non-invasive, high-level analysis tool such as ps. The depth of analysis should be determined a priori by all interested parties.
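For instance, a single ps snapshot imposes essentially no load on the system being measured. The column selection below is just one reasonable choice of standard ps format specifiers:

```shell
# One-shot, non-invasive per-process CPU/memory snapshot, sorted so the
# heaviest CPU consumers appear first.
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -10
```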

Start at the 10,000 ft View

As stated earlier, the starting point of any analysis should be a set of system-level measurements meant to provide an indication of the system state, most notably:

CPU occupancy, total and per logical core

Memory usage, snapshot and evolution over time

IO, CPU IO waits
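The three system-level measurements above map onto sar invocations roughly as follows. sar comes from the sysstat package, and the one-second intervals and sample counts are illustrative choices:

```shell
#!/bin/sh
# Guarded in case sysstat is not installed.
if command -v sar >/dev/null 2>&1; then
    # CPU occupancy, total and per logical core
    sar -P ALL 1 2

    # Memory usage: repeated samples show the evolution over time
    sar -r 1 2

    # IO: the %iowait column shows CPU time stalled waiting on IO
    sar -u 1 2
fi
```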

For our purposes here, it is assumed that we are dealing with a single problem area at a time, and that the high-level analysis identifies which area that is. Scenarios covering analysis of a system with both CPU occupancy and memory usage problems, for example, are not covered here.

Figure 1: top View (Fully-Loaded Single Core System)

Figure 2: top View (Half-Loaded Dual-Core System)

Figure 3: sar System-Wide Increased Memory Usage View

Figure 4: sar IO Wait CPU Usage View

Figure 5: ps View (Loaded System)

Figure 6: iostat View (Loaded System)

Having applied our methodology of first performing a high-level analysis covering CPU, IO, and memory for each of the scenarios below, we can see in Figure 1 that our CPU usage is approximately 90%. Our main problem here is CPU occupancy, as the vast majority of cycles are being spent in user space. Our next step should be to examine more closely the applications running on the system. Using ps, in Figure 5, we can see that we have a number of applications running concurrently on the system and that our VoIPapp is by far the biggest CPU user. We should examine our VoIPapp in more detail, see "CPU Bottlenecks".
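One way to pick out the dominant consumer, and to check whether its usage is sustained rather than a momentary spike, is to compare total CPU time consumed against process lifetime. The column choice here is our illustrative assumption:

```shell
# cputime is total CPU consumed since the process started; etime is its
# wall-clock lifetime. A high pcpu backed by a large cputime over a long
# etime indicates a sustained consumer worth deeper analysis.
ps -eo pid,pcpu,cputime,etime,comm --sort=-pcpu | head -10
```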

In Figure 2, we can see that our overall CPU occupancy is just under 50%; however, we are using 99% of one core and virtually nothing of the second available core. We should examine our threading model, see "Optimizing a Complete System" and "CPU Bottlenecks".
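A per-core and per-thread breakdown confirms this kind of imbalance. mpstat is part of sysstat, and the PID below is only a placeholder (this shell's own PID) standing in for the suspect process:

```shell
#!/bin/sh
# Per-core view: one core near 100% while the others idle means a single
# thread is carrying the load (guarded in case sysstat is missing).
if command -v mpstat >/dev/null 2>&1; then
    mpstat -P ALL 1 2
fi

# Per-thread view of a suspect process; $$ is a placeholder PID. If one
# thread owns nearly all the %CPU, the threading model is the place to look.
ps -L -o tid,pcpu,comm -p $$
```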

Comparing Part 1, Figure 7 with Figure 3, we can see that our memory usage is increasing over time; further measurements may indicate that we have a memory leak that is affecting system behaviour, see "Investigating a Memory Issue". From Figure 4, we can see that the CPU is spending an inordinate amount of processing time waiting on IO. We should investigate the reason for the high number of IO waits, see "IO Bottleneck Issue". Optionally, we can use iostat to assess the loading of the block devices in the system and quickly determine whether they are a factor in the bottleneck. For instance, in Figure 6, it is apparent that during the file copy the bottleneck is the block device, which is heavily loaded.
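Both checks can be sketched with sysstat tools; the intervals are illustrative:

```shell
#!/bin/sh
# Memory trend: steadily shrinking free memory across repeated sar -r
# samples, with no recovery when load subsides, is the classic leak
# signature.
if command -v sar >/dev/null 2>&1; then
    sar -r 1 3
fi

# Block-device load: in iostat's extended statistics, %util approaching
# 100 marks a saturated device, as in the file-copy case of Figure 6.
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 2
fi
```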
