I recently discovered a Linux shell script that was running slowly due to an inefficiently implemented loop. This innocent investigation ended up uncovering misleading information from time and a bad interaction between the Linux thread scheduler and its CPU power management algorithms.

How often does an eight line shell script expose such rich information?

Conclusions first, explanations later

Using $(expr) to do math is over a hundred times slower than using bash internals. This is not a new discovery, but it’s worth reiterating.

Excessive use of sub processes (see point 1) can confuse the Linux kernel’s power management algorithms, leading to an additional three times slowdown. Interesting…

time gives accurate elapsed time information but its CPU time information should not be trusted, especially if you make excessive use of sub processes (see point 1). Interesting…

When time lies you can always use the Windows Task Manager to measure Linux CPU consumption. Amusing.

Looping in Windows batch files can be faster than looping in Linux shell scripts.

Process creation in Windows is slower than on Linux. Not surprising.

This is part of a series of posts about Linux. Previous posts include:

The setup

We recently added a new shell script to our build process and I noticed that this was causing a two times to three times slowdown on our builds. Not good. One of my coworkers found that the use of $(expr) to loop over arguments was the culprit. A minimal repro is shown here:

Note: this code copies/pastes much better on Chrome than on Internet Explorer. FWIW.

On my 3.2 GHz six-core twelve-thread Core i7-3930K this script took almost two seconds to run and gave results like this:

Just did 1000 iterations using expr math

real 0m1.883suser 0m0.096ssys 0m0.260s

This means that my machine can do only 530 iterations a second using this technique because each $(expr) statement is starting up a child process in order to decrement an integer. Linux process startup has a reputation for being fast, but this is a reminder that ‘fast’ is not the same as ‘free’. So far so good.

I tested to see whether expr was particularly slow by having a loop that invoked $(nop) a thousand times, where nop was just a C program that just returns immediately. The cost to invoke $(nop) was measured to be about 20% lower than the cost to use $(expr) to do some math, so not really enough faster to be interesting.

In other words, using the built-in decrement operation ran about 400 times faster than using $(expr)! Not a huge surprise – experts know this already – but still, nice.

Anomaly the first, CPU usage

A less OCD person might have called a 400 times speedup sufficient, but I’d noticed a few peculiarities during the investigation that I wanted to investigate. Noticing and investigating anomalies is, I believe, a critical skill for a developer, as anomalies often reveal mistaken assumptions, bugs, or ‘jankiness’.

The results from time had been confusing throughout the investigation. They consistently showed that only about 15-25% of the elapsed time in the $(expr) loops was CPU time. This then caused me to not use CPU profilers during the investigation but instead think about what could be causing stalls. Network delays? Disk I/O? Nothing made sense.

I eventually ran iostat 1 in another window and noticed that it said that CPU usage during my test runs was about 8.5%. On my twelve hardware thread (six cores, hyperthreaded) system this corresponds to one CPU core. In other words, total CPU time was equal to elapsed time, the script was CPU bound, and time was lying to me.

I then used a stopwatch to check that time’s elapsed time calculations were correct. Finally I used multiple other tools to confirm that my ExprCount() shell script, running on an otherwise idle machine, was indeed CPU bound:

Running the test script under perf stat showed 0.945 CPUs utilized

Running mpstat -P ALL 1 in another terminal showed one core worth of CPU time being consumed

Context switch graphing (see screenshot in the next section) showed that the task was CPU bound

It also made sense that the loop would be CPU bound. After the first invocation of $(expr) everything is in the disk cache so there should be no I/O latency. All of process creation, reading from the disk cache, process execution, and process teardown is done by the CPU, so the overall task should be entirely CPU bound. And it is.

So, be aware that time always underestimates CPU time, and that this underestimation is roughly proportional to how many sub processes you create per second. Presumably the process creation time is being attributed neither to the parent nor the child process, even though it really should be, as sys time.

This anomaly occurs both with the bash time-builtin and with /usr/bin/time. I assume that the problem is ultimately a limitation of what the Linux kernel reports back to time, so the proper fix here is probably to acknowledge the limitation in the man page. I’m discussing this with the man-page maintainers.

Since time is mostly just a wrapper around the wait3() call it makes sense that both time versions give the same results – they are just reporting what the kernel tells them.

To be crystal clear, my complaint/concern is that when time is used to measure my ExprCount() loop it makes the loop seem like it is not CPU bound, when it is. This can lead a performance investigation astray. I believe that this should be fixed, or documented.

Anomaly the second, CPU frequency

I was about ready to declare the case closed when I noticed another anomaly. The ExprCount() time was usually extremely stable, but occasionally it would run three times faster – usually it was 400 times slower than BashCount() but sometimes it was only ~133 times slower. Now that is an anomaly worth investigating.

I quickly discovered that if a busy loop CPU hog was spinning away in another process then the ExprCount() loop would run three times faster. The loop that used bash math was unaffected. I immediately suspected power management, and I eventually verified this by running the test scripts under perf stat. Running the tests under perf stat was the most rewarding because it clearly shows the CPU frequency, and it showed that it was far lower when running the ExprCount() loop:

The perf stat results also hint at why the CPU frequency is lower on the ExprCount() tests. Perf stat shows that the ExprCount() loop does thousands of CPU migrations compared to zero in the BashCount() loop.

Once I started to suspect lots of CPU migrations it was interesting to see what other tools might be useful. Running mpstat -P ALL 1 while running the two different count loops showed dramatic results. When the counting was done with $(expr) then the CPU usage was distributed relatively evenly across all of my CPU cores. However when the counting was done with bash internals then CPU usage was localized entirely to one core.

As a final bit of investigative reporting (I enjoy this sort of stuff way too much) I used the Zoom profiler to record context switches while running the $(expr) loop. The results, color coded by process and put on per-CPU racetracks, are shown below:

The bash process is in red and we can see that it spends very little time running. The other colors are various invocations of expr which bash is usually waiting for. The fascinating thing is how often the processes migrate between CPUs. The bash process alternates between CPUs 2 and 5, and each expr process starts on one thread and is then forcibly migrated to another.

We can see that the overall task is CPU bound. The bash process runs for a fraction of a millisecond, then expr launches, then expr moves to another core, then expr terminates and the cycle repeats.

After discussions with some kernel developers it appears that new processes are placed on the least-busy core, however that doesn’t explain why bash keeps moving or why each expr instance moves once.

The net result is that none of the CPUs are fully loaded which, apparently, means that the power management algorithms don’t realize that a CPU bound task is running. It’s kind of crazy awesome, and it is an example of when a multi-core system actually runs (a lot) more slowly than a single-core system.

Setting thread affinity is usually a bad idea because it reduces the kernel’s scheduling flexibility and may cause unexpected interactions with other processes. However, in the interests of science I knew I had to run my test script under taskset 1 to see what happens when it is forced to a single CPU. Doing this caused the average CPU clock speed in ExprCount() to triple, and performance increased identically.

Please don’t work around this by setting your power management settings to High Performance. If every Linux user did that then the cumulative extra power draw would be significant and would make me regret writing this article. The behavior is an anomaly. It probably should be fixed, but in most situations I’m sure that the power management algorithms work fine. Just don’t go create hundreds of child processes per second.

I tested this on Ubuntu 12.04.1 (32-bit) and Ubuntu 12.04.2 (64-bit). The timings were slightly different, but the conclusions were fundamentally the same.

Measuring Linux performance using Windows

While trying to decide which tools to trust regarding CPU consumption I decided to let Windows be the tie breaker. I have a Linux VM on my laptop so I ran my tests in the VM and then monitored the CPU consumption of the VM with Task Manager. It correctly showed that the VM, and therefore the task running inside it, was CPU bound. Which made me laugh.

Measuring Windows Performance

In the interests of science I made some similar measurements on Windows. First, a simple batch file that counts down using internal commands.

@echo off
echo Starting for loop at %time%
for /L %%i in (400000,-1,0) do (
rem empty statement
)
echo Finishing at %time%

This ran in about a second, so 400,000 iterations per second.

Then I added a call to nop.exe to the loop to measure minimal process creation time (and reduced the iteration count to 1,000). Here’s the new batchfile:

@echo off
echo Starting for loop at %time%
for /L %%i in (1000,-1,0) do (
nop.exe
)
echo Finishing at %time%

This showed that a Windows batch file can launch about 133 processes per second. A lot slower than Linux.

Share this:

Like this:

LikeLoading...

Related

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs.
I also unicycle. And play (ice) hockey. And juggle.

If I recall, the C-style for loop you are using is a bashism and it will not work on a POSIX-compliant shell like sh (or dash, which is stricter than bash). Ubuntu symlinks sh to dash because dash is faster than bash at executing init scripts. Some people symlink sh to busybox for faster startups

You can get a faster loop by using seq instead of using bash’s math routines. On my machine, “for i in $(seq $1 -1 1)” runs 25% faster than your “for (( i = $1 ; i > 0 ; i– ))”. It should also be more portable and work with dash, which should be even faster than bash.

Using $(seq) would only be faster because my test does a single loop that runs for a very long time. In our actual test script we have lots of loops that execute fewer times — 20 or so. The sub-process overhead of $(seq) would then be enough to be a problem.

That leaves me still unsure as to what options I have for a fast and portable loop. I’ll experiment with dash.

If the bound is shared between multiple loops you could generate the list once and store it in a variable, so you don’t need fork a new process every time. You could also generate the list in the shell (or hardcode it) if you want to avoid depending on seq at all, since seq isn’t standard.

And the only option for a portable for loop is “for some_variable in some_list”. Everything else is a non-standard extension.

Good to know. In the original script the range was slightly different each time so it appears that there is no option for a portable and efficient loop. We’ve chosen efficient for now. Eventually we’ll probably recode it in C++.

In my testing, all methods (bash math, seq, brace expansion), except the “expr” one, spike my CPU (i7-3520M) to max clock (and 100% util on one core). The disparity you see with your loop using expr is most likely caused by the CFS scheduler needing to balance lots and lots of new processes in a very short time, trying to balance the run queue between cores, and since no single core ever reaches a utilization threshold the CPU is kept scaled down.

The disparity you see when using ‘time’ (I can only reproduce it using the ‘expr’ tests btw.) is probably down to the bash process waiting for the ‘expr’ command to complete, time spent waiting like this does not count towards either user time or system time (just like I/O wait)

Your results match mine. The scheduler puts the child processes on idle cores (and when it returns to bash puts bash on an idle core!) which means that no core is heavily utilized which means that the CPU frequency is kept down. It’s a bad interaction between the scheduler and the frequency adjustment that happens anytime many sub-processes are used.

Your analysis of the ‘time’ results is not quite correct. You are correct that it only happens when using sub processes and that the time which bash spends waiting is (correctly) not counted as CPU time. However, the entire time that ‘bash’ is waiting the CPU is working very hard to create and run the expr process. It is this process creation time that is not accounted for. Since child-process execution time is counted (I tested that) it feels like a bug to not count child-process *creation* time. Showing a task as 80% idle when the CPU is working to perform that task 100% of the time is dangerously misleading.

The behaviour of time in these cases is expected, time accounts wall clock (correct in these cases), and cpu time spent in userland and kernel _for the processes you are monitoring_. Time spent in the scheduler is not accounted for (since you can never be sure it was spent for your processes). If you would try to read large amounts of data from something like /dev/urandom, or perform lots of system calls you will see system time being accounted for your process.

“Expected” by whom? I can find it documented nowhere that process creation time is not counted.

I also don’t think it’s correct to describe the time as being spent in the ‘scheduler’. The time is being spent in process creation. Data from ‘perf’ (graphed by zoom) attributes the time to expr and I’m sure that wait3() (used by time) could attribute it similarly — the ownership of the time seems clear and unambiguous.

I haven’t actually looked at the kernel source to see what it would take to attribute the process creation time to the process being created, but it seems like it should be manageable.

There is a tiny fraction of time (<1%) in the migration process. It would be reasonable and expected for that to not be counted by 'time'.

As for the “expected”, no single version of time I have ever used (be it shell built-in or /usr/bin/time) be it on linux or one of the other *NIX-es has ever reported on time spent by the scheduler and/or process creation time.

If you want to dig deeper I would suggest attaching a SystemTap script to process creation and context switching, to find out _exactly_ where your time is going, and why time(1) is not reporting on it.

Oh, and limiting the expr test to only run on one core gets rid of the cpu scaling problem you experienced. (you still have the time disparity, but again, that’s probably due to the loop waiting for the expr command to exit):

Not to diminish it, but I think the underlying problem is that using expressions at all in bash is nonidiomatic. I think I’d have written the loop as
for i in $(seq 30 -1 0); do …
but glancing at a page on bash scripting it appears that bash now has built-in support for ranges that don’t rely on an external binary (seq):
for i in {30..0}; do

TempleOS will blow you out of the water outputting counting numbers to the command-line–even C linux code –, but it’s a dumb benchmark. Also, I can change tasks 3,000,000 times in one second per CPU core.

Since my machine has twelve hardware threads and the load is evenly distributed I’d have to set that threshold to ~8 (default is 95) to get the CPU to ramp up on demand, and that is too low for normal usage.

The only fix would be to get the scheduler to not spread the workload out so much. If it only used one or two hardware threads then the frequency management code would have a fighting chance.

Having let it percolate for a few months, what ended up happening on this topic? One thought that came to mind as I was reading is I wonder how a small embedded interpreter like Lua or Scheme would perform? And in the interest of joining the crowd, here are my own numbers from my 3770k:
wyatt@Yue ~/tmp $ ./sloloop ExprTest
Just did 1000 iterations using expr math

real 0m0.520s
user 0m0.257s
sys 0m0.262s
#That’s at 1.6-3.5GHz. Still slow, but I’m a bit curious why the same code is four times faster than on your system.
wyatt@Yue ~/tmp $ ./sloloop BashTest
Just did 150000 iterations using bash math

real 0m0.770s
user 0m0.733s
sys 0m0.036s
#No surprises here, and a POSIX-compliant while loop with static busybox is only maybe a smidgen faster from what I’m seeing. Using a bash range is not quite three times faster (~280ms for 150k)

Very little came of this, unfortunately. Updates on some of the points:

2) The Linux kernel’s power management is still confused by lots of short-lived processes. This is a tricky problem and I think Windows has the same flaw. This does make some real tasks run ~3x slower.

3) time still gives inaccurate information about CPU time consumed, with no documentation or disclaimers of this error. There was much discussion about this but I think that ultimately the maintainers didn’t believe that the task was CPU bound, despite all the evidence. That’s a pity. I hate tools that lie.

Lua or Scheme would probably be faster — anything that doesn’t involve crazy amounts of process creation will avoid most of the problems.

We reduced our use of child processes significantly so the slowdown is now acceptable, so I moved on.

I would seriously recommend always running any desktop machine with “performance” scheduler. As long as you don’t disable C1-C6 sleep states for the processor, an average desktop processor core will be halted anyway for most of the time. The idea is to keep the processor halted (somewhere in C1-C6 state) as much as possible and only run the CPU for short periods of time. See `sudo powertop` for statistics…

I just wish that Linux could detect that a serially dependent chain of CPU bound processes was running and treat that like any CPU bound task, by ramping up the CPU frequency. But, it is a hard problem and I think Windows has the same limitation, just less frequently hit.