Now, did this achieve the speedup we'd like? Since there is virtually no synchronization between threads to cause overhead, one might have expected two threads to cut the time in half, so why didn't they?

The answer is that the CPU is badly underutilized: the algorithm is actually I/O bound, and the CPU spends most of its time waiting for data.

Now, which is better/more scalable: taking up one CPU for 0.37 s, or two CPUs for 0.31 s?

If this is the only process, then admittedly, yes, 0.31 s is better. However, if any other CPU-bound processes are waiting, that 0.06 s gain starts to look pretty expensive, since I could have been running a CPU-bound process simultaneously for the full 0.37 s instead of tacking it on after the 0.31 s.
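To put numbers on that trade-off, here's a quick back-of-the-envelope calculation in Python using the timings above, comparing wall-clock time against total CPU-seconds consumed:

```python
# Measured timings from the discussion above.
serial_wall = 0.37      # one thread: one CPU busy for 0.37 s
parallel_wall = 0.31    # two threads: two CPUs busy for 0.31 s

# Wall-clock speedup of the parallel version.
speedup = serial_wall / parallel_wall            # ~1.19x, far from 2x

# Total CPU time taken away from other processes.
serial_cpu_seconds = 1 * serial_wall             # 0.37 CPU-seconds
parallel_cpu_seconds = 2 * parallel_wall         # 0.62 CPU-seconds
extra_cpu_cost = parallel_cpu_seconds - serial_cpu_seconds  # ~0.25 CPU-seconds

print(f"speedup: {speedup:.2f}x")
print(f"extra CPU-seconds consumed: {extra_cpu_cost:.2f}")
```

So the parallel version saves 0.06 s of wall-clock time but consumes roughly 0.25 extra CPU-seconds that another CPU-bound process could have used.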

I was going to argue that CPUs don't have to sit idle while a process blocks on I/O (the scheduler just switches to the next runnable process), but then I had a look at your code and noticed that it is essentially about accessing lots of RAM, in which case it's the CPU core itself that has no choice but to stall, since MOVs are blocking by design.

Interesting problem, actually. I think once one thread starts to hog the memory bus like this, we're doomed anyway: all the other threads on all the other cores will fight for access to the memory bus and get suboptimal performance. This problem won't be fully solved until the memory bus becomes faster than all the CPU cores together, which itself probably won't happen until we reach the limits of CPU speed. But this is open to discussion.

Keep in mind we haven't even introduced the need for multithreaded synchronization, which is extremely expensive.
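To illustrate what that expense looks like in code, here is a sketch in Python (function names and sizes are mine; Python's GIL means this won't show true parallelism, but the per-step cost of lock acquisitions is real either way). The first version synchronizes on every increment; the second accumulates locally and synchronizes once per thread:

```python
import threading

def fine_grained(n):
    """Increment a shared counter under a lock, one acquisition per step."""
    lock = threading.Lock()
    total = 0

    def worker(count):
        nonlocal total
        for _ in range(count):
            with lock:            # one lock round-trip per increment
                total += 1

    threads = [threading.Thread(target=worker, args=(n // 2,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def coarse_grained(n):
    """Accumulate locally, synchronize only once per thread at the end."""
    lock = threading.Lock()
    total = 0

    def worker(count):
        nonlocal total
        local = 0
        for _ in range(count):
            local += 1            # no synchronization in the hot loop
        with lock:                # a single lock round-trip per thread
            total += local

    threads = [threading.Thread(target=worker, args=(n // 2,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Both produce the same answer, but timing them shows the fine-grained version paying for a lock round-trip on every single step of the work.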

As the blog post mentioned earlier says, if you need so much synchronization that your algorithm performs no better in parallel than serially, then it's indeed better to do it the async way and skip the coding complications of synchronization. What I have in mind are tasks where most of the work can be done in parallel, and synchronization/communication is only needed at a few specific points of the execution.

For these reasons, it's typically better to exploit parallelism at the macro level instead of parallelizing individual steps. This is usually easier to do and requires much less synchronization.

And there comes a trade-off between fine-grained modularity and macro-level parallelism. I think this one can be solved by not pushing function isolation through a process too far. Taking the FTIR touchscreen driver example above: putting a boundary between "large" tasks like decoding pictures and detecting blobs is fine, but putting a boundary between two FFTs inside the JPEG decoding process is not. Where exactly the limit lies can only be decided through measurements on the real-world system, and it will probably shift as CPUs themselves evolve.
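A minimal sketch of that macro-level split, in Python. The two stage functions are hypothetical placeholders for the real decoding and blob-detection work; the point is that each "large" task gets its own thread, and the only synchronization is the queue handoff between them:

```python
import threading
import queue

def decode_picture(raw):
    """Placeholder for the 'large' JPEG-decoding task (hypothetical)."""
    return f"frame({raw})"

def detect_blobs(frame):
    """Placeholder for the blob-detection task (hypothetical)."""
    return f"blobs({frame})"

def run_pipeline(raw_frames):
    """Run the two large tasks as a pipeline; synchronization happens
    only at the queue handoff between stages, not inside the stages."""
    frames = queue.Queue()
    results = []

    def decoder():
        for raw in raw_frames:
            frames.put(decode_picture(raw))   # sync point: one put per frame
        frames.put(None)                      # sentinel: no more frames

    def detector():
        while True:
            frame = frames.get()
            if frame is None:
                break
            results.append(detect_blobs(frame))

    stages = [threading.Thread(target=decoder),
              threading.Thread(target=detector)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```

Parallelizing the two FFTs inside `decode_picture` instead would put a synchronization point in the middle of the hot path, which is exactly the boundary I'd argue against.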

It also lends credence to my cherished view that async I/O interfaces can usually perform better than blocking threaded ones.

Again, if it's pure I/O and (almost) no computation, I admit that it's totally possible, and even common.
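For the pure-I/O case, a minimal sketch of the async approach in Python, with `asyncio.sleep` standing in for a real network or disk wait (all wait, no computation). A single thread overlaps all the waits:

```python
import asyncio
import time

async def fetch(i):
    """Stand-in for a pure-I/O request: all waiting, no computation."""
    await asyncio.sleep(0.1)   # the event loop runs other tasks meanwhile
    return i

async def main():
    start = time.perf_counter()
    # Ten requests issued concurrently on one thread.
    results = await asyncio.gather(*(fetch(i) for i in range(10)))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
# Ten 0.1 s waits overlap, so the total is close to 0.1 s, not 1 s.
print(f"{len(results)} requests in {elapsed:.2f}s")
```

No threads, no locks, and the same overlap of waiting you'd get from ten blocking threads; the moment real computation enters the picture, though, the earlier trade-offs come back.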