[edit#2] If anyone from VMware can hit me up with a copy of VMware Fusion, I'd be more than happy to repeat this as a VirtualBox vs. VMware comparison. Somehow I suspect the VMware hypervisor will be better tuned for hyperthreading (see my answer below too).

I'm seeing something curious. As I increase the number of cores on my Windows 7 x64 virtual machine, the overall compile time increases instead of decreasing. Compiling is usually very well suited to parallel processing: once the middle part (dependency mapping) is done, you can simply spawn a compiler instance for each of your .c/.cpp/.cs/whatever files to build partial objects for the linker to take over. So I would have imagined that compiling would scale very well with the number of cores.
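
For concreteness, this is the model I have in mind (a minimal sketch of "one compiler process per file"; the file list and the cl.exe/link.exe invocations are hypothetical placeholders, not my actual build):

    // Sketch only: compile each translation unit independently, then link.
    // Assumes cl.exe/link.exe are on PATH; file names are made up.
    using System.Diagnostics;
    using System.Threading.Tasks;

    class ParallelCompile
    {
        static void Main()
        {
            string[] sources = { "a.cpp", "b.cpp", "c.cpp" };

            // One compiler instance per file; each produces a partial object.
            Parallel.ForEach(sources, src =>
            {
                Process p = Process.Start("cl.exe", "/c /nologo " + src);
                p.WaitForExit();
            });

            // A single link step then combines the objects.
            Process link = Process.Start("link.exe", "/nologo a.obj b.obj c.obj");
            link.WaitForExit();
        }
    }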

But what I'm seeing is:

8 cores: 1.89 sec
4 cores: 1.33 sec
2 cores: 1.24 sec
1 core:  1.15 sec

Is this simply a design artifact of a particular vendor's hypervisor implementation (a type 2 hypervisor, VirtualBox, in my case), or is it something more pervasive across VMs that keeps hypervisor implementations simpler? With so many factors in play, I can make arguments both for and against this behavior, so if someone knows more about this than I do, I'd be curious to read your answer.

Thanks,
Sid

[edit: addressing comments]

@MartinBeckett: Cold compiles were discarded.

@MonsterTruck: I couldn't find an open-source project to compile directly. That would be great, but I can't screw up my dev environment right now.

@Mr Lister, @philosodad: I have 8 hardware threads and am using VirtualBox, so it should be a 1:1 mapping without emulation.

@Thorbjorn: I have 6.5 GB allocated to the VM and a smallish VS2012 project; it's quite unlikely that I'm swapping in/out and thrashing the page file.

@All: If someone can point to an open-source VS2010/VS2012 project, that might be a better community reference than my (proprietary) VS2012 project. Orchard and DNN seem to need environment tweaking to compile in VS2012. I'd really like to see whether someone with VMware Fusion sees the same thing (for a VMware vs. VirtualBox comparison).

Probably the file I/O is slowing it down: with multiple tasks, the disk access all goes to the virtualized drive.
– Martin Beckett, Aug 11 '12 at 4:32


I'd like to reproduce this on my own machine. Can you please upload a sample project somewhere? I suspect the virtual machine is playing tricks here. Try booting into Windows natively (Boot Camp) and see if you observe the same behaviour -- I doubt you will.
– Apoorv Khurasia, Aug 11 '12 at 5:01


What are we compiling here? A lot of the time, the overhead of parallelizing a task doesn't pay off until you hit a certain scale. See how compiling Apache or RavenDB does.
– Wyatt Barnett, Aug 11 '12 at 11:00


You probably run out of memory in your virtual machine, so it starts swapping.
– user1249, Aug 11 '12 at 15:52


The same thing has happened to me before with Java, using Maven 3.x to compile on an i3. Letting it default to 4 threads was much slower, nearly 50% slower, than telling it explicitly to use only 2 cores. I think it has something to do with the hyper-threading context switching and overlapping I/O.
– Jarrod Roberson, Aug 11 '12 at 18:49

3 Answers

Answer: It doesn't slow down; it does scale up with the number of CPU cores. The project used in the original question was 'too small' (it's actually a ton of development work, but small/optimized from a compiler's point of view) to reap the benefits of multiple cores. It seems that at this small scale, instead of planning how to spread the work, spawning multiple compiler processes, etc., it's best to hammer at the work serially right off the bat.

This is based on a new experiment I ran, prompted by the comments on the question (and my personal curiosity). I used a larger VS project: Umbraco CMS's source code, since it's large, open source, and you can directly load the solution file and rebuild (hint: load umbraco_675b272bb0a3\src\umbraco.sln in VS2010/VS2012).
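
If you want to repeat the timing from the command line rather than the IDE, a rough sketch is below. One caveat: /m varies MSBuild's own build parallelism inside a fixed VM, whereas my experiment varied the VM's core count in the VirtualBox settings, so treat this as a convenience for repeated timing, not an exact reproduction:

    // Rough timing harness (assumes msbuild.exe is on PATH and the
    // Umbraco solution is in the working directory).
    using System;
    using System.Diagnostics;

    class BuildTimer
    {
        static void Main()
        {
            foreach (int workers in new[] { 1, 2, 4, 8 })
            {
                Stopwatch sw = Stopwatch.StartNew();
                Process p = Process.Start("msbuild",
                    string.Format("umbraco.sln /t:Rebuild /m:{0} /v:q", workers));
                p.WaitForExit();
                Console.WriteLine("{0} worker(s): {1:F2} sec",
                    workers, sw.Elapsed.TotalSeconds);
            }
        }
    }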

NOW, what I see is what I expect, i.e. compiles do scale up!! Well, up to a certain point, since I find:

Takeaways:

A new VM core results in a new OS X Thread within the VirtualBox process

Compile times improve as expected (the compiles are long enough for parallelism to pay off)

At 8 VM cores, core emulation might be kicking in within VirtualBox as the penalty is massive (50% hit)

The above is likely because OS X is unable to present 4 hyper-threaded cores (8 hardware threads) as 8 full cores to VirtualBox
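
As a quick sanity check on that last point, you can ask the guest what it thinks it has. A trivial check to run inside the VM (it only reports the logical processor count the guest OS sees; it says nothing about how those processors are backed on the host):

    // Run inside the guest: reports how many logical processors
    // the guest OS believes it has.
    using System;

    class CpuCheck
    {
        static void Main()
        {
            Console.WriteLine("Logical processors visible to guest: {0}",
                Environment.ProcessorCount);
        }
    }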

The hyper-threading point above led me to monitor the CPU history across all the cores via Activity Monitor (CPU history), and what I found was:

Takeaways:

At one VM core, the activity seems to hop across the 4 HW cores. That makes sense: it distributes heat evenly across the cores.

Even at 4 virtual cores (and 27 VirtualBox OS X threads, or roughly 800 OS X threads overall), only the even HW threads (0, 2, 4, 6) are close to saturated, while the odd HW threads (1, 3, 5, 7) sit near 0%. More likely the scheduler works in terms of HW cores and NOT HW threads, so I speculate that perhaps the OS X 64-bit kernel/scheduler isn't optimized for hyper-threaded CPUs. Or, looking at the 8-VM-core setup, perhaps it only starts using the odd HW threads at high CPU utilization? Something funny is going on ... well, that's a separate question for some Darwin developers ...

[edit]: I'd love to try the same in VMware Fusion. Chances are it won't be this bad. I wonder whether they showcase this as a commercial advantage ...

Footer:

In case the images ever disappear, the compile time table is (text, ugly!)

I suspect the drop between 4 and 8 is a combination of the VM not being optimised for HT, and HT not in any way being equal to twice as many cores (at best a 30% performance increase, usually far less).
– Daniel B, Aug 13 '12 at 6:25

@DanielB: At 4 => 8 cores, the issue isn't just that it's a mere +30% boost (vs. +100%) as you suggest; it's that the performance is actually -50%. If the hardware threads were totally 'dead/useless' and the work were being diverted to the other cores, the performance delta would be 0. So I'd be more inclined to blame the design of the VirtualBox type 2 hypervisor. I wonder how VMware Fusion fares ...
– DeepSpace101, Aug 13 '12 at 8:33

"At one VM core, the activity seems to be hopping across the 4 HW cores. Makes sense, to distribute heat evenly at core levels" - not necessarily, it is usually better to re-schedule on the same core (for cache etc) but the hypervisor is just picking one at randon, or the least-used core because it thinks its a general-purpose processing where other processes are using those cores. In this case, the scheduler optimisation works against you (but in a very minor way)
–
gbjbaanbAug 13 '12 at 9:40

@Sid agreed, I'm just pointing out that with HT you're going to get (greatly) diminishing returns a lot sooner than you'd think, if you assumed it's actually anything like a 100% improvement. In this case, it could easily be contention for your HD that's causing this, hence my earlier suggestion for some artificial CPU benchmarks.
– Daniel B, Aug 13 '12 at 12:26

There is only one possible reason for this to be happening, which is that your overhead is exceeding your gains.

You may be emulating the multiple cores rather than being assigned actual cores, or even dedicated processes or threads, from the host machine. That seems pretty likely to me, and it would obviously give you negative speedup.

The other possibility is that the process itself doesn't parallelize well, and even attempting to parallelize it is costing you more in communication overhead than you're gaining.
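
To illustrate that second possibility, here is a minimal sketch (my own toy example, not taken from the build in question) where the per-item cost of coordinating threads exceeds the work itself, so the parallel version loses to the serial one:

    // Toy demonstration: each unit of work is so cheap that scheduling
    // and synchronization overhead dominate, making Parallel.For slower.
    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    class OverheadDemo
    {
        static void Main()
        {
            const int n = 10000000;
            long sum = 0;

            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < n; i++) sum += i;   // trivial serial loop
            Console.WriteLine("serial:   {0} ms", sw.ElapsedMilliseconds);

            sum = 0;
            sw.Restart();
            object gate = new object();
            Parallel.For(0, n, i =>
            {
                lock (gate) { sum += i; }           // contention swamps the work
            });
            Console.WriteLine("parallel: {0} ms", sw.ElapsedMilliseconds);
        }
    }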

"your overhead is exceeding your gains": true, but that pretty much covers everything without pinning down what is really causing it :) ... I'm using VirtualBox and have the physical cores, so I assumed the mapping would be 1:1 without emulation. I'm going to search for a LARGE open-source VS2012 project so others can reference it too ... brb
– DeepSpace101, Aug 12 '12 at 17:00

@Sid according to this answer, superuser.com/a/297727, the VirtualBox VM should use the host cores appropriately. But I'd still check what is happening on the host, to make sure the expected behavior is occurring.
– philosodad, Aug 13 '12 at 1:03