windows.h is for QueryPerformanceCounter used only in main(); compile with gcc -Ofast

It iterates/counts all solutions + prints first of them (as 01 board) in 0.23ms on my i7 (if you redirect output to file). You can also run it for N queens changing #define Qs to something bigger.
The code as you see is silly (not needed conditional in main loop among other things) and not optimized, still it's pretty fast :) I am not sure what hardware they used but it takes 10x less time on my computer vs theirs.

Hi bluecalm! I tweaked both your code and mine to uselessly solve the problem 10000 times, and not pretty print anything, so the difference in raw 8-queens-solving speed is readily apparent. Here are the results:

Hi,
I am happy to see that asm magic gains over naive code :)
Feel free to host/use the code in any way you like!

It will be nice addition to see how hand tuned code beats naive C implementation.
(I hope your code iterates over all solutions as well, the easiest way to change mine to stop after 1st is a goto from solve to main :))

EDIT:
Yeah, I meant if it counts all of them and I've just noticed 92 exit code :)
I didn't expect that big a speed-up, impressive!

EDIT2:
to answer children post:
That's great, please change -O3 to -Ofast though if you are using new GCC. It shouldn't matter much but it sometimes does; I believe -march=native won't help with generated assembly though
I seems that your code from original article spent all its time in pretty printing. I am happy to see some real comparison now; I am for sure learning a lot here once I understand everything in the article.

It counts up every solution, thus the "92" exit code. However, the non-pretty-printing version does take some shortcuts - it can't reconstruct the entire current board state in O(1) time from its actual register state. So I'm not sure exactly what you mean by "iterate over all solutions". (It certainly doesn't bail out when it finds the first solution, if that's your question.)

Of course this doesn't matter if the point is to compare the best asm dfs to the best C dfs of the same space, but if the point is to make the best asm 8queens solver, you can get down below 6μs like this :)

Unfortunately this sort of technique doesn't really apply to the asm >_>

Edit again: I tried to go a bit further by replacing the board struct with your method of passing 3 bytes between levels of the dfs, but at that point the compiler was able to optimize out the whole program.

Removing the templates causes the compiler to actually emit a dfs, which by now is faster than the asm (!). Removing the 2x speedup from just not looking at half of the tree slows us back down to ~1.1x the runtime of the asm:

One last time now: I tried getting rid of the x arg and using a global. This was much slower, so I tried getting rid of the SOLUTIONS global and using a return value. This caused the compiler to optimize out the whole program again :(

My immediate question was whether another compiler might be more competitive with the assembly. On my Sandy Bridge processor, I found that although GCC was the slowest, the assembly was still the clear winner:

I was surprised by Clang here. It's an older version than the other two, yet fastest. A quick glance at 'perf stat' (which I used for the timing) says that although it's executing fewer instructions per cycle, it's managing to use fewer instructions than the other two.

Although it doesn't seem to make much of a difference here, those are odd flags for Sandy Bridge. Core2 is a previous generation, and it supports AVX which came out after SSE4.2.
If you did to specify SB, you'd want the unwieldy "-march=corei7-avx". But probably better just to use '-mavx' or 'march=native', along with -Ofast or -O3.

I couldn't resist the challenge, so I wrote a non-recursive solution [0] in C++. It takes about 0.73s on an E5-4650 to loop 10000 times. Since writing it I've tried testing it against bluecalm's solution and how they compare seems to be fairly sensitive to the particular compiler and hardware combination. But it's at least a slightly different take.