Stuff

I received an email recently from a member of the Microsoft Visual C++ compiler team who is working on the AMD64 compiler, regarding my comments about intrinsics support in the VC++ compiler. Given my past feedback on this blog and in the MSDN Product Feedback Center on the quality of the intrinsics in VC++, there were two possibilities:

I had mortally offended the Visual C++ compiler team and had received a notice to appear in Redmond for a formal challenge to the death; or

They wanted to inform me of significant improvements made to the compiler in the Visual Studio .NET 2005 "Whidbey" public beta.

Fortunately, the team member turned out to be a nice guy and informed me that intrinsics support had indeed been improved in Whidbey.

To review, compiler intrinsics are pseudo-functions that expose CPU functionality that doesn't fit well into C/C++ constructs. Simple operations like add and subtract map nicely to + and -, but four-way-packed-multiply-signed-and-add-pairs doesn't. Instead, the compiler exposes a __m64 type and _m_pmaddwd() pseudo-function that you can use. In theory, you get the power and speed of specialized CPU primitives, with some of the portability benefits of using C/C++ over straight assembly language. The problem in the past was that Visual Studio .NET 2003 and earlier generated poor code for these primitives that was either incorrect or slower than what could be written straight in assembly language with moderate effort.
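To make "four-way-packed-multiply-signed-and-add-pairs" concrete, here is a minimal scalar sketch of what the operation computes. It is a reference model only, not the intrinsic itself; the names Packed4x16, Packed2x32, pmaddwd_reference, and pmaddwd_reference_demo are hypothetical and exist just for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for a 64-bit MMX register viewed as lanes.
struct Packed4x16 { std::int16_t lane[4]; };  // four signed 16-bit values
struct Packed2x32 { std::int32_t lane[2]; };  // two signed 32-bit values

// Scalar model: multiply corresponding signed 16-bit lanes into 32-bit
// products, then add each adjacent pair of products.
Packed2x32 pmaddwd_reference(const Packed4x16& a, const Packed4x16& b) {
    Packed2x32 r;
    for (int i = 0; i < 2; ++i) {
        const std::int32_t lo = std::int32_t(a.lane[2 * i]) * b.lane[2 * i];
        const std::int32_t hi = std::int32_t(a.lane[2 * i + 1]) * b.lane[2 * i + 1];
        // Sum in unsigned arithmetic so the one overflowing case
        // (all four inputs equal to -32768) wraps instead of being undefined.
        r.lane[i] = static_cast<std::int32_t>(
            static_cast<std::uint32_t>(lo) + static_cast<std::uint32_t>(hi));
    }
    return r;
}

// Small self-check: 1*5 + 2*6 = 17 and (-3)*7 + 4*(-8) = -53.
void pmaddwd_reference_demo() {
    const Packed4x16 a = {{1, 2, -3, 4}};
    const Packed4x16 b = {{5, 6, 7, -8}};
    const Packed2x32 r = pmaddwd_reference(a, b);
    assert(r.lane[0] == 17);
    assert(r.lane[1] == -53);
}
```

The loop does in eight scalar steps what the packed instruction does in one, which is why there is no natural C operator for it.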

The good news

Here's the routine using SSE2 intrinsics that I used to punish the compiler last time I wrote about this problem:

The reason for the discrepancy is that I cheated in the tests above by using the /Gr compiler switch to force the __fastcall calling convention. Part of the problem with the VC++ intrinsics is that they have a habit of forcing an aligned stack frame if any stack parameters are accessed in a function that uses intrinsics, even if no aligned parameters are necessary. This is unfortunate as it slows down the prolog/epilog and eats an additional register. Sadly, this is not fixed in Whidbey, although it is a moot point on AMD64 where the stack is always 16-byte aligned. Using the fastcall convention can fix this on x86 if all parameters can be passed in registers, but this isn't possible if you have more than 8 bytes of parameters.

The other bad news is that the MMX intrinsics still produce awful code, although this is only pertinent to x86 since the AMD64 compiler doesn't support MMX instructions, and at least the bugs with MMX code moving past floating-point or EMMS instructions have been fixed:

The aligned stack frame is a bummer for codelet libraries, but it isn't so big of a deal if you can isolate intrinsics code into big, long-duration functions and the function isn't under critical register pressure. The improvements to SSE2 intrinsics code generation make them more attractive in Whidbey, but AMD64 is not yet widespread and SSE2 is only supported on Pentium 4, Pentium M, and Athlon 64, which makes them unusable for mainstream code on x86. They're also rather difficult to read compared to assembly code. I still don't think I'd end up using them even after Whidbey ships, because it would make my x86 and AMD64 codebases diverge farther without much gain.

Another problem is that although all SSE2 instructions are available through intrinsics, and many non-vector intrinsics have been added in Whidbey, there are still a large number of tricks that can only be done directly in assembly language, many of which involve extended-precision arithmetic and the carry flag. The one that I use all the time is the split 32:32 fixed-point accumulator, where two 32-bit registers hold the integer and fractional parts of a value. This is very frequently required in scaling and interpolation routines. The advantage is that you can get to the integer portion very quickly. In x86:

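add ebx, esi                ; fraction += fractional step; overflow sets the carry flag
adc ecx, edi                ; integer += integer step + carry
mov eax, dword ptr [ecx*4]  ; integer part is immediately usable as an index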
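For comparison, a rough C++ sketch of the same 32:32 stepping is below, assuming the position is kept in a single 64-bit integer with the integer part in the high half. The function name resample_nearest and its parameters are made up for this example. On a 32-bit target a compiler will typically lower the 64-bit add to an add/adc pair, but nothing in the source code guarantees it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical nearest-neighbour resampler driven by a 32:32 fixed-point
// position. 'step' is the source increment per output sample: integer part
// in the high 32 bits, fraction in the low 32 bits.
std::vector<std::uint32_t> resample_nearest(const std::vector<std::uint32_t>& src,
                                            std::size_t dst_count,
                                            std::uint64_t step) {
    std::vector<std::uint32_t> dst;
    dst.reserve(dst_count);
    std::uint64_t pos = 0;
    for (std::size_t i = 0; i < dst_count; ++i) {
        // The integer part is just the high half, no shift-and-mask of a
        // mixed value is needed on a machine that keeps halves in registers.
        const std::uint32_t index = static_cast<std::uint32_t>(pos >> 32);
        assert(index < src.size());  // caller must keep the walk in range
        dst.push_back(src[index]);
        pos += step;  // carry from fraction into integer happens inside this add
    }
    return dst;
}
```

With a step of 0x80000000 (0.5), each source sample is emitted twice; with 0x180000000 (1.5), the walk visits source indices 0, 1, 3.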

In AMD64 you can sometimes get away with half the registers if you only need a 32-bit result, by swapping the low and high halves and wrapping the carry around:

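add rbx, rcx    ; low half = integer, high half = fraction; fraction overflow sets carry
adc rbx, 0      ; wrap the carry around into the integer half
mov [rdx], ebx  ; store the 32-bit integer part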

Compiler intrinsics don't let you do this.
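The arithmetic of the wraparound trick can at least be modelled in portable C++, as in the sketch below. It assumes the swapped layout described above (fraction in the high 32 bits, integer in the low 32 bits), and the helper names pack_swapped, integer_part, add_end_around, and end_around_demo are hypothetical. Whether a compiler turns the comparison into a single ADC is entirely up to its optimizer, so this only illustrates the math, not the code generation.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: fraction in the high 32 bits, integer in the low 32 bits.
inline std::uint64_t pack_swapped(std::uint32_t integer, std::uint32_t fraction) {
    return (static_cast<std::uint64_t>(fraction) << 32) | integer;
}

inline std::uint32_t integer_part(std::uint64_t acc) {
    return static_cast<std::uint32_t>(acc);  // low half, like reading ebx
}

// Models "add rbx, rcx" followed by "adc rbx, 0".
inline std::uint64_t add_end_around(std::uint64_t acc, std::uint64_t step) {
    const std::uint64_t sum = acc + step;               // may wrap past 2^64
    const std::uint64_t carry = (sum < acc) ? 1u : 0u;  // unsigned wrap means carry out
    return sum + carry;                                 // feed the carry into bit 0
}

// Small self-check: stepping by 1.5 gives integer parts 1, 3, 4.
void end_around_demo() {
    const std::uint64_t step = pack_swapped(1, 0x80000000u);
    std::uint64_t acc = pack_swapped(0, 0);
    acc = add_end_around(acc, step);
    assert(integer_part(acc) == 1);  // position 1.5
    acc = add_end_around(acc, step);
    assert(integer_part(acc) == 3);  // position 3.0
    acc = add_end_around(acc, step);
    assert(integer_part(acc) == 4);  // position 4.5
}
```

Note that an overflow of the integer half would leak one unit into the bottom of the fraction, which is part of why this only works when a 32-bit result is all you need.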

Another problem I run into often in routines that are MMX or SSE2 heavy is a critical shortage of general purpose registers, usually for scanline pointers, fixed-point accumulators, and counters. The way I get around this on x86 is to make use of the Structured Exception Handling (SEH) chain to temporarily hold the stack pointer, freeing it for use as an eighth general purpose register:

...and then be really careful not to cause an exception while within the block.

This allows a routine to use all eight registers and still be reentrant. It's probably unreasonable to expect a compiler to generate code like this, though.

7 comments | Apr 16, 2005 at 16:37 | default

Comments


Damn, I was hoping it'd be the formal challenge to the death option. Phaeron taking on the compiler team ala The Bride vs. the crazy 88s.

Why'd you say using intrinsics would make the x86 and AMD64 codebases diverge farther? Wouldn't it do the opposite, avoiding the need for separate AMD64 assembly code?

Andrew Dunstan - 21 04 05 - 18:10

The official word from Microsoft is that the x87/MMX register file is banned and that only SSE/SSE2 should be used in x64 code. This means that code that is optimally written in MMX or integer SSE must be rewritten into SSE2. The instructions are so similar that it is possible to use shared assembly code for both with some simple macros. The intrinsics, however, differ more significantly. The MMX intrinsics have both asm-like names and generalized names, but the SSE2 ones only have the generalized names, and those are rather unreadable. It is possible to wrap them in much cleaner operator overloads but my experience with VS2003 was that doing so was a magnet for intrinsics optimization bugs.

The original rumor was that Windows x64 wouldn't even save and restore the x87 register file, which made no sense because it'd be a security hole and would be saved anyway by FXSAVE/FXRSTOR; experiments show that the OS does save x87 and I saw an MS blog a while back that implied that it was OK to use it, just not recommended. There are some cases where the additional parallelism in SSE2 cannot be used and the additional execution loads generated by SSE2 are a waste, such as texture mapping. P4 is issue bottlenecked so this isn't much of an issue, but Athlon and P-M break SSE2 ops into two 64-bit ops and have smaller schedulers.

Phaeron - 23 04 05 - 01:27

I see. Got another question for you (I could probably go on all day, but I'll try not to): What do you propose is the best way to save the non-volatile xmm registers when they need to be used? It used to be only the low 64-bits that needed to be saved, but that seems to have changed to include the whole register. Using the stack seems to require a lot of extra work.

Andrew Dunstan - 23 04 05 - 12:47

I used to stack XMM registers together using MOVLHPS/MOVHLPS, but I think you have to store them on the stack. The reason is that register usage is much more important on AMD64 and the unwind is table-based. If you don't, floating-point values may be trashed after an exception unwind.

Setting up a prolog and epilog manually in AMD64 so you can stack registers is a bit of a pain but can be wrapped in macros. The prolog may be slightly suboptimal due to a lack of scheduling with surrounding code, but the epilog must have one of two specific forms anyway as it is parsed directly.

Phaeron - 23 04 05 - 16:33

The prolog and epilog stuff I understand; it's the function table entry stuff (PDATA and XDATA) that's a bit confusing.

Andrew Dunstan - 23 04 05 - 20:48

Use the % operator to do a remainder.

Yuhong Bao - 21 10 07 - 15:56

"This allows a routine to use all eight registers and still be reentrant. It's probably unreasonable to expect a compiler to generate code like this, though."
And on AMD64 it is not necessary anyways as it has 8 more registers.

Yuhong Bao - 04 07 08 - 02:44
