I'm involved in one of those challenges where you try to produce the smallest possible binary, so I'm building my program without the C or C++ run-time libraries (RTL). I don't link to the DLL version or the static version. I don't even #include the header files. I have this working fine.

Some RTL functions, like memset(), can be useful, so I tried adding my own implementation. It works fine in Debug builds (even for those places where the compiler generates an implicit call to memset()). But in Release builds, I get an error saying that I cannot define an intrinsic function. You see, in Release builds, intrinsic functions are enabled, and memset() is an intrinsic.

I would love to use the intrinsic for memset() in my release builds, since it's probably inlined and smaller and faster than my implementation. But I seem to be in a catch-22. If I don't define memset(), the linker complains that it's undefined. If I do define it, the compiler complains that I cannot define an intrinsic function.

Does anyone know the right combination of definition, declaration, #pragma, and compiler and linker flags to get an intrinsic function without pulling in RTL overhead?

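First, declare memset() yourself in a header file and enable the intrinsic with #pragma intrinsic(memset). That allows your code to call memset(). In most cases, the compiler will inline the intrinsic version.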
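A minimal sketch of what such a header might contain, assuming MSVC. The `Header` struct and the `ZeroHeader` helper are hypothetical names added for illustration, and the non-MSVC branch is only there so the snippet compiles with other compilers; it is not part of the technique.

```cpp
#include <cstddef>  // only for std::size_t; no library code is pulled in

#if defined(_MSC_VER)
// Declare memset ourselves instead of including the RTL's <string.h>.
extern "C" void * __cdecl memset(void *, int, std::size_t);
// MSVC-specific: ask the compiler to expand memset calls inline where it can.
#pragma intrinsic(memset)
#else
// Other compilers are outside the scope of this thread; use the standard
// declaration so the sketch still builds.
#include <string.h>
#endif

// Hypothetical caller: a small, constant-size clear, which is the kind of
// call the compiler is most likely to expand inline.
struct Header {
    unsigned char bytes[16];
};

inline void ZeroHeader(Header &h) {
    memset(&h, 0, sizeof h);
}
```

Whether a given call is actually inlined depends on the compiler version and optimization settings, so it is worth checking the generated code.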

Second, in a separate implementation file, provide an implementation. The trick to preventing the compiler from complaining about redefining an intrinsic function is to use another pragma first: #pragma function(memset), which tells the compiler to treat memset() as an ordinary function in that file.

This provides an implementation for those cases where the optimizer decides not to use the intrinsic version.
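A minimal sketch of such an implementation file, assuming MSVC. The macro names `TINY_MEMSET` and `TINY_CDECL`, the stand-in name `tiny_memset`, and the parameter names are all illustrative assumptions; the two casts are the ones discussed in the comments below.

```cpp
#include <cstddef>  // only for std::size_t

#if defined(_MSC_VER)
// Declare memset without the RTL headers, then switch it from "intrinsic" to
// "ordinary function" for this file so that defining it is allowed.
extern "C" void * __cdecl memset(void *, int, std::size_t);
#pragma function(memset)
#define TINY_MEMSET memset
#define TINY_CDECL __cdecl
#else
// Hypothetical stand-in name so the sketch also builds with other compilers.
#define TINY_MEMSET tiny_memset
#define TINY_CDECL
#endif

extern "C" void * TINY_CDECL TINY_MEMSET(void *pTarget, int value, std::size_t cbTarget) {
    // You cannot write through a pointer to void, so view the target as bytes.
    unsigned char *p = static_cast<unsigned char *>(pTarget);
    while (cbTarget-- > 0) {
        // Explicit narrowing avoids a warning about assigning int to unsigned char.
        *p++ = static_cast<unsigned char>(value);
    }
    return pTarget;
}
```

The plain byte loop favors size over speed, which matches the goal stated in the question.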

The outstanding drawback is that you have to disable whole-program optimization (/GL and /LTCG). I'm not sure why. If someone finds a way to do this without disabling global optimization, please chime in.

What are all those casts doing there? Also, pointer conversions to and from void * are normally static_cast-s, not reinterpret_cast-s.
– AnT Aug 30 '10 at 0:32

@AndreyT: I've changed the cast from void * to use a static_cast. At the time I originally wrote this, which cast to use in that situation was unclear and hotly debated. (stackoverflow.com/questions/310451/…) I'm not sure what you mean about "all" those casts. There are two. The first is necessary because you cannot write via a pointer to void (which is what memset takes). The second is so that the compiler doesn't warn about assigning an int to an unsigned char.
– Adrian McCarthy Jan 14 '12 at 0:27

I'm pretty sure there's a compiler flag that tells VC++ not to use intrinsics

The source to the runtime library is installed with the compiler. You do have the choice of excerpting functions you want/need, though often you'll have to modify them extensively (because they include features and/or dependencies you don't want/need).

There are other open source runtime libraries available as well, which might need less customization.

I got your new test code to compile and link. These are the relevant settings:

Enable Intrinsic Functions: No
Whole Program Optimization: No

It's that last one that suppresses "compiler helpers" like the built-in memset.

Edited to add:

Now that it's decoupled, you can copy the asm code from memset.asm into your program--it has one global reference, but you can remove that. It's big enough so that it's not inlined, though if you remove all the tricks it uses to gain speed you might be able to make it small enough for that.

But that's working against the ultimate goal of trying to make the smallest possible binary. In many cases, including memset, the inlined intrinsic function is smaller than the function call.
– Adrian McCarthy May 31 '10 at 19:36

The lib version is faster just because it aligns the target pointer to 4 bytes (on 32-bit machines, 8 bytes on 64-bit) and uses rep stosd instead of rep stosb, writing the unaligned bytes at the start and the end separately. Doing that would make memset even larger. Again (as I stated in the comments to my answer) I don't think your compiler is really generating the intrinsic. Egrunin's implementation is as small as you can get. In very specific cases maybe the intrinsic would be able to spare the pushes/pops, if ecx and edi are available. Would you have a net gain? Rarely, I guess.
– Fabio Ceconello May 31 '10 at 21:39

The code in egrunin's second edit is essentially the same as the code generated by the compiler when it uses the intrinsic. The compiler is often able to save a few bytes when it knows that it doesn't need to preserve ecx and edi. The library version pays off when the number of bytes to clear gets larger. There's overhead in dealing with the possibly unaligned beginning and end.
– Adrian McCarthy May 31 '10 at 22:41

Everything you wrote is true, but it didn't really address my question. That's probably my fault for not being clear enough in the question. Turning off optimizations is counter to keeping the program small (which is why I'm trying to omit the RTL in the first place) and fast (which is a secondary goal). There doesn't seem to be a need to insert assembly into my code, when it's virtually identical to what the compiler generates. Thanks for the input.
– Adrian McCarthy Jun 1 '10 at 15:24

I think you have to set Optimization to "Minimize Size (/O1)" or "Disabled (/Od)" to get the Release configuration to compile; at least this is what did the trick for me with VS 2005. Intrinsics are designed for speed so it makes sense that they would be enabled for the other Optimization levels (Speed and Full).

Good idea, but it doesn't work. I wrote my own version, called ClearMemory() using a namespace to make sure it doesn't conflict with anything else. The optimizer replaced my implementation of ClearMemory() with a call to memset() (with a byte value of 0)! Too smart for its own good. :-)
– Adrian McCarthy May 31 '10 at 18:23

This also doesn't work if it's the compiler that uses memset in the first place (like in a class initializer).
– romkyns Jan 16 '12 at 18:13
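To illustrate the kind of code the comment above describes, here is a small sketch in which the source never mentions memset(), yet the compiler may still emit a call to it in order to zero the array. The `Big` type and the `SumOfFreshBig` helper are hypothetical, and whether a call is actually generated depends on the compiler version and options.

```cpp
// Hypothetical type: large enough that zeroing it is unlikely to be done
// with a handful of inline stores.
struct Big {
    unsigned char data[4096];
    Big() : data() {}  // value-initializes (zeroes) every element
};

// Hypothetical helper that constructs a Big and reads it back.
inline unsigned SumOfFreshBig() {
    Big b;
    unsigned sum = 0;
    for (unsigned i = 0; i < sizeof b.data; ++i) {
        sum += b.data[i];
    }
    return sum;
}
```

If the compiler does lower the initializer to a memset() call, the linker still needs a definition of memset() from somewhere, which is the situation the question describes.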

In the specific case where you want to write zeroes, the SecureZeroMemory function seems to work. (It's implemented as a forced inline function embedded into winnt.h.)
– Harry Johnston Jun 18 '12 at 1:16
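For readers who cannot include winnt.h, a loop in the same spirit can be sketched as follows. This is not the Windows implementation: `ZeroBytesVolatile` is a hypothetical name, and the assumption is that writing through a volatile pointer is what discourages the optimizer from replacing the loop with a call to memset().

```cpp
#include <cstddef>  // only for std::size_t

// Hypothetical helper, not the Windows API: zeroes cnt bytes starting at ptr.
inline void *ZeroBytesVolatile(void *ptr, std::size_t cnt) {
    // Each store through a volatile pointer counts as an observable access,
    // so the loop is expected to survive optimization as written.
    volatile unsigned char *p = static_cast<volatile unsigned char *>(ptr);
    while (cnt-- > 0) {
        *p++ = 0;
    }
    return ptr;
}
```

The trade-off is that byte-at-a-time volatile stores are unlikely to be as fast as the intrinsic, so this suits small buffers better than large ones.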

The "regular" runtime library does this by compiling an assembly file with a definition of memset and linking it into the runtime library. You can find the assembly file in or around C:\Program Files\Microsoft Visual Studio 10.0\VC\crt\src\intel\memset.asm. That kind of thing works fine even with whole-program optimization.

Also note that the compiler will only use the memset intrinsic in some special cases (when the size is constant and small?). It will usually use the memset function provided by you, so you should probably use the optimized function in memset.asm, unless you're going to write something just as optimized.