Something to look out for when typecasting, too, since a cast does not change the underlying type, so the results might not be what one would expect. Hmm, one thing that has always baffled me is byte types, i.e. unsigned char: do they also fall back to int, or is there more to them? Their range, from experience, is 0-255, and there's actually a rather nasty bug in the vanilla Doom 3 font code that assumes a byte value can be negative, which every static analysis tool I have tried on it screams about: byte values cannot be negative. Funny enough, most compilers let it pass anyway, heh.

In a for loop it's always 0, as in for (i = 0; i < something; i++ or ++i), unless you do things a bit differently. So, as one can see, this can easily fool an inexperienced programmer into believing that the loop counter starts at 1 and not 0.

One would think so, but I've actually seen devs make that mistake more times than you'd think, hehe. There's a fine example of it in the vanilla Doom 3 sources; just run clang on it as a static code analyzer and get the popcorn. Honest mistake though, since the dev seems to have just copy-pasted the code piece in question but forgot to check the return value.

Memory references should be expected to be slower than (fairly fast) register operations. A bitshift will take only one clock, while memory references can easily take many, many more. Memory access is FAR from free - especially if it means that other locals can no longer be stored in a register either. The memory accesses also mean that the compiler probably doesn't know whether the value actually changes, so any loop unrolling optimisations go out of the window too. Also, your code violates C's strict aliasing rules.

Spike wrote:a bitshift will take only one clock, while memory references can easily take many many more.

The thing is, I couldn't find any C article detailing how many clock cycles it can take to read referenced values. Only x86 Assembly articles, which I haven't learned.

Spike wrote:Memory access is FAR from free - especially if it means that other locals can no longer be stored in a register either. The memory accesses also mean that the compiler probably doesn't know whether the value actually changes, so any loop unrolling optimisations go out of the window too.

Well, I tried using it in the unrolled inner loops of the rasterizer, to reduce the amount of bitshifting performed on the texture coordinates. The inner loop is manually unrolled to process 16 pixels at once, which means over 32 bitshift operations (on 16 s coordinates and 16 t coordinates) were eliminated through this method. I hoped that this would at least make the code more compact and less prone to cache misses, maybe helping performance.

Spike wrote:Also, your code violates C's strict aliasing rules.

Which is why I've called it a trick. Reading a signed int through an unsigned short pointer is risky, but the texture coordinates are always guaranteed to be positive anyway.

At first I had thought of using a union instead of a pointer, but then I found out that there would be overhead because the union would have to use a struct, which would result in the memory address of the unsigned short having to be calculated from an "s2" offset: { int i; struct { unsigned short s1, s2; }; }. Using a separate short pointer, the offset can be precomputed.

Well, modern CPUs have tricks that allow them to do all sorts of things asynchronously, which lets certain things take less than a clock if they're simple enough for the instruction decoding to rearrange. Memory references are far from simple, so try to prevent the compiler from spilling locals to the stack. CPUs have limited associativity, which means that if you start reading/writing some new location, the CPU will forget somewhere else, and re-accessing somewhere that was forgotten will incur a stall as it repopulates from a higher-level cache. If that means it needs to re-read from system RAM, then expect a large stall. If you have two CPUs/GPUs trying to write the same memory region then you'll find that they will constantly purge the other's cache of that region (the alternative is worse...).

Regarding structs/unions, I wouldn't worry about the offset: if it's on the stack then esp+(offset) costs the same as esp+(offset+2). Globals don't need the esp part, of course. Either way you no longer need a separate register to hold the pointer's value, so more registers for your actual maths.

Remember that x86 only has 8 registers (amd64 raises that to 16), and many of them are reserved for specific purposes. So if you need to hold many variables at a time, you'll end up spilling them all to the stack, and now each reference to those variables will need load+store operations too. Constants can often be embedded in the instructions themselves, which helps reduce cache misses etc.

C99's restrict keyword (often spelled __restrict as a compiler extension) might be useful to you, as it gives the compiler greater freedom by promising that writes through one pointer will not change memory reachable through another. This isn't normally an issue for locals if nothing takes their address, but your code decided to take an address to one, and now the compiler might need to read the pointer and THEN dereference it as two separate operations, and only then can it read the value.

Or something.

If this stuff is important to you, you should really try to figure out how to use cachegrind - http://valgrind.org/docs/manual/cg-manual.html - and for raw clock costs, you should use something like gprof (which requires compiler instrumentation). Expect different results on different CPUs (especially, but not only, with different instruction sets), or even between runs (interrupts etc. will flush caches, which will affect reported costs).