1. Open your file with FILE_FLAG_OVERLAPPED and also FILE_FLAG_NO_BUFFERING. Buffering helps reads, but it severely hurts writes; it takes my
write speeds from 80 MB/sec down to around 30 MB/sec. I suspect that the reason is that buffering causes the pages to be read in
before they are written out. (it's sort of like using cached memory - it will fetch in the pages even though you're doing nothing
but stomping all over them).

2. Use the undocumented NtSetInformationFile to resize the file to its full size before you write anything. SetEndOfFile will only work with page-size
granularity; NtSetInformationFile can do arbitrary sizes. BTW this is also better for fragmentation than just writing lots of data
onto the end of the file, which can cause NTFS to give you lots of allocation chains.

3. Use SetFileValidData to tell Windows that whatever is in the sectors on disk is okay. If you don't use SetFileValidData, Windows
will first zero out each sector before you get to touch it. This is like a security thing, but obviously it's pretty bad for perf to
basically write the whole file twice. SetFileValidData will fail unless you first ask for your process to get the right permissions,
which of course will only work for processes running as administrator. Okay, I did all that.
(This post is okay, but dear god, don't read the thread.)

If you do all those things right - your WriteFile() calls will actually be asynchronous. The actual speed win is not huge. Part of
the problem is the next issue :

When I do all that, I start hitting some kind of weird OS buffer filling issue. I haven't tried to track down exactly what's happening
because I don't really care that much, but what I see is that the writes are totally async and very fast (> 100 MB/sec) for some random
amount of time (usually up to about 10 MB of writing or so) and then suddenly randomly start having huge delays. The write speed then
goes down to 40 MB/sec or so.

ADDENDUM : when I say "you should not even try" I mean you should just live with WriteFile() being synchronous. It's plenty fast.
Just run it from a thread and it still looks async to your thread (you need a thread anyway because OpenFile and CloseFile are very
slow and synchronous; in fact the only thing you can rely on actually being fast and async is ReadFile).
Also just live with the fact that Windows is zeroing the disk before you write it, everyone else does.

1. Manually walking ebp/esp ; this steps back through the frame pointers, and it relies on the caller's frame pointer being pushed on the stack.
This is basically what RtlCaptureStackBackTrace or DmCaptureStackBackTrace does, but you can also just write it
yourself very easily. The advantage of this is it's reasonably fast. The disadvantage is it doesn't work on all CPU
architectures, and it doesn't work with the frame pointer omission optimization.

2. StackWalk64. This is the new API you're supposed to use. The advantage is it works on all CPUs and it even
works with frame pointer omission (!). But you can see from that latter fact that it must be very slow. In order
to work with FPO it loads the PDB and uses the instruction pointer map to figure out how to trace back. It also
can trace through lots of system calls that normal ebp-walking fails on.

3. Manual push/pop in prolog/epilog. Uses the C compiler to stick a custom enter/leave on every function that does a push & pop to
your own stack tracker. Google Perftools has an option to work this way. The "MemTracer" project works this way (more on MemTracer some day).
The nice thing about this is it works on any architecture as long as the prolog/epilog is supported. The disadvantage is it adds a
big overhead even on functions that you never trace. That rather sucks. Stacktraces are very rare in my world, so I want to pay
the cost of them only when I actually do them, I don't want to be pushing & popping stack info all the time.

1/28/2009

So we have a thing to track memory allocs with a stack trace, la di da, no big whoop. I log it all out to a file.
So I wrote a thing to parse them into a hierarchy and spit them out with tabs for tabview. That's awesome.

Then I thought, hey, Atman makes these awesome graphs with "graphviz" so maybe I'll try that. One disadvantage of the pure hierarchy view
in tabview is that you can't really see the flow when lines merge back up. That is, call graphs are not *trees* they're *DAGs*. Sometimes
the stack hierarchy forks apart but then comes back together again. Graphviz should be able to show this neatly. Graphviz makes "SVG" files
that you can just load with Firefox (Firefox 3's support is much better than 2's).

Anyway I made some graphs with various options and it's kinda cool. Here's an example :
Allocs1.svg (you'll need to use ctrl-wheel to zoom out to see anything).
Apparently if you click this link, Firefox does nothing good. You have to
download it - then open it with Firefox. WTF. Apparently it's my Verio servers
doing the wrong thing. Yegads I hate the web.

Not bad, but I'm a little disappointed with graphviz's layout abilities. In particular the actual cell and edge layout seems very good,
but they seem to have literally zero code to try to put the labels in good places.

In a bit of odd deja vu, this was one of the very
first things I ever worked on as a professional programmer in 1991; I worked for a GIS company that had an old COBOL GIS database engine that
worked with Census data and such; I wrote them a new C front end with graphics for PC's and such, and one of the things you have to do is
take this raw street data with lat/long coordinates and street names and do some nice munging to make it look okay; a big part of that is
a lot of heuristics about putting the labels for streets in good places (and when to repeat the label, etc.).

ADDENDUM : talking to Sean I realized you really want the graph to have different sizing/color options, be hierarchical, interactive, and stable.

That sounds hard, but I don't think it actually is. The key thing that makes it easy is that there is very good hierarchy in this information,
and you can create the graph incrementally. I think that means you can just use simple iterative penalty methods to make the graph stable.

Here's my proposal :

Start with the graph at very coarse granularity ; maybe directory granularity if you have a few directories and that makes sense, else file
granularity. Whatever coarse level gets you < 32 nodes or so. Just use a solver like graphviz to make this initial graph.

Now, interactively the user can click any group to expand its hierarchy. When that happens the big cell splits into various pieces. You
just create the new pieces inside the parent and make the new edges - and then you just let them time-evolve with a semi-physical iterative
evolution.

You apply a penalty force for intersection with neighbors to drive the nodes apart so there's no overlap. You similarly apply penalty forces so that
edges never intersect nodes or other edges. And the edges also act kind of like springs, applying forces to try to be short and straight.
Stable 2d physics is a pretty solved problem so you just let them run until they settle down. Note that as they spread apart they can force
the other nodes in the graph to move around, but it's all nice and smooth and stable.

I think it's much easier to treat the canvas as just infinitely large and let your nodes move apart all they need to. Graphviz does everything
oriented towards being printed on a page which is not necessary for the interactive view.

1/27/2009

I've been working on some memory allocator related junk recently (just for laughs I made my SmallAllocator in cblib work lock free and timed it
running on a bunch of threads; an alloc takes around 200 clocks; I think there may be some cache contention issues). Anyway it reminded me of
some funny stuff :

Munch's Oddysee was a wild painful birth; we were shipping for Xbox launch and crunching like mad at the end. We had lots of problems to deal with, like
trying to rip as much of the fucking awful NetImmerse Engine out as possible to get our frame rate up from 1 fps (literally 6 months before ship we had areas
of the game running at 1 fps) (scene graphs FTL). And then there were the DMusic bugs. Generally I found that the XBox tech support guys - the graphics and
general functionality guys - were really awesome, really helpful, etc. Eventually we trapped the DMusic exception in the kernel debugger and it got fixed.

Anyhoo... in the rush to ship and compromise, one of the things we left out was the ability to clean up our memory use. We just leaked and fragmented
and couldn't release stuff right, so we could load up a level, but we couldn't do a level transition. Horrible problem you have to fix, right? Nah,
we just rebooted the Xbox. When you do a level transition in Munch, you might notice the loading screen pops up and is totally still for a few seconds,
and then it starts moving, and the screen flashes briefly during that time. That's your Xbox rebooting, giving us a fresh memory slate to play with.
(shortly after launch the Xbox guys forbade developers from using this trick, but quite a few games in the early days shipped with this embarrassing
delay in their level load).

For Stranger's Wrath we used my "Fixed Restoring" Small Allocator, which is a standard kind of page-based allocator. We were somewhat crazy and used a
hefty amount of allocations; our levels were totally data driven and objects could be different types and sizes depending on designer prefs, so we
didn't want to lock into a memory layout. The different levels in the game generally all get near 64 MB, but they use that memory very differently.
We wanted to be able to use things like linked lists of little nodes and STL hashes and not worry about our allocator. So the Small Allocator provides
a small block allocator with (near) zero size overhead (in the limit of a large # of allocations). That is, there are pages of some fixed size,
say 16 KB, and each page is assigned to one size of allocation. Allocations of that size come out of that page just by incrementing a pointer, so
they're very fast and there's zero header size (there is size overhead due to not using a whole page all the time).

That was all good, except that near the end when all the content started coming together I realized that some of our levels didn't quite fit no matter
how hard we squeezed - after hard crunching they were around 65 MB and we needed to get down to 63 or so,
and also the Fixed Restoring Allocator was wasting some space due to the page granularity. I was assigning pages to each size of
allocation, rounding up to the next alignment of 8 or 16 or 32. Pages would be given out as needed for each size bucket. So if a size bucket was never used,
pages for that size would never be allocated.

This kind of scheme is pretty standard, but it can have a lot of waste. Say you allocate a lot of 17 byte items - that gets rounded up to 24 or 32
and you're wasting 7 bytes per item. Another case is that you allocate a 201 byte item - but only once in the entire game ! You don't need to give
it a whole page, just let it allocate from the 256-byte item page.

Now in a normal scenario you would just try to use a better general purpose allocator, but being a game dev near shipping you can do funny things.
I ran our levels and looked at the # of allocations of each size and generated a big table. Then I just hard-coded the Fixed Restoring allocator
to make pages for exactly the sizes of allocations that we do a lot of. So "Stranger's Wrath" shipped with allocations of these sizes :

(The 116, 136, 156, and 272 are our primary GameObject and renderable types). (yes, we could've also just switched those objects to a custom
pool allocator for the object, but that would've been a much bigger code change, which is not something I would want to do very close to shipping
when we're trying to be in code lockdown and get to zero bugs).

There is a misconception being widely spread that x86 can reorder reads (Herb Sutter and lots of good people have been
repeating this). So far as I can tell that's just not true.
The IA-32 spec says that writes don't move past writes nor do reads move past reads. (there are lots of other constraints
in there).

x86 plain old load and store are acquire and release. Note that that does NOT mean they have a total order
(though if your code is written right for acquire/release semantics you don't need a total order).
However volatile (locked) ops on x86 do have a total order.

BTW I am very much not suggesting that you or I write code that relies on the quirks of the x86. It's better to be very careful and generally correct and mark all the places that you
are assuming different types of sequencing. If those sequencing commands turn into NOPs on x86, then bully for you. Still, it helps me to actually
know what's going on in our systems, and if we're going to say things, let's try to say them right; my god there's so much completely wrong
information about this stuff out there. This Gamasutra article is just
chock-full of wrong.

Also, Bartosz in the famous post where he gets it wrong
talks about the different memory model constraints. One is :

# memory_order_consume: potentially weaker form of memory_order_acquire that enforces ordering of the current load before other operations that are data-dependent on it (for instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won't be moved before it) (yes, even that is not guaranteed on all platforms!).

This is for the dependent read case. This is the same place that Linux uses "smp_read_barrier_depends". The typical scenario is like :

Note that Bartosz says this in a funny way. The issue is not that the compiler or the CPU reorder buffer can move the dependent read before the pointer read. Obviously that's impossible.
The issue is that for purposes of *memory timing* it could look as if the dependent read was moved earlier. Say I write this bad code :

Obviously there's no way to execute this out of order, the chip and the compiler are not *broken*. But it can look like it went out of order if your cache architecture allows it.

Say some other thread wrote s_shared_ptr->data , then s_shared_ptr. It used a Release so they went to the bus in order. But your chip is crazy dumb and loads cache lines in random order.
Your chip reads the line with s_shared_ptr in it, and then your code runs :

Object * ptr = s_shared_ptr;
int local = ptr->data;

What you see is the *new* value of s_shared_ptr , but the *old* value of ptr->data. Now your dumb cache pulls in ptr->data but it's too late. We see that our code *acted* like it read ptr->data
before ptr.

Fortunately this doesn't happen on modern chips (the Alpha is the only chip I know of that can do this). However to be really correct about marking up your code's memory model semantically
you should include the memory_order_consume or smp_read_barrier_depends. Then your compiler can turn those into NOPs ;)

Now it's a little bit of a mystery to me exactly how processors manage this. I think it must be that the caches talk to each other and they must invalidate pages in temporal order or something.

BTW I really don't like the idea of a C++ atomic<> class, or the Magic Microsoft Volatile. The problem with both of those is they hide where
the code is doing really crucial synchronization things. You can have code that looks like :

newnode->next = node->next;
node->next = newnode;
return node->data;

and it's secretly using atomic<> or volatile and it only works because it's relying on those to do the right acquire/release stuff. Dear god.

The really scary thing about both of those is that they are deceptively magical. They work very neatly. Most of the time. And they make it
very easy for any programmer who doesn't know anything to go ahead and write some lock free code. Yikes.

I would much rather see everyone continue to use locks, and when you really do need to do lock-free stuff, the C++0x proposal for specifically
marking up the semantics needed with memory model constraints is pretty good IMO. It clearly marks what the code requires and what the
ordering constraints are.

To say this more concisely : I'd rather see atomic<> functions in the language rather than atomic data types. Because it's really the operations
that are "atomic" or not. But it looks like I won't get my way and we're all going to be smothered by an avalanche of broken thready code in the
next 10 years.

1/26/2009

I'm going to update Part 2 with some fixes and I'd like to write a Part 3 about some actual lock-free madness, but those may have to wait a
tiny bit. In the mean time I thought I would post some links.

These articles have a good discussion of the memory models of various processors ; in particular Part 1 has a nice table that shows what each
processor does (warning : there may be some slight wrongness in here about x86).

Here come a mess of links about memory models and barriers and what's going on. There's a lot of contradiction (!!) so be careful :
For example - does LFENCE do anything at all on x86 for normal aligned integral reads? good question... btw SFENCE definitely does do nothing
(I think?).

And now some links about some real mad shit. Do not try any of this at home. I've found two mad geniuses :
Chris Thomasson ("the AppCore guy") who has got some nutty x86 lock-free synchronization stuff that relies on details of x86 and does safe communication
with zero fencing, and Dmitriy V'jukov ("the Relacy guy") who has written a ton of nutty awesome stuff for Intel.

It's a horrible name, it's not "Threading Building Blocks" at all. It's fucking OpenMP. It's not really a lib of little threading helpers, which would
be handy, it's like fucking CUDA for CPUs.

Also just to repeat - I am in no way recommending that anybody go down this path. As I have said repeatedly - CRITICAL_SECTION is really fucking fast
and smart and does exactly what you want. Just use it. If critical section locks are giving you problems it's because your architecture is wrong,
not because the micro-routine is not fast enough.

For my money this is a little silly because there's a really damn trivial solution to this. The main point of the old C++ initialize-on-demand Singleton pattern (like Instance() above) is to make it
work during CINIT with complex initialization order dependencies. Now, CINIT we know is run entirely on one thread before any threads are made. (you are fucking nuts if you're making threads during
CINIT). That means we know any Singletons that get made during CINIT don't have to worry about threading. That means you can use a trivial singleton pattern, and just make sure it gets used in
CINIT :

The forceUse thing just makes sure our GetSingleton is called during cinit, so that we can be sure that we're all made by the time threads come around. The GetSingleton() that we have here is NOT thread
safe, but it doesn't need to be ! (btw I assume that "CreateThread" acts as a MemoryBarrier (??))

Okay, that's nice and easy. What if you want a Singleton that doesn't usually get made at all, so you don't want to just make it during CINIT ? (you actually want it to get made on use). Okay, now we
have to worry about thread-safing the construction.

The easiest way is you know your Singleton won't get used during CINIT. In that case you can just use Critical Section :

Okay, that was easy, but it's horribly broken. First of all, there's no guarantee that Object is only made once. The thread can switch while the instruction pointer is between the first if() and the
critsec lock. If that happens, some other thread can get in and make s_instance, and then when we come back to execute we run through and make it again. (If you like, you could say we put the critsec in the wrong place - we could fix it by
moving the critsec outside the if). Even aside from that, the line that assigns s_instance is all wrong, because the pointer to s_instance is not necessarily being written atomically, and it might be written before
the stuff inside Object is written. What did we learn in Part 1?

This is a "double-checked lock" that works. The purpose of the double-check is to avoid taking the critsec when we can avoid doing so. If instance is NULL we take the critsec, then we have to check again
to make sure it's still null, then we rock on and make sure that the Object's memory is flushed before s_instance is set, by using a "Release" memory barrier. Also using Interlocked ensures the pointer is
written atomically.

ADDENDUM : I need to add some more notes about this. See comments for now.

Okay, that's all good, but it doesn't work if you now ever try to use this Singleton from CINIT. The problem is that s_crit might not be constructed yet. There's a pretty easy solution to that - just check
if s_crit has been initialized, and if it hasn't, then don't use it. That works because we know CINIT is single threaded, so you can do something like :

This actually works pretty dandy. Note that in CINIT you might take the upper or lower paths - you have no way of knowing if GetSingleton() will be called before or after the critsec is initialized. But that's fine,
it works either way, by design. Note that we are crucially relying here on the fact that all non-trivial CINIT work is done after all the simple-type zeroing.

Okay, so that's all fine, but this last thing was pretty ugly. Wouldn't it be nice to have a critical section type of mutex object that can be statically initialized so that we don't have to worry about CINIT
mumbo jumbo ?

It starts out as a simple spin lock like the above. It does the same kind of thing to busy-spin the processor first. But then it doesn't just Sleep. This kind of
spin-and-sleep is a really really bad thing to do in heavy threading scenarios. If you have lots of threads contending over one resource, especially with very different
execution patterns, the spin-and-sleep can essentially serialize them (or worse). They can get stuck in loops and fail to make any progress.

Once CRITICAL_SECTION sees contention, it creates a kernel Event to wait on, and puts the thread into an alertable wait. The Windows scheduler has lots of good mojo for
dealing with events and threads in alertable waits - it's the best way to do thread sleeping and wakeups generally. For example, it has mojo to deal with the bad
cases of heavy contention by doing things like randomly choosing one of the threads that is waiting on a critsec to wake up, rather than just waking up the next one
(this prevents a few runaway threads from killing the system).

One important note : you should almost always use "InitializeCriticalSectionAndSpinCount" these days to give your crit secs a spinner; more about why in a moment.

which clearly shows that the idea that a critical section is "too slow" is nonsense. I've said this already but let me emphasize :

Most of the time (no contention) a Crit Sec *is* just an InterlockedIncrement
(so how could you beat it by using InterlockedIncrement instead?)
When there is contention and the Crit Sec does something more serious (use an Event) it's slow
but that is exactly what you want !!

In fact, the big win from "lock free" is not that InterlockedIncrement or whatever is so much faster - it's that you can sometimes
do just one interlocked op, unlike crit sec which requires two (one for Enter and one for Leave).

Ok, now a bit more about InitializeCriticalSectionAndSpinCount. The reason you want a Spin is because basically every new machine these days
is "multiprocessor" (multicore). That's a big difference from threading on a single core.

If you're threading on a single core - memory will never change underneath you while you are executing. You can get swapped out and
memory can change and you can get swapped in, so it looks like memory changed underneath you, but it doesn't happen in real time.
With multiple cores memory can be touched by some other core that is running along at full speed.

Often this doesn't affect the way you write threading code, but it does affect performance issues. It means you can have contention on a
way finer scale. In a normal OS single proc environment, you get large time slices so other threads aren't frequently randomly poking at
you. With real multicore the other guy can be poking at you all the time. A good range is to spin something like 1000-5000 times;
spinning 1000 times takes about a microsecond.

What the spin count does is let you stay in the same thread and avoid an OS thread switch if some other processor is holding the lock.
Note that if it's a thread on the same processor - spinning is totally pointless (in fact it's harmful).

1/25/2009

Okay, this is definitely one of those posts where I'm no expert by a long shot so I'll probably write some things that are wrong and
you should correct me. By "low level" I mean directly accessing shared variables and worrying about what's going on
with the memory, as opposed to just using the language/OS constructs for safe sharing.

Let me say up front that writing lots of low-level thread-sharing code is a very very bad idea and should not be done in 99% of
the cases. Just use CriticalSection and once in a while use Interlocked and don't worry about the tiny inefficiency; if you
do things right they won't be a speed hit. Trying to get things right in lots of low level threading code is a recipe for
huge bugs.

I'm going to assume you know about race conditions and basic threading primitives and such. If you're hazy, this is a pretty good introduction :
Concurrency: What Every Dev Must Know About Multithreaded Apps.
I'm also going to assume that you know how to do simple safe thread interaction stuff using Interlocked, but maybe you don't know exactly
what's going on with that.

First of all, let me try to list the various issues we're dealing with :

1. Multiple threads running on one core (shared cache). This means you can get swapped in and out; this is actually the *easy* part but
it's what most people think of as threading.

2. Multiple cores/processors running multiple threads, possibly with incoherent caches. We have #1 to deal with, but also now the memory views
of various threads can be different. The memories sync up eventually, but it's almost like they're talking through a delayed communication channel.

3. CPU OOO (out-of-order) instruction reorder buffers and cache gather/reorder buffers. Instructions (even in ASM) may not execute in the order you wrote, and even
if they do execute in that order, memory reads & writes may not happen in the order of the instructions (because of cache line straddle issues, write
gather buffers, etc.)

4. Single CISC CPU instructions being broken into pieces (such as unaligned accesses). Single ASM instructions may not be single operations; things like
"inc" become "load, add, store" and there's an opportunity for a thread to interleave in there. Even apparently atomic ops like just a "load" can become
multiple instructions if the load is unaligned, and that creates a chance for another thread to poke in the gap.

5. The compiler/optimizer reordering operations. Obviously things don't necessarily happen in the order that you write them in your C program.

6. The compiler/optimizer caching values or eliminating ops

I think that's the meat of it. One thing that sort of mucks this all up is that x86 and MSVC >= 2005 are sort of special cases which are much simpler
than most other compilers & platforms. Unfortunately most devs and authors are working with x86 and MSVC 2005+ which means they do lazy/incorrect things
that happen to work in that case. Also I'm going to be talking about C++ but there are actually much better memory model controls now in Java and C#.
I'm going to try to describe things that are safe in general, not just safe on x86/Windows, but I will use the x86/Windows functions as an example.

Almost every single page I've read on this stuff gets it wrong. Even the experts get it wrong. I always see stuff like
this, where they implement a lock-free fancy doohicky, and then come back later and admit that oh, it doesn't actually work. For example
this Julian Bucknall guy has a lot of long articles about lock free stuff,
and then every 3rd article he comes back and goes "oh BTW the last article was wrong, oops". BTW never try to use any of the lock free stuff from
a place like "codeproject.com" or anybody's random blog.

I've read a lot of stuff like :

Unfortunately, Matt's answer features what's called double-checked locking which isn't supported by the C/C++ memory model.

To my mind, that's a really silly thing to say. Basically C/C++ doesn't *have* a memory model. Modern C# and Java *do* which means that the language
natively supports the ideas of memory access ordering and such. With C/C++ you basically have zero guarantees of anything. That means you have to do
everything manually. But when you do things manually you can of course do whatever you want. For example "double checked locking" is just fine in C,
but you have to manually control your memory model in the right way. (I really don't even like the term "memory model" that much; it's really an "execution
model" because it includes things like the concept of what's atomic and what can be reordered).

Some things I'm going to talk about : how lock free is really like spin locking, how critical sections work, why you should spincount critical
sections now, what kind of threading stuff you can do without critsecs, etc.

Something I am NOT going to talk about is the exact details of the memory model on x86/windows, because I don't think you should be writing code for a
specific processor memory model. Try to write code that will always work. x86/windows has strong constraints (stores are not reordered past stores, etc.) -
but forget you know that and don't rely on it.

Let's look at a simple example and see if we can get it right.

Thread A is trying to pass some work to thread B. Thread B sits and waits for the work then does it. Our stupid example looks like :

It's doing a bunch of work and assigning to a shared variable. There are no guarantees about what order that gets written to memory, so g_work could be assigned
before the struct is set up, then ThreadB could start poking into it while I'm still constructing it. We want to release a full object to g_work that we're all
done with. We can start trying to fix it by doing :

1. MyStruct * temp = new MyStruct( argc, argv );
2. g_work = temp;

that's good, but again you cannot assume anything about the memory model in C or the order of operations. In particular, we need to make sure that the
writes to memory done by line 1 are actually finished before line 2 executes.

MemoryBarrier is an intrinsic in MSVC ; it actually does two things. 1. It emits an instruction that causes the processor to force a sync point (this also actually
does two things : 1A. it flushes caches and write gather buffers, and 1B. it puts a fence in the reorder buffer so the processor can't speculate ahead). 2. The MemoryBarrier
intrinsic also acts as a compiler optimizer barrier - so that the MSVC compiler won't move work from before MemoryBarrier to after it.

MemoryBarrier is a full memory fence, it creates an ordering point. In general if you just write memory operations :

A
B
C

you can't say what order they actually happen in. If another thread is watching that spot it might see C,A,B or whatever. With MemoryBarrier :

A
B
MemoryBarrier
C

You get an order constraint : C is always after {AB} , so it might be ABC or BAC.

Another digression about the compiler optimizer fence : in windows you can also control just the compiler optimization with
_ReadWriteBarrier (and _ReadBarrier and _WriteBarrier). This doesn't generate a memory
fence to the processor, it's just a command to the optimizer to not move memory reads or writes across a specific line. I haven't seen a case where I would actually
want to use this without also generating a memory barrier (??). Another thing I'm not sure about - it seems that if you manually output a fence instruction with __asm,
the compiler automatically treats that as a ReadWriteBarrier (??).

Alright, so we're getting close : we've made a work struct and forced it to flush out before it becomes visible to the other thread :
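A sketch of that (the real thing uses MemoryBarrier on Windows; the fallback macro here is just so the sketch compiles elsewhere) :

```cpp
#include <cassert>

#ifdef _MSC_VER
#include <windows.h>                          // real MemoryBarrier
#else
#define MemoryBarrier() __sync_synchronize()  // stand-in full fence elsewhere
#endif

struct MyStruct
{
    int argc;
    const char ** argv;
    MyStruct(int c, const char ** v) : argc(c), argv(v) { }
};

MyStruct * g_work = 0;

void PublishWork(int argc, const char ** argv)
{
    MyStruct * temp = new MyStruct(argc, argv);
    MemoryBarrier();  // all of temp's fields are flushed out before ...
    g_work = temp;    // ... the pointer becomes visible to the other thread
}
```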

What about this last line? It looks innocuous, but it holds many traps. Assignment is atomic - but only if g_work is a 32-bit pointer on 32-bit x86, or a 64-bit pointer on
64-bit x86. Also, since g_work is just a variable, it could get optimized out or deferred, or just stored in local cache and never flushed out to the bus, etc.

One thing we can do is use "volatile". I hesitate to even talk about volatile because it's not well defined by C and it means different things depending on platform
and compiler. (In particular, MS has added lots of threading usefulness to volatile, but nobody else does what they do, so don't use it!). What we want "volatile" for
here is to force the compiler to actually generate a memory store for g_work. To my mind I like to think that volatile means "don't put me in registers - always read or
write memory". (again on x86 volatile means extra things, for example volatile memory accesses won't get reordered, but don't rely on that!).
Note you might also have to make sure g_work is aligned unless you are sure the compiler is doing that.

One thing to be careful with about volatile is where you put it on a pointer. Remember to read pointer declarations from right to left in C:

volatile char * var;
// this is a non-volatile pointer to a volatile char !! probably not what you meant
char * volatile var;
// this is a volatile pointer to chars - probably what you meant
// (though of course you have to make sure the char memory accesses are synchronized right)

Note that volatile is a pretty big performance hit. I actually think most of the time you should just not use "volatile" at all, because it's too variable
in its meaning, and instead you should manually specify the operations that need to be sync'ed :
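Like so (a sketch; on other compilers I've faked InterlockedExchangePointer with a GCC builtin just so it compiles) :

```cpp
#include <cassert>

#ifdef _MSC_VER
#include <windows.h>   // real InterlockedExchangePointer
#else
// stand-in so the sketch compiles elsewhere : atomic exchange + full barrier
static void * InterlockedExchangePointer(void * volatile * target, void * value)
{
    return __atomic_exchange_n(target, value, __ATOMIC_SEQ_CST);
}
#endif

struct MyStruct
{
    int argc;
    const char ** argv;
    MyStruct(int c, const char ** v) : argc(c), argv(v) { }
};

MyStruct * g_work = 0;

void PublishWork(int argc, const char ** argv)
{
    MyStruct * temp = new MyStruct(argc, argv);
    // atomic store plus barriers : temp is fully written before g_work changes
    InterlockedExchangePointer( (void * volatile *) &g_work, temp );
}
```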

The Interlocked functions are guaranteed to be atomic. Now we don't have to just hope that the code we wrote actually translated into an atomic op.
The Interlocked functions also automatically generate memory barriers and optimizer barriers
(!! NOTE : only true on Windows, NOT true on Xenon !!). Thus the InterlockedExchangePointer forces MyStruct to get
written to temp first.

Let me just briefly mention that this full MemoryBarrier is *very* expensive and you can get away with less. In particular, something you will see
is "Acquire" and "Release". The heuristic rule of thumb is that you use "Acquire" to read shared conditions and "Release" to write them. More
formally, "Acquire" is a starting memory barrier - it means any memory op done after the Acquire will not move before it (but ones done before can move
after). "Release" is a finishing memory barrier - memory ops done before it will not move after (but ones done after can be done before).

So if you have :

A
B
C - Release
D

The actual order could be {A B C D} or {B A C D} or {A B D C} but never {A C B D}. Obviously Acquire and Release are a slightly weaker constraint than a
full barrier so they give the processor more room to wiggle, which is good for it.

So let's do another example of this simple thread passing (stolen from comp.programming.threads) :
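Roughly like this (my rendering, using C++11 std::atomic as a stand-in for the acquire/release primitives; on Windows you'd build these from Interlocked ops or volatile writes plus the right barriers) :

```cpp
#include <atomic>
#include <cassert>

int a, b;
std::atomic<int> valid(0);

void function1()   // producer thread
{
    a = 1;
    b = 2;
    valid.store(1, std::memory_order_release);  // a & b written before valid
}

int function3()    // consumer thread
{
    if ( valid.load(std::memory_order_acquire) )  // reads below can't move above
        return a + b;
    return 0;
}
```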

Now it's easy to fall into a trap of thinking that because we did the "Release" that function3 is okay. I mean, by the time function3 sees valid get
set to 1, 'a' and 'b' will already be set, so function3 is right, okay? Well, sort of. That would be true *if* function3 were in assembly so the
compiler couldn't reorder anything, and if the chip couldn't reorder memory ops (or if we rely on the x86 constraint of read ordering). In fact the
actual execution of function3 could go something like :

fetch a
fetch b
add b to a
test valid
set conditional a to 0
return a

which is now reading a and b before 'valid'. Acquire stops this.

Some good links on this basic memory barrier stuff : (read Kang Su in particular)

BTW note that volatile does some magic goodness in VC >= 2005 , but *not* on Xenon even with that compiler version.
In summary :

ReadBarrier/WriteBarrier - just prevents compiler reordering
MemoryBarrier() - CPU memory load/store barrier
Interlocked & volatile :
  Interlocked functions on Windows automatically generate a compiler & a CPU barrier
  Interlocked on Xbox 360 does not generate a CPU barrier - you need to do that yourself
special volatile thing in MSVC :
  volatile automatically does the right thing on VC >= 2005
  but not on Xbox 360, even with that compiler version

ADDENDUM : lol the new Dr. Dobbs has a good article by Herb called
volatile vs volatile . It covers a lot of this same territory.

And some more links that are basically on the same topic : (the thing with the valid flag we've done here is called the "publication safety pattern")

Complexification Gallery - wow really gorgeous images. All made algorithmically with
Processing, and most driven by semi-physical-mathematical models. Lots of applets to play with. This is seriously fucking amazing and inspiring.

Mischief and Ivo Beltchev
have some crazy debugger database plugin thing for the string-CRC model. IMO this is fucking nuts. But I do love me
some autoexp.dat

We were talking about this the other day and I realized I've forgotten what "Koenig lookup" (aka argument-dependent lookup) is and why you need it.
Basically it just means that functions are looked up in the namespaces of their arguments. So :
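Something like this (a reconstruction; the names are mine) :

```cpp
#include <cassert>

namespace my_ns
{
    struct Thing { int x; };
    int func(const Thing & t) { return t.x; }  // lives in my_ns, not global
}

int CallsFunc()
{
    my_ns::Thing t = { 7 };
    return func(t);  // unqualified call - found via the argument's namespace
}
```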

works, even though it's calling a func() that's not in the global namespace.

If you think about it a second this is nice, because it means that non-member functions on a class act like they are in
the same namespace as that class. This is pretty crucial for non-member operators; it would be syntactically horrific to
have to call the operators in the right namespace. But it's nice for other stuff too, it means you don't need to jam everything into
member functions in order to make it possible to hand a class out to another namespace.
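For example (a reconstruction; Vec3 stands in for whatever class) :

```cpp
#include <cassert>

namespace my_ns
{
    struct Vec3 { float x,y,z; };
    bool operator == (const Vec3 & a, const Vec3 & b)
    {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
}

bool SameVec(const my_ns::Vec3 & a, const my_ns::Vec3 & b)
{
    return a == b;   // *** calls my_ns::operator == from outside my_ns
}
```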

At the line marked *** we're calling my_ns::operator == (Vec3,Vec3) - but how did we get to call a function in my_ns when
we're not in that namespace? Koenig lookup.

Now, this really becomes crucial when you start doing generic programming and using templates and namespaces. The reason is
your containers in the STL are in std:: namespace. You are passing in objects that are in your namespace. Obviously
the STL containers and algorithms need to get access to the operations in your namespace. The only way they can do that
is Koenig lookup - they use the namespace of the type they are operating on. For example to use std::sort and make use of your " operator < "
it needs to get to your namespace.

1/21/2009

Currently Oodle threads communicate through plain old mutex locking, but I'd like to go towards lock-free communication eventually.
Now, lock-free coding doesn't have the normal deadlock problems, but it has a whole new host of much scarier and harder to debug
thread timing problems, because you don't have the simplicity of mutex blocking out concurrency during shared data access.

It occurs to me there's a pretty simple way to make the transition and sort of have a middle ground.

Start by writing your threading using plain old mutexes and a locked "communication region". The communication area can only be accessed
while the mutex is held. This is just the standard easy old way :
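A sketch of the shape of it (std::mutex standing in for a Win32 CRITICAL_SECTION; the region's contents are hypothetical) :

```cpp
#include <cassert>
#include <mutex>    // stand-in for a Win32 CRITICAL_SECTION
#include <vector>

// hypothetical shape of the communication region :
struct CommunicationRegion
{
    std::vector<int> pending;
    std::vector<int> completed;
};

std::mutex g_commMutex;
CommunicationRegion g_comm;   // only touched while g_commMutex is held

void PushWork(int job)
{
    std::lock_guard<std::mutex> lock(g_commMutex);
    g_comm.pending.push_back(job);
}
```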

Now find yourself a lockfree stack (aka singly linked list LIFO). The good old "SList" in Win32 is one fine choice. Now basically
pretend that Thread A and Thread B are like over a network, and send messages to each other via the lock-free stacks.

To keep the code the same, they both get copies of the Communication Region :

The nice thing is that Main Thread can at any time poke around in his own Pending and Completed list to see if various jobs
are still pending or done yet awaiting examination.

Obviously if you were architecting for lock-free from the start you wouldn't do things exactly like this, but I like the ability to
start with a simple old mutex-based system and debug it and make sure everything is solid before I start fucking around with lock-free.
This way 99% of the code is identical, but it still just talks to a "Communication Region".

ADDENDUM :

I should note that this is really neither new nor interesting. This is basically what every SPU programmer does. SPU "threads" get a
copy of a little piece of data to work on, they do work on their own copy, and then they send it back to the main thread. They don't
page pieces back to main memory as they go.

While the SPU thread is working, the main thread can either not look at the "communication region", or it can look at it but know that it
might be getting old data. For many applications that's fine. For example, if the SPU is running the animation system and you want the
main thread to query some bone positions to detect if your punch hit somebody - you can just go ahead and grab the bone positions without
locking, and you might get new ones or you might get last frame's and who cares. (a better example is purely visual things like particle
systems)

Now I should also note that "lock free" is a bit of a false grail. The performance difference of locks vs. no locks is very small. That
is, whether you use "CriticalSection" or "InterlockedExchange" is not a big difference. The big difference comes from the communication
model. Having lots of threads contending over one resource is slow, whether that resource is "lock free" or locked. Obviously holding locks
for a long time is bad, but you can implement a "lock free" model using locks and it's plenty fast.

A message-passing design like that is fast regardless of whether you use locks or not. Okay I've probably made this way more wrong and confusing now.

ADDENDUM #2 :

Let me try to express it another way. The "message passing" model that I described is basically a way of doing a large atomic memory write.
The message that you pass can contain various fields, and it is processed synchronously by the receiver. That makes common unsafe lock-free
methods safe. Let me try to make this clear with an example :

You want Thread B to do some work and set a flag when it's done. Thread A is waiting to see that flag get set and then will process the work.
So you have a communication region like :

// globals :
bool isDone;
int workParams[32];
int workResults[32];

Now a lot of people try to do lock-free work passing trivially by going :
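That is, something like this sketch (the function names are mine) :

```cpp
#include <cassert>

bool isDone = false;
int workParams[32];
int workResults[32];

// Thread B :
void DoWorkAndFlag()
{
    workResults[0] = workParams[0] * 2;   // some work
    isDone = true;   // hope A sees the results before the flag ... unsafe!
}

// Thread A :
int PollForResult()
{
    if ( ! isDone ) return -1;
    return workResults[0];   // may see the flag set but stale results
}
```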

Now it is possible to make code like this work, but it's processor & compiler dependent and can be very tricky and causes bugs.
(I think I'll write some about this in a new post, see later). (the problem is that the reads & writes of isDone and the params and
results don't all happen together and in-order). Instead we can just pass the object :
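Sketched like this (g_done stands in for the lock-free stack; the publish would really be an Interlocked op) :

```cpp
#include <cassert>

// the whole work unit travels as one message :
struct WorkMsg
{
    int params[32];
    int results[32];
};

WorkMsg * g_done = 0;   // stands in for a lock-free stack of finished messages

void FinishWork(WorkMsg * msg)   // Thread B
{
    msg->results[0] = msg->params[0] * 2;
    // on Windows this publish would be an Interlocked op (eg. an SList push),
    // which supplies the barriers; publishing the pointer publishes the
    // params, results and done-ness as one consistent unit
    g_done = msg;
}
```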

Okay. Basically we have taken the separate variables and linked them together, so that as far as our thread is concerned they get written
and read in one big chunk. That is, we move the shared data from one large consistent state to another.

1/17/2009

A while ago the
Some Assembly Required blog
wrote some good notes about float-to-int. I posted some notes there but I thought I'd try to summarize my thoughts coherently.

What I'd like is a pretty fast float-to-int (ftoi) conversion. The most useful variants are "truncate" (like C, fractions go
towards zero), and "round" , that is, fractions go towards the nearest int. We'd like both to be available all the time,
and both to be fast. So I want ftoi_trunc and ftoi_round.

First let me say that I hate the FPU control word with a passion. I've had so many bugs because of that fucker over the years. I write some
code and test it and everything is fine, and then we actually put it in some game and all hell breaks loose. WTF happened?
Oh well, I tested it with the default control word setup, and now it's running with the FPU set to single precision. The other
classic bug is people changing the rounding mode. D3D used to be really bad about this (you could use FPU_PRESERVE but it
was a pretty big hit back in the old days with software T&L, not a big deal any more). Or even worse is people who write code intentionally
designed to work with the FPU in a non-standard rounding mode (like round to nearest). Then if you call other code that's meant for the
normal rounding mode, it fails.

Ok, rant over. Don't mess with the FPU control word.

That means the classic /QIfist really doesn't do that much for us. Yes, it makes ftoi_trunc faster :

Note that a lot of people just do + 0.5 to round - that's wrong : for negatives you need to go the other way, because
the C truncation is *toward zero*, not *down*.
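To see the problem concretely, a little sketch :

```cpp
#include <cassert>

// naive "add 0.5" rounding - broken for negatives, because the C cast
// truncates toward zero :
int bad_round(float f)
{
    return (int)(f + 0.5f);
}
// bad_round( 1.7f) == 2   : fine
// bad_round(-1.7f) == -1  : wrong! the nearest int is -2
```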

Even if you could speed up the round case, I really don't like using compiler options for crucial functionality. I like
to make little code snippets that work the way I want regardless of the compiler settings. In particular if I make some
code that relies on ftoi being fast I don't want to use C casts and hope they set the compiler right. I want the code
to enforce its correctness.

Fortunately the xs routines at stereopsis by Sree Kotay are
really good. The key piece is a fast ftoi_round (which I have slightly rejiggered to use the union method of aliasing) :
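The core of it looks something like this (my paraphrase of the xs routine; it assumes little-endian and the default round-to-nearest mode) :

```cpp
#include <cassert>
#include <stdint.h>

// 1.5 * 2^52 : adding this pushes the integer part of 'val' into the low
// mantissa bits of the double, rounded there by the FPU's round-to-nearest
static const double xs_doublemagic = 6755399441055744.0;

union DoubleAnd32s
{
    double  d;
    int32_t i[2];   // i[0] is the low half on little-endian machines
};

inline int32_t ftoi_round( double val )
{
    DoubleAnd32s u;
    u.d = val + xs_doublemagic;
    return u.i[0];
}
```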

in my tests this runs at almost exactly the same speed as FISTp (both around 7 clocks), and it always works regardless of
the FPU control word setting or the compiler options.

Note that this is a "banker's round" not a normal arithmetic rounding where 0.5 always goes up or down - 0.5 goes to the
nearest *even* value. So 2.5 goes to 2.0 and 3.5 goes to 4.0 ; eg. 0.5's go up half the time and down half the time.
To be more precise, ftoi_round will actually round the same way that bits that drop out of the bottom of the FPU registers
during addition round. We can see that's why making a banker_round routine was so easy, because that's what the FPU
addition does.

But, we have a problem. We need a truncate (ftoi_trunc). Sree provides one, but it uses a conditional, so it's slow
(around 11 clocks in my tests). A better way to get the truncate is to use the SSE intrinsic :
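```cpp
#include <cassert>
#include <xmmintrin.h>

inline int ftoi_trunc( float f )
{
    // cvttss2si : converts with truncation toward zero, and ignores the
    // FPU control word entirely
    return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}
```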

Note that the similar _mm_cvt_ss2si (one t) conversion does banker rounding, but the "magic number" xs method is faster because it pipelines better,
and because I'm building for x86 so the cvt stuff has to move the value from FPU to SSE. If you were building with arch:sse
and all that, then obviously you should just use the cvt's. (but then you dramatically change the behavior of your floating
point code by making it run through float temporaries all the time, instead of implicit long doubles like in x86).

So, that's the system I have now and I'm pretty happy with it. SSE for trunc, magic number for round, and no reliance on
the FPU rounding mode or compiler settings, and they're both fast.

BTW note the ceil and floor from Sree's XS stuff which are both quite handy and hard to do any other way. Note you might think
that you can easily make ceil and floor yourself from the C-style trunc, but that's not true, remember floor is *down* even on negatives.
In fact Sree's truncate is literally saying "is it negative ? then ceil, else floor".

Finally : if you're on a console where you have read-modify-write aliasing stall problems the union magic number trick is probably not good
for you. But really on a console you're locked into a specific CPU that you completely control, so you should just directly use the
right intrinsic for you.

AND : regardless of what you do, please make an ftoi() function of some kind and call that for your conversions, don't just cast.
That way it's clear where you're converting, it's easy to see and search for, it's easy to change the method, and if you use ftoi_trunc
and ftoi_round like me it makes it clear which behavior you wanted.

ASIDE :
in fact I'm starting to think *all* casts should go through little helper functions to make them very obvious and clear.
Two widgets I'm using to do casts are :
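They're along these lines (my reconstruction - the exact names and details may differ) :

```cpp
#include <cassert>
#include <cstring>

// cast between types of identical size by bit pattern (eg. float <-> int) :
template <typename t_to, typename t_from>
t_to same_size_bit_cast( const t_from & from )
{
    assert( sizeof(t_to) == sizeof(t_from) );
    t_to to;
    memcpy( &to, &from, sizeof(to) );
    return to;
}

// value cast (eg. int -> short) that asserts no value was lost :
template <typename t_to, typename t_from>
t_to check_value_cast( const t_from & from )
{
    t_to to = static_cast<t_to>(from);
    assert( static_cast<t_from>(to) == from );
    return to;
}
```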

1/16/2009

WTF, MSVC has macro push & pop !? How did I not know this? It's so superior. It actually makes #defining new & delete possibly an okay option. (normally I get
sucked into a hell of having to #undef them and redef them back to the right thing)
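For example (a sketch; DEBUG_NEW is a pretend debug macro) :

```cpp
#include <cassert>

#define new DEBUG_NEW   // pretend some debug-new macro is in force

#pragma push_macro("new")
#undef new
// in here 'new' is the plain keyword again, eg. for a header that needs it :
int * g_seven = new int(7);
#pragma pop_macro("new")
// ... and from here on 'new' is the debug macro again
```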

Of course I can't use it at work where multi-platform support is important, but I can use it at home where I don't give a flying fuck about things that don't work in MSVC
and it makes life much easier.

So I just had kind of a weird issue that took me a while to figure out and I thought I'd write up
what I learned so I have it somewhere.
(BTW I wrote some stuff last year about
VirtualAlloc and the zeroer.)

The problem was this Oodle bundler app I'm working on was running out of memory at around 1.4 GB of
memory use. I've got 3 GB in my machine, I'm not dumb, etc. I looked into some things - possible
virtual address space fragmentation? No. Eventually by trying various allocation patterns I
figured it out :

dwAllocationGranularity

On Windows XP all calls to VirtualAlloc get rounded up to the next multiple of 64k. Pages are 4k - and pages
will actually be allocated to your process on 4k granularity - but the virtual address space is reserved in 64k
chunks. I don't know if there's any fundamental good reason for this or if it's just a simplification for
them to write a faster/smaller allocator because it only deals with big aligned chunks.

Anyway, my app happened to be allocating a ton of memory that was (64k + 4k) bytes (there was a texture that
was exactly 64k bytes, and then a bit of header put you into the next page, so the whole chunk was 68k). With
VirtualAlloc that actually reserves two 64k chunks of address space, so you are wasting almost 50% of your virtual address space.

NOTE : that blank space you didn't get in the next page is just *gone*. If you do a VirtualQuery it tells you
that your region is 68k bytes - not 128k. If you try to do a VirtualAlloc and specify an address in that range,
it will fail. If you do all the 68k allocs you can until VirtualAlloc returns NULL, and then try some more 4k
allocs - they will all fail. VirtualAlloc will never give you back the 60k bytes wasted on granularity.

The weird thing is there doesn't seem to be any counter for this.
Here are the TaskMgr & Procexp reading meanings :

TaskMgr "Mem Usage" = Procexp "Working Set"

This is the amount of memory whose pages are actually allocated to your app. That means the pages have actually been touched! Note that pages
from an allocated range may not all be assigned.

For example, if you VirtualAlloc a 128 MB range , but then only go and touch 64k of it - your "Mem Usage" will show 64k. Those pointer touches
are essentially page faults which pull pages for you from the global zero'ed pool. The key thing that you may not be aware of is that
even when you COMMIT the memory you have not actually got those pages yet - they are given to you on demand in a kind of "COW" pattern.

TaskMgr "VM Size" = Procexp "Private Bytes"

This is pretty simple - it's just the amount of virtual address space that's COMMITed for your app. This should equal the total "Commit Charge"
in the TaskMgr Performance view.

ProcExp "Virtual Size" = (no TaskMgr equivalent)

This one had me confused a bit and seems to be undocumented anywhere. I tested and figured out that this is the amount of virtual address
space RESERVED by your app, which is always >= COMMIT. BTW I'm not really sure why you would ever reserve mem and not
commit it, or who exactly is doing that, maybe someone can fill in that gap.

Thus :

2GB >= "Virtual Size" >= "Private Bytes" >= "Working Set".

Okay, that's all cool. But none of those counters shows that you have actually taken all 2 GB of your address space
through the VirtualAlloc granularity.

ADDENDUM : while I'm explaining mysteriously named counters, the "Page File Usage History" in the Performance tab of task manager
has absolutely nothing to do with the page file. It's just your total "Commit Charge" (which recall is the same as the "VM Size" or
"Private Bytes"). Total Commit Charge is technically limited by the size of physical ram + the size of the paging file.
(which BTW, should be zero - Windows runs much better with no paging file).

To be super clear I'll show you some code and what the numbers are at each step :

I'm assuming you all basically know about virtual memory and so on. It kind of just hit me for the first time, however, that our problem now
(in 32-bit apps) is the amount of virtual address space. Most of us have 3 or 4 GB of physical RAM for the first time in history, so you actually
cannot use all your physical RAM - and in fact you'd be lucky to even use 2 GB of virtual address space.

Some issues you may not be aware of :

By default Windows apps get 2 GB of address space for user data and 2 GB is reserved for mapping to the kernel's memory. You can change that
by putting /3GB in your boot.ini , and you must also set the LARGEADDRESSAWARE option in your linker. I tried this and it in fact worked just
fine. On my 3 GB work system I was able to allocate 2.6 GB to my app. HOWEVER I was also able to easily crash my app by making the kernel run
out of memory. /3GB means the kernel only gets 1 GB of address space and apparently something that I do requires a lot of kernel address space.

If you're running graphics, the AGP window is mirrored into your app's virtual address space. My card has 256MB and it's all mirrored, so as soon
as I init D3D my memory use goes down by 256MB (well, actually more because of course D3D and the driver take memory too). There are 1GB cards out
there now, but mapping that whole video mem seems insane, so they must not do that. Somebody who knows more about this should fill me in.

This is not even addressing the issue of the "memory hole" that device mapping to 32 bits may give you.
Note that PAE could be used to map your devices above 4G so that you can get to the full 4G of memory, if you also
turn that on in the BIOS, and your device drivers support it; apparently it's not recommended.

There's also the Address Windowing Extensions (AWE) stuff. I can't imagine a reason why any normal person would
want to use that. If you're running on a 64-bit OS, just build 64-bit apps.

VirtualQuery tells me something about what's going on with granularity. It may not be obvious from the docs,
but you can call VirtualQuery with *ANY* pointer. You can call VirtualQuery( rand() ) if you want to. It
doesn't have to be a pointer to the base of an allocation range. From that pointer it gives you back the base
of the allocation. My guess is that they do this by stepping back through buckets of size 64k. To cover 2 GB of
address space you need 32k chunks of 64k bytes. Each chunk has something like a MEMORY_BASIC_INFORMATION, which is about 32
bytes. To hold 32k of those would take 1 MB. This is just pure guessing.

SetSystemFileCacheSize is interesting to me but I haven't explored it.

Oh, some people apparently have problems with DLLs that load at fixed addresses fragmenting virtual memory. It's
an option when building a DLL to specify a fixed (preferred) virtual address. This is naughty, but some people do it. This could make
it impossible for you to get a nice big 1.5 GB virtual alloc or something. Apparently you can see the fixed address
in the DLL using "dumpbin.exe" and you can modify it using "rebase.exe".

ADDENDUM : I found a bunch of links about /3GB and problems with Exchange Server fragmenting virtual address space. Most interestingly to me these links also
have a lot of hints about the way the kernel manages the PTE's (Page Table Entries). The crashes I was getting with /3GB were most surely running
out of PTE's ; apparently you can tell the OS to make more room for PTE's with the /USERVA flag. Read here :

I found this GameFest talk by Chuck Walbourn :
Why Your Windows Game Won't Run In 2,147,352,576 Bytes that covers some of these same issues. In particular he goes into detail about the AGP
and memory mirroring and all that. Also in Vista with the new WDDM apparently you can make video-memory only resources that don't take any app virtual
address space, so that's a pretty huge win.

BTW to be clear - the real virtual address pressure is in the tools. For Oodle, my problem is that to set up the paging for a region, I want
to load the whole region, and it can easily be > 2 GB of content. Once I build the bundles and make paging units, then you page them in and out
and you have nice low memory use. It just makes the tools much simpler if they can load the whole world and not worry about it. Obviously that
will require 64 bit for big levels.

I'm starting to think of the PC platform as just a "big console". For a console you have maybe 10 GB of data, and you are paging that through
256 MB or 512 MB of memory. You have to be careful about memory use and paging units and so on. In the past we thought of the PC as "so much
bigger" where you can be looser and not worry about hitting limits, but really the 2 GB Virtual Address Space limit is not much bigger (and
in practice it's more like 1.5 GB). So you should think of the PC as having a "small" 1 GB of memory, and you're paging 20 GB of data through it.

1/15/2009

I like it when allocations of size N are aligned to the next lowest power of 2 below N.

So eg. an allocation of 4000 bytes is aligned to 2048. It means you can do things like just malloc a 128-bit
vector and it's aligned and you never have to worry about it. You never have to manually ask for alignment
as long as the alignment you want is <= the size of your object (which it almost always is).

eg. if you want to malloc some MAT4x4 objects, you just do it and you know that they are aligned to sizeof(MAT4x4).
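So the rule is just "largest power of two that is <= the size", eg. :

```cpp
#include <cassert>
#include <cstddef>

// the alignment this scheme gives an allocation of n bytes (n >= 1) :
size_t AlignmentForSize( size_t n )
{
    size_t a = 1;
    while ( a * 2 <= n )
        a *= 2;
    return a;   // the largest power of two that is <= n
}
```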

Is there any disadvantage to this? (eg. does it waste a lot of memory compared to more conservative alignment schemes?)

Also, I used to always do my malloc debug tracking "intrusively", that is, by sticking an info header at the front of the
block and allocating a bigger piece for each alloc, then linking them together. The advantage of this is that it's
very fast - when you free you just go to (ptr - sizeof(info_header)).

I think I am now convinced that that is the wrong way to do it. It's better to have a separate tracker which hashes from
the pointer address to an info struct. The big advantage of the "non-intrusive" way like this is that it doesn't change
the behavior of the allocator at all. So things like alignment aren't affected, and neither is cache usage or optimization
issues (for example if you're using a GC-type arena allocator and adjacency of items is important to performance).

In general now I'm more eager for debugging and instrumentation schemes like this which have *zero* effect on the behavior of
the core functionality, but basically just watch it from the outside.

(For example on consoles where you have 256M of memory in the real consoles and an extra 256M in the dev kits, it's ideal to
make two separate allocators, one in the lower 256 where all your real data goes and one for the upper 256 where all your
non-intrusive debug extra data goes; in practice this is a pain in the butt, but it is the ideal way to do things, so that
you have the real final version of the game running all the time in the lower 256).

1/13/2009

I just had another idea for strings that I think is rather appealing. I've ranted here before about refcounted strings and the
suckitude of std::string and bstring and so on. Anyway, here's my new idea :

Mutable strings are basically a vector< char > like std::string or whatever. They go through a custom allocator which *never frees*.
What that means is you can always just take a c_str char * off the string and hold onto it forever.

Thus the readable string is just char *, and you can store those in your hashes or whatever. Mutable string is a String thingy that
supports operator += and whatnot, but you just hold those temporarily to do edits and then grab the char * out.

So the usage is that you always just pass around char *'s , your objects all store char *'s, nobody ever worries about
who owns it and whether to free it, you can pass it across threads and not worry. To make strings you put a String on the stack
and munge it all you want, then pull the char * out and rock with that.

Obviously this wastes memory, BUT in typical gamedev usage I think the waste is usually microscopic. I almost always just read
const strings out of config files and then never edit them.

One exception that I'd like to handle better is frequently mutated strings. For example, you might have something in a spawner that
does something like this :
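Something like this sketch (hypothetical names; the point is that the name is rebuilt on every spawn) :

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// hypothetical spawner code - it rebuilds the same names over and over :
std::string MakeSpawnName( const char * base, int spawnIndex )
{
    char buf[80];
    snprintf( buf, sizeof(buf), "%s_%d", base, spawnIndex );
    return std::string(buf);
}
```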

I don't love making names programmatically like this, but lots of people do it and it's quite convenient, so it
should be supported.
With the model that I have proposed here, this would do allocs every time you spawn and memory use would increase forever. One way to fix
this is to use a global string pool and merge duplicates at the time they are converted to char *. That way you don't ever increase your
memory use when you make strings you made before - only when you make new strings.

With the string pool model, the basic op becomes :

const char * StringPool::GetPermanentPointer( String & str );

in which the 'str' is added to the pool (or an existing one is found), and the char * you get back will never go away.
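A sketch of how that op could work, using std:: containers as stand-ins (the real String type would be whatever mutable string you use) :

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

struct StringPool
{
    // elements of an unordered_set never move, so the c_str() we hand out
    // stays valid for the life of the pool
    std::unordered_set<std::string> m_pool;

    const char * GetPermanentPointer( const std::string & str )
    {
        // duplicates merge to the one stored copy
        return m_pool.insert( str ).first->c_str();
    }
};
```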

ADDENDUM : to be clear, this is not intended as an optimization at all, it's simply a way to make the programming easy
without being too awful about crazy memory use. (eg. not just making String a char [512])

libPNG is such overcomplicated shizzle. I could easily make a little image format that was just BMP with a LZ that was like maybe 1000
lines of code and just put it in a header, STB style. Hell I could toss in a lossy wavelet image coder for another 1000 lines of code. It
wouldn't be the best in the world, but it would be good enough and super simple. Unfortunately I guess there's no point cuz it's not a standard
and whatnot. (Using arithcoding instead of Huffman is part of what could make both of those so easy to write).

WIC is obviously a good thing in theory - it's silly that every app has its own image readers & writers (I mean Amiga OS did this perfectly
back in 1892 so get with the last century bitches). On the other hand, the fact that it's closed source and MS-only and in fact requires
.NET 3.5 makes it pretty much ass.

A simple open-source multi-platform generic image IO library that's based on pluggable components and supports every format is such an obvious thing that we should have. It
should be super simple C. You shouldn't have to recompile apps to get new plug-ins, so it should be DLL's in a dir in Windows and whatever similar
thing on other OS'es. (but you should also easily be able to just compile all the plug-ins into your app as a static lib if you want to do that).

One thing that mucks it up is that many of the image formats allow all kinds of complicated nonsense in them, and if you want to support all
that then your API starts getting really complicated and ugly. Personally I'm inclined to only support rectangular bit-plane images (eg. N
components, each component is B bits, and they are interleaved in one of a handful of simple ways, like ABCABC or AAABBBCCC ).

All compressors unfortunately have this problem that they start off super simple but become huge messes when you add lots of features.

ADDENDUM : Oh fucking cock hole. I put a PNG loader in my image thing and now all my apps depend on libPNG.dll and zlib.dll , so I have to
distribute those with every damn exe, and worry about dll's being in path and so on. They also cause failures at startup if they're not found,
when really they should just fail if I actually try to load a PNG. (of course I could do that by doing LoadLibrary by hand and querying for the
functions, but I have to call like 100 functions to load a PNG so doing that would be a pain). Urg bother.

1/10/2009

I'm kind of excited about the possibilities for video compression with Larrabee. The chip is very powerful and flexible and obviously well
suited to video. Having that much power in the decoder would let you do things that are impossible today - mainly doing motion-comp on
the decoder side. That lets you achieve the old dream of basically not sending motion vectors at all (or just sending corrections to
what's predicted).

In fact, it would let you send video just more like a normal predictive context coder. For each pixel, you predict a probability for
each value. That probability is done with context matching, curve fitting, motion compensation etc. It has to be reproduced in the decoder.
This is a basic context coder. These kinds of coders take a lot of CPU power, but are actually much simpler conceptually and architecturally
than something like H264. Basically you are just doing a kind of Model-Coder paradigm thing which is very well understood. You use an
arithmetic coder, so your goal is just to make more accurate probabilities in your model.
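
To show how small the coder half of Model-Coder really is, here's a toy sketch : a standard LZMA-style adaptive binary range coder with one probability. In a real video coder you'd pick the probability from a context (neighbors, motion-compensated prediction, curve fits, whatever) instead of a single global one - the coder itself doesn't change, all the work goes into the model:

```c
#include <stdint.h>

enum { PROB_BITS = 11, PROB_ONE = 1 << PROB_BITS, ADAPT_SHIFT = 5 };

typedef struct {
    uint64_t low; uint32_t range; uint8_t cache; int cache_size;
    uint8_t *out; int pos;
} Encoder;

void enc_init(Encoder *e, uint8_t *out) {
    e->low = 0; e->range = 0xFFFFFFFFu; e->cache = 0; e->cache_size = 1;
    e->out = out; e->pos = 0;
}

static void enc_shift_low(Encoder *e) { /* emit a byte, propagating carry */
    if ((uint32_t)e->low < 0xFF000000u || (e->low >> 32) != 0) {
        uint8_t carry = (uint8_t)(e->low >> 32);
        e->out[e->pos++] = (uint8_t)(e->cache + carry);
        while (--e->cache_size) e->out[e->pos++] = (uint8_t)(0xFF + carry);
        e->cache = (uint8_t)(e->low >> 24);
    }
    e->cache_size++;
    e->low = (uint32_t)e->low << 8;
}

void enc_bit(Encoder *e, uint16_t *prob, int bit) {
    uint32_t bound = (e->range >> PROB_BITS) * *prob; /* split range by P(bit=0) */
    if (bit == 0) { e->range = bound;
                    *prob += (PROB_ONE - *prob) >> ADAPT_SHIFT; }
    else          { e->low += bound; e->range -= bound;
                    *prob -= *prob >> ADAPT_SHIFT; }
    while (e->range < (1u << 24)) { e->range <<= 8; enc_shift_low(e); }
}

void enc_flush(Encoder *e) { for (int i = 0; i < 5; i++) enc_shift_low(e); }

typedef struct { uint32_t range, code; const uint8_t *in; int pos; } Decoder;

void dec_init(Decoder *d, const uint8_t *in) {
    d->range = 0xFFFFFFFFu; d->code = 0; d->in = in; d->pos = 0;
    for (int i = 0; i < 5; i++) d->code = (d->code << 8) | d->in[d->pos++];
}

int dec_bit(Decoder *d, uint16_t *prob) {
    uint32_t bound = (d->range >> PROB_BITS) * *prob;
    int bit;
    if (d->code < bound) { d->range = bound;
                           *prob += (PROB_ONE - *prob) >> ADAPT_SHIFT; bit = 0; }
    else                 { d->code -= bound; d->range -= bound;
                           *prob -= *prob >> ADAPT_SHIFT; bit = 1; }
    while (d->range < (1u << 24)) {
        d->range <<= 8;
        d->code = (d->code << 8) | d->in[d->pos++];
    }
    return bit;
}
```

The decoder reproduces the exact same probability updates as the encoder, which is the whole game - both sides must agree on the model bit for bit.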

Other slightly less ambitious possibilities are just using things like 3d directional wavelets for the transform. Again you're eliminating
the traditional "mocomp" step but building it into your transform instead.
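
A toy version of the idea - an integer Haar lifting step applied along the time axis for a pair of frames. It's exactly invertible, and the "directional" part would be doing this same lift along motion trajectories instead of straight through time, which is how the mocomp gets folded into the transform:

```c
#include <stddef.h>

/* Temporal Haar lift on two frames (per-pixel, exact integer inverse).
   f0 becomes the smooth (average) band, f1 becomes the detail band. */
void temporal_haar_forward(int *f0, int *f1, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int d = f1[i] - f0[i];   /* detail: frame-to-frame difference */
        f0[i] += d >> 1;         /* smooth: rounded average */
        f1[i] = d;
    }
}

void temporal_haar_inverse(int *f0, int *f1, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int d = f1[i];
        f0[i] -= d >> 1;         /* undo the lift exactly */
        f1[i] = f0[i] + d;
    }
}
```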

Another possibility is to do true per-pixel optical flow, and frame to frame step the pixels forward along the flow lines like an incompressible
fluid (eg. not only do the colors follow the flow, but so do their velocities). Then of course you also send deltas.

Unfortunately this is all a little bit pointless because no other architecture is anywhere close to as flexible and powerful, so you would be
making a video format that can only be played back on LRB. There's also the issue that we're getting to the point where H264 is "good enough"
in the sense that you can do HD video at near-lossless quality, and the files may be bigger than you'd like, but disks keep getting bigger and
cheaper so who cares.

Most of this is related to RAW. The GUILLERMO LUIJK stuff in particular is very good. dcraw seems to be the best freeware raw importer, but god
help you working with that. UFRaw and LibRaw are conversions of dcraw into more usable forms, though they tend to lag his updates.
I've given up on WIC because I can't get it (the new Windows SDK) to install on all my machines.

The commercial RAW processors that I've looked at are so freaking slow, this is definitely a problem that could use some bad ass optimizer
programmer love. Hell even just the viewers are slow as balls.

ImageMagick is pretty cool BTW ; it's really similar to the old DOS program "Image Alchemy" which I used to use lots in the 386 days. It's all
command line so you can set up batch files to do the processing you want on various images.