5/29/2010

Okay, MASM/MSDev support for x64 is a bit fucked. MSDev has built-in support for .ASM starting in VC 2005, which does everything for you,
sets up a custom build rule, etc. The problem is, it hard-codes ML.EXE - not ML64. Apparently they have fixed this for VC 2010 but it is basically impossible to back-fix.
(in VC 2008 the custom build rule for .ASM is in an XML file, so you can fix it yourself thusly)

The workaround goes like this :

Go to "c:\program files (x86)\microsoft visual studio 8\vc\bin". Find the occurrences of ML64.exe and copy each one to ML.exe. Now you can add .ASM files
to your project. Go to the Win32 platform config and exclude them from the build in Win32.

You now have .ASM files for ML64. For x86/32 - just use inline assembly. For x64, you extern from your ASM file.

Calling to x64 ASM is actually very easy, even easier than x86, and there are more volatile registers and the convention is that caller has to
do all the saving. All of this means that you as a little writer of ASM helper routines can get away with doing very little. Usually your args
are right there in {rcx,rdx,r8,r9} , and then you can use {rax,r10,r11} as temp space, so you don't even have to bother with saving space on
the stack or any of that. See the list of volatile registers.

and you can probably do better. (for example it's probably better to just define your function as returning unsigned char and then you can avoid the
movzx and let the caller worry about that)
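
For example, a complete ML64-ready leaf function can be this small. This is a hypothetical helper (the name rrMulHi64 and its job - returning the high 64 bits of an unsigned multiply - are invented for illustration); the C side would declare it as extern "C" unsigned __int64 rrMulHi64(unsigned __int64 a, unsigned __int64 b); :

; hypothetical ML64 helper : return the high 64 bits of a*b
; args arrive in rcx,rdx ; rax,rdx are volatile so nothing needs saving
_TEXT SEGMENT
rrMulHi64 PROC
    mov rax, rcx    ; first arg
    mul rdx         ; unsigned multiply by second arg -> rdx:rax
    mov rax, rdx    ; return the high half in rax
    ret             ; no prologue, no stack space, no register saves
rrMulHi64 ENDP
_TEXT ENDS
END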

ADDENDUM : I just found a new evil secret way I'm fucked. Unions with size mismatches appear not to trigger a warning of any
kind. So for example you can silently have this in your code :

union Fucked
{
    struct
    {
        void * p1;
        int t;
    } s;
    uint64 i;
};

build in 64 bit and it's just hose city. BTW I think using unions as a datatype in general is probably bad practice.
If you need to be doing that for some fucked reason, you should just store the member as raw bits, and then
same_size_bit_cast() to convert it to the various types. In other words, the dual identity of that memory should
be a part of the imperative code, not a part of the data declaration.
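
A minimal sketch of what that same_size_bit_cast() might look like (the implementation here is my guess, assuming C++11 static_assert) - the point is that the cast refuses to compile on exactly the size mismatch the union silently swallows :

```cpp
#include <cstdint>
#include <cstring>

// sketch of same_size_bit_cast : refuses to compile unless the two types
// have exactly the same size - the check that the union silently skips
template <typename To, typename From>
To same_size_bit_cast(const From & from)
{
    static_assert( sizeof(To) == sizeof(From), "bit cast size mismatch" );
    To to;
    std::memcpy(&to,&from,sizeof(To));
    return to;
}
```

With the struct from the union above, same_size_bit_cast to a uint64 fails to compile in a 64 bit build instead of silently aliasing garbage.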

I mentioned long ago in the low level threading articles that
some of the algorithms are a bit problematic with 64 bit pointers because we don't have large enough atomic operations.

The basic problem is that for many of the lock-free algorithms we need to be able to do a DCAS , that is a CAS of two pointer-sized values, or a pointer
and a counter. When our pointer was 32 bits, we could use a 64 bit CAS to implement DCAS. If our pointer is 64 bits then we need a 128 bit CAS to
implement DCAS the same way. There are various solutions to this :

1. Use 128 bit CAS. x64 has cmpxchg16b now which is exactly what you need. This is obviously simple and nice. There are a few minor problems :

1.A. There are no other 128 bit atomics, eg. Exchange and Add and such are missing. These can be implemented in terms of loops of CAS, but that is
a very minor suckitude.

1.B. Early AMD64 chips do not have cmpxchg16b. You have to check for its presence with a CPUID call. If it doesn't exist you are seriously fucked.
Fortunately these chips are pretty rare, so you can just use a really evil fallback to keep working on them : either disable threading completely on
them, or simply run the 32 bit version of your app. The easiest way to do that is to have your installer check the CPUID flag and install the 32 bit
x86 version of your app instead of the 64 bit version.

1.C. All your lock-free nodes become 16 bytes instead of 8 bytes. This does things like make your minimum alloc size 16 bytes instead of 8 bytes.
This is part of the general bloating of 64 bit structs and mildly sucks.
(BTW you can see this in winnt.h as MEMORY_ALLOCATION_ALIGNMENT is 16 on Win64 and 8 on Win32).

1.D. _InterlockedCompareExchange128 only exists on newer versions of MSVC so you have to write it yourself in ASM for older versions. Urg.
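
The CPUID check from 1.B is just a bit test. Here's a sketch (the helper name is made up; on MSVC you would fill in the register with __cpuid from <intrin.h>) :

```cpp
#include <cstdint>

// CPUID leaf 1 returns the feature flags in ECX ; CMPXCHG16B is bit 13.
// On MSVC : int regs[4]; __cpuid(regs,1); then pass (uint32_t)regs[2] here.
bool HasCmpXchg16b(uint32_t cpuid1_ecx)
{
    return ( (cpuid1_ecx >> 13) & 1 ) != 0;
}
```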

So #1 is an okay solution, but what are the alternatives ?

2. Pack {Pointer,Count} into 64 bits. This is of course what Windows does for SLIST, so doing this is actually very safe. Currently pointers on
Windows are only 44 bits because of this. They will move to 48 and then 52. You can easily store a 52 bit pointer + a 16 bit count in 64 bits (the 52 bit pointer
has the bottom four bits zero so you actually have 16 bits to work with). Then you can just keep using 64 bit CAS. This has no disadvantage that I know
of other than the fact that twenty years from now you'll have to touch your code again.
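
A sketch of the packing for option 2 (assuming a 64 bit build, 48 significant address bits, and 16-byte aligned nodes, so 44 stored pointer bits leave 20 bits of counter; all names invented) :

```cpp
#include <cstdint>
#include <cassert>

// pack a 16-byte-aligned 48-bit pointer + an ABA counter into one uint64
// so a plain 64 bit CAS still works
typedef uint64_t PtrCount;

const int kAlignBits  = 4;                  // nodes are 16-byte aligned
const int kPtrBits    = 48;                 // significant address bits
const int kStoredBits = kPtrBits - kAlignBits;   // 44 ; leaves 20 counter bits
const uint64_t kPtrMask = (uint64_t(1) << kStoredBits) - 1;

PtrCount PackPtrCount(void * ptr, uint32_t count)
{
    uint64_t p = (uint64_t)(uintptr_t)ptr;
    assert( (p & ((1<<kAlignBits)-1)) == 0 );   // must be 16-byte aligned
    assert( (p >> kPtrBits) == 0 );             // must fit in 48 bits
    return (p >> kAlignBits) | ((uint64_t)count << kStoredBits);
}

void *   UnpackPtr  (PtrCount pc) { return (void *)(uintptr_t)((pc & kPtrMask) << kAlignBits); }
uint32_t UnpackCount(PtrCount pc) { return (uint32_t)(pc >> kStoredBits); }
```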

3. You can implement arbitrary-sized CAS in terms of pointer CAS. The powerful standard paradigm for this type of thing is to
use pointers to data instead of data by value, so you are just swapping pointers instead of swapping values. It's very simple, when you want to change
a value, you malloc a copy of it and change the copy, and then swap in the pointer to the new version. You CAS on the pointer swap. The "malloc" can
just be taking data from a recycled buffer which uses hazard pointers to keep threads from using the same temp item at the same time.
This is a somewhat more complex way to do things conceptually, but it is very powerful and general,
and for anyone doing really serious lockfree work, a hazard pointer system is a good thing to have.
See for example "Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS".
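
In (hypothetical) C++11 std::atomic terms, the core of option 3 looks something like this. The names are invented; also note the retired old version is simply leaked here, where a real system would recycle it through hazard pointers :

```cpp
#include <atomic>

// swap pointers to data instead of the data itself, so any-sized value
// can be "CASed" with a plain pointer CAS
struct BigValue { int a, b, c; };

std::atomic<BigValue *> g_current(0);

// returns true if our new version was installed
bool UpdateValue(int newA)
{
    BigValue * oldv = g_current.load();
    BigValue * newv = new BigValue(*oldv);   // copy ..
    newv->a = newA;                          // .. modify the copy
    if ( g_current.compare_exchange_strong(oldv,newv) )
        return true;   // installed ; oldv should be retired via hazard pointers (leaked here)
    delete newv;       // someone else won the race ; throw away our copy
    return false;
}
```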

You could also of course use a hybrid of 2 & 3. You could use a packed 64 bit {pointer,count} until your pointer becomes more than 52 bits,
and then switch to a pointer to extra data.

One unexpected annoyance has been that a lot of the Win32 function signatures have changed. For example LRESULT is now pointer-sized (LONG_PTR), not
a LONG. This is a particular problem because Win32 has always made heavy use of cramming the wrong type into various places,
eg. for GetWindowLong and stuffing pointers in LPARAM's and all that kind of shit. So you wind up having tons of C-style casts when
you write Windows code. I have made good use of these guys :

BTW this all has made me realize that the recent x86-32 monotony on PC's has been a delightful stable period for development. I had almost
forgotten that it used to be always like this. Now to do simple shit in my code, I have to detect if it's x86 or x64 ; if it is x64, do I have
an MSC version that has the intrinsics I need? If not I have to write a god damn MASM file. Oh and I often have to check for Vista vs. XP to
tell if I have various kernel calls. For example :

Even ignoring the pain of the last FUCK branch which requires making a .ASM file, the fact that I had to do a bunch of version/target checks
to get the right code for the other paths is a new and evil pain.

Oh, while I'm ranting, fucking MSDN is now showing all the VS 2010 documentation by default, and they don't fucking tell you what version
things became available in.

This actually reminds me of the bad old days when I got started, when processors and instruction sets were changing rapidly. You actually
had to make different executables for 386/486 and then Pentium, and then PPro/P3/etc (not to mention the AMD chips that had their own
special shiznit). Once we got to the PPro it really settled down and we
had a wonderful monotony of well developed x86 on out-of-order machines that continued up to the new Core/Nehalem chips
(only broken by the anomalous blip of Itanium that we all ignored as it went down in flames like the Hindenburg).
Obviously we've had consoles and Mac and other platforms to deal with, but that was only an issue for real products that want portability;
I could write my own Wintel code for home and not think about any of that. Well, Wintel is monoflavor no more.

The period of CISC and chips with fancy register renaming and so-on was pretty fucking awesome for software developers, because you see the same
interface for all those chips, and then behind the scenes they do magic mumbo jumbo to turn your instructions into fucking gene sequences that
multiply and create bacterium that actually execute the instructions, but it doesn't matter because the architecture interface still just
looks the same to the software developer.

Eh !? Serious WTF !? I know RR_ASSERT is defined, and then it says it's not found !? WTF !?

Well a few lines above that is the key. There was this :

#undef assert
#define assert RR_ASSERT

which seems like it couldn't possibly cause this, right? It's just aliasing the standard C assert() to mine. Not possibly related, right?
But when I commented out that bit the problem went away. So of course my first thought is clean-rebuild all, did I have precompiled headers
on by mistake? etc. I assume the compiler has gone mad.

Well, it turns out that somewhere way back in RR_ASSERT I was in a branch that caused me to have this definition for RR_ASSERT :

#define RR_ASSERT(exp) assert(exp)

This creates a strange state for the preprocessor. RR_ASSERT is now a recursive macro. When you actually try to use it in code, the
preprocessor apparently just bails and doesn't do any text substitution. But, the name of the preprocessor symbol is still defined,
so my ifdef check still saw RR_ASSERT existing.
Evil.
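
Here's a self-contained reproduction of the trap. I've replaced the real assert with a stub function named RR_ASSERT so you can observe where the call actually lands; the mechanism is the standard "blue paint" rule, where a macro is not re-expanded during rescanning of its own replacement :

```cpp
#include <cassert>

// the innocent-looking aliasing from above :
#define RR_ASSERT(exp) assert(exp)
#undef assert
#define assert RR_ASSERT

static int s_hits = 0;

// parenthesizing the name stops the function-like macro from expanding here ;
// this is the function the "recursive" macro expansion actually resolves to :
static void (RR_ASSERT)(bool exp) { if ( exp ) s_hits++; }

static void UseIt()
{
    // RR_ASSERT -> assert -> RR_ASSERT, which is now painted blue, so the
    // preprocessor leaves these tokens as a plain call to the function above :
    RR_ASSERT(true);
}

// restore the real assert for any code after this sketch :
#undef assert
#undef RR_ASSERT
#include <cassert>
```

If no function named RR_ASSERT existed, that call would be the mysterious "RR_ASSERT not found" error, even though the macro is very much defined.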

BTW the thing that kicked this off is that fucking VC x64 doesn't support inline assembly. ARGH YOU COCK ASS. Because of that we had long
ago written something like

A major optimization paradigm I'm really missing from C++ is something I will call "loop branch inversion". The problem is
for code sharing and cleanliness you often wind up with cases where you have a lot of logic in some outer loops that find
all the things you should work on, and then in the inner loop you have to do a conditional to pick what operation to do.
eg :

The problem is that DoPerObjectWork then is some conditional, maybe something like :

DoPerObjectWork(object)
{
    switch(workType)
    {
    ...
    }
}

or even worse - it's a function pointer that you call back.

Instead you would like the switch on workType to be on the outside. workType is a constant all the way through the code, so I could just
propagate that branch up through the loops, but there's no way to express that neatly in C.

The only real option is with templates. You make DoPerObjectWork a functor and you make LoopAndDoWork a template. The other option is to
make an outer loop dispatcher to constants. That is, make workType a template parameter instead of an integer :

this is an okay solution, but it means you have to reproduce the branch on workType in the outer loop and inner loop. This is not a speed
penalty because the inner loop is a branch on a constant which goes away; it's just ugly for code maintenance purposes because they have to be
kept in sync and can be far apart in the code.

This is a general pattern - use templates to turn a variable parameter into a constant and then use an outer dispatcher to turn a variable
into the right template call. But it's ugly.
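
A concrete sketch of the pattern (all names invented for illustration). The inner branch is on a compile-time constant, so it folds away; the ugly part is the outer dispatcher repeating the branch on workType :

```cpp
enum EWorkType { eWork_Double = 0, eWork_Negate = 1 };

template <int t_workType>
static void DoPerObjectWork(int & object)
{
    // branch on a template constant - folds to a single path at compile time
    if ( t_workType == eWork_Double ) object *= 2;
    else                              object = - object;
}

template <int t_workType>
static void LoopAndDoWork(int * objects, int count)
{
    for(int i=0;i<count;i++)
        DoPerObjectWork<t_workType>(objects[i]);
}

// outer dispatcher : turns the run-time workType into a template constant
static void LoopAndDoWork_Dispatch(int workType, int * objects, int count)
{
    switch(workType)
    {
    case eWork_Double : LoopAndDoWork<eWork_Double>(objects,count); break;
    case eWork_Negate : LoopAndDoWork<eWork_Negate>(objects,count); break;
    }
}
```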

BTW when doing this kind of thing you often wind up with loops on constants. The compiler often can't figure out that a loop on a constant
can be unrolled. It's better to rearrange the loop on a constant into branches. For example I'm often doing all this on pixels where the pixel
can have between 1 and 4 channels. Instead of this :

for(int c=0;c<channels;c++)
{
    DoStuff(c);
}

where channels is a constant (template parameter), it's better to do :
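
something like this - branches on the template constant, which the compiler folds away, leaving a fully unrolled body. (DoStuff is replaced by a trivial stand-in here so the sketch is self-contained) :

```cpp
// channels is the template parameter ; each "if" tests a compile-time
// constant, so for channels==3 this compiles to three plain calls
template <int t_channels>
static void ProcessPixel(int * sums)
{
    if ( t_channels >= 1 ) sums[0]++;   // stand-in for DoStuff(0)
    if ( t_channels >= 2 ) sums[1]++;   // stand-in for DoStuff(1)
    if ( t_channels >= 3 ) sums[2]++;   // stand-in for DoStuff(2)
    if ( t_channels >= 4 ) sums[3]++;   // stand-in for DoStuff(3)
}
```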

5/26/2010

The correct way to cache things is through Windows' page cache. The advantages of doing this over using your own custom cache code are :

1. Automatically resizes based on amount of memory needed by other apps. eg. other apps can steal memory from your cache to run.

2. Automatically gives pages away to other apps or to file IO or whatever if they are touching their cache pages more often.

3. Automatically keeps the cache in memory between runs of your app (if nothing else clears it out). This is pretty immense.

Because of #3, your custom caching solution might slightly beat using the Windows cache on the first run, but on the second run it will
stomp all over you.

To do this nicely, generally the cool thing to do is make a unique file name that is the key to the data you want to cache. Write the data to
a file, then memory map it as read only to fetch it from the cache. It will now be managed by the Windows page cache and the memory map will just
hand you a page that's already in memory if it's still in cache.

The only thing that's not completely awesome about this is the reliance on the file system. It would be nice if you could do this without ever
going to the file system. eg. if the page is not in cache, I'd like Windows to call my function to fill that page rather than getting it from disk,
but so far as I know this is not possible in any easy way.

For example : say you have a bunch of compressed images as JPEG or whatever. You want to keep uncompressed caches of them in memory. The right way
is through the Windows page cache.
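
A sketch of that flow. I'm using POSIX open/mmap as the portable stand-in here; on Windows the same idea is CreateFile + CreateFileMapping + MapViewOfFile. The function names and path scheme are made up :

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// write the data under its unique cache-key file name :
static bool WriteCacheFile(const char * path, const void * data, size_t size)
{
    FILE * f = fopen(path,"wb");
    if ( ! f ) return false;
    size_t wrote = fwrite(data,1,size,f);
    fclose(f);
    return wrote == size;
}

// map it back read-only ; if the pages are still in the OS page cache,
// no disk IO happens at all :
static const void * MapCacheFile(const char * path, size_t size)
{
    int fd = open(path,O_RDONLY);
    if ( fd < 0 ) return NULL;
    void * p = mmap(NULL,size,PROT_READ,MAP_SHARED,fd,0);
    close(fd);   // the mapping holds its own reference to the file
    return ( p == MAP_FAILED ) ? NULL : p;
}
```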

My beloved "AllSnap" doesn't work on Windows 7 / x64. I can't find a replacement because fucking Windows has a feature called "Snap" now,
so you can't fucking google for it. (also searching for "Windows 7" stuff in general is a real pain because solutions and apps for the
different variants of windows don't always use the full name of the OS they are for in their page, so it's hard to search for; fucking
operating systems really need unique code names that people can use to make it possible to search for them; "Windows" is awful awful in
this regard).

I contacted the developer of AllSnap to see if he would give me the code so I could fix it, but he is ignoring me. I can tell from debugging
apps when AllSnap is installed that it seems to work by injecting a DLL. This is similar to how I hacked the poker sites for GoldBullion,
so I think I could probably reproduce that. But I dunno if Win7/x64 has changed anything about function injection and the whole DLL function
pointer remap method.

BTW/FYI the standard Windows function injection method goes like this : Make a DLL that has some event handler. Run a little app that causes that event
to trip inside the app you want to hijack. Your DLL is now invoked in that app's process to handle that event. Now you are running in that process
so you can do anything you want - in particular you can find the function table to any of their DLL's, such as user32.dll, and stuff your own function
pointer into that memory. Now when the app makes normal function calls, they go through your DLL.

5/25/2010

I just multi-threaded my video test app recently, and it was reasonably easy, but I had a few nagging bugs because of hidden ways they
were touching shared memory without protection deep inside functions. Okay, so I found them and fixed them, but I'm left with a problem -
any time I touch one of those deep functions, I could screw up the threading without realizing it. And I might not get any indication of
what I did for weeks if it's a rare race.

What I would like is a way to make this more robust. I have very strong threading primitives, I want a way to make sure that I use them!
In particular, I want to be able to mark certain structs as only touchable when a critsec is locked or whatever.

I think that a lot of this could be done with Win32 memory page protections. So far as I know there's no way to associate protections
per-thread, (eg. to make a page read/write for thread A but no-access for thread B). If I could do that it would be super sweet.

One idea is to make the page no access and then install my own exception handler that checks what thread it is, but that might be too much
overhead (and not sure if that would fail for other reasons).

The main usage is not for protected crit-sec'ed structs, that is really the easiest case to maintain because it's very obvious right there
in the code that you need to take the critsec to touch the variables. The hard case to maintain is the ad hoc "I know this is safe to touch
without protection". In particular I have a lot of code that runs like this :

Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A
Phase 2 : fire up threads. They only read from A and do so without protection. They each write to unique areas B,C,D.
Phase 3 : spin down threads. Now main thread can write A and read B,C,D.

So what I would really like to do is :

Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A
-- set A memory to be read-only !
-- set B,C,D memory to be read/write only for their own thread
Phase 2 : fire up threads. They only read from A and do so without protection. They each write to unique areas B,C,D.
-- make A,B,C,D read/write only for main thread !
Phase 3 : spin down threads. Now main thread can write A and read B,C,D.

The thing that this saves me from is when I'm tinkering in DoComplicatedStuff() which is some function called deep inside Phase 2
somewhere and I change it to no longer follow the memory access rule that it is supposed to be following. This is just my hate for
having rules for code correctness that are not enforced by the compiler or at least by run-time asserts.
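
The closest existing tool is flipping page protection at the phase boundaries. It has no per-thread granularity, but a rogue write to A during Phase 2 becomes an instant access violation instead of a rare race. A sketch using POSIX mprotect as the analogue of Win32 VirtualProtect :

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

static int PhaseProtectSketch()
{
    size_t pagesize = (size_t) sysconf(_SC_PAGESIZE);
    char * A = (char *) mmap(NULL,pagesize,PROT_READ|PROT_WRITE,
                             MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
    if ( A == (char *) MAP_FAILED ) return -1;

    // Phase 1 : only the main thread touches A :
    memset(A,7,pagesize);

    // entering Phase 2 : lock A down ; any write from any thread now faults
    mprotect(A,pagesize,PROT_READ);
    char x = A[100];   // reads are still fine

    // entering Phase 3 : threads are spun down, open A back up :
    mprotect(A,pagesize,PROT_READ|PROT_WRITE);
    A[0] = x;

    munmap(A,pagesize);
    return x;   // the value written in Phase 1
}
```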

5/21/2010

In the end movec-residual coding is inherently limited and inefficient. Let's review the big advantage of it and the big problem.

The advantage is that the encoder can reasonably easily consider {movec,residual} coding choices jointly. This is a huge advantage over
just picking what's my best movec, okay now code the residual. Because the movec affects the residual, you cannot make a good R/D decision if
you do them separately. By using block movecs, it reduces the number of options that need to be considered to a small enough set that
encoders can practically consider a few important choices and make a smart R/D decision. This is what is behind all current good video
encoders.

The disadvantage of movec-residual coding is that the movec and residual are redundant and connected in a complex and difficult-to-handle way. We send them
independently, but really they have cross-information about each other, and that is impossible to use in the standard framework.

There are obviously edges and shapes in the image which occur in both the movecs and the residuals. eg. a moving object will have a boundary,
and really this edge should be used for both the movec and residual. In the current schemes we send a movec for the block, and then the
residuals per pixel, so we now have finer grain information in the residual that should have been used to give us finer movecs per pixel, but
it's too late now.

Let's back up to fundamentals. Assume for the moment that we are still working on an 8x8 block. We want to send that block in the current frame.
We have previous frames and previous blocks within the current frame to help us. There are (256^3)^64 possible values for this block.
If we are doing lossy coding, then not all possible values for the block can be sent. I won't get into details of lossiness, so just say there
are a large number of possible values for the pixels of the block; we want to code an index to one of those values.

Each index should be sent with a different bit length based on its probability. Already we see a flaw with {movec-residual} coding - there are
tons of {movec,residual} pairs that specify the same index. Of course in a flat area lots of movecs might point to the same pixels, but even
if that is eliminated, you could go movec +1, residual +3, or movec +3, residual +1, and both ways get to +4. Redundant encoding = bit waste.
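
The waste is easy to quantify. If one output value is reachable through two equally likely redundant codings, each path carries half the value's probability, so each costs one extra bit versus coding the merged value :

```cpp
#include <cmath>

// ideal code length in bits for an event of probability p :
static double BitsFor(double p) { return -std::log2(p); }

// a value of probability p split across two redundant codings costs
// BitsFor(p/2) = BitsFor(p) + 1 per path : one whole bit wasted
```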

Now, this bit waste might not be critically bad with current simple {movec,residual} schemes - but it is a major encumbrance if we start looking
at more sophisticated mocomp options. Say you want to be able to send movecs for shapes, eg. send edges and then send a movec on each side. There
are lots of possibilities here - you might just send a movec per pixel (this seems absurdly expensive, but the motion fields are very smooth so
should code well from neighbors), or you might send a polygon mesh to specify shapes. This should give you much better motion fields, and then the
information in the motion fields can be used to predict the residuals as well. But the problem is there's too much redundancy. You have greatly
expanded the number of ways to code the same output pixels.

We could consider more modest steps as well, such as sticking with block mocomp + residual, but expanding what we can do for "mocomp". For example,
you could use two motion vectors + an arbitrary linear combination of the source blocks. Or you could do trapezoidal texture-mapping style mocomp. Or
mocomp with a vector and scale + rotation. None of these is very valuable, there are numerous problems : 1. too many ways to encode for the encoder
to do thorough R/D analysis of all of them, 2. too much redundancy, 3. still not using the joint information across residual & motion.

In the end the problem is that you are using a 6-d value {velocity,pixel} to specify a 3-d color. What you really want is a 3-d coordinate which
is not in pixel space, but rather is a sort of "screw" in motion/pixel space. That is, you want the adjacent coordinates in motion/pixel space to
be the ones that are closest together in the 6-d space. So for example RGB {100,0,0} and {0,200,50} might be neighbors in motion/pixel space if
they can be reached by small motion adjustments.

Okay this is turning into rambling, but another way of seeing it is like this : for each block, construct a custom basis transform. Don't send
a separate movec or anything - the axes of the basis transform select pixels by stepping in movec and also residual.

ADDENDUM : let me try to be more clear by doing a simple example. Say you are trying to code a block of pixels which only has 10 possible
values. You want to code with a standard motion then residual method. Say there are only 2 choices for motion. It is foolish to code
all 10 possible values for both motion vectors! That is, currently all video coders do something like :

Clearly this is foolish. For each movec, you only need to code the residual which encodes that resulting pixel block the smallest under that
movec. So you only need each output value to occur in one spot on the tree, eg.

   0 - [0,1,2,3,4]
*<
   1 - [5,6,7,8,9]

or something. That is, it's foolish to have two ways to encode the residual to reach a certain target when there were already cheaper ways to reach
that target in the movec coding portion.
To minimize this deficiency, most current coders like H264 will code blocks by either putting almost all the bits in the movec
and very few in the residual, or the other way (almost none in the movec and most in the residual). The loss occurs most when you have many bits
in the motion and many in the residual, something like :

0 - [0,1,2]
1 - [3,4,5,6]
2 - [7,8]
3 - [9]

The other huge fundamental deficiency is that the probability modeling of movecs and residuals is done in a very primitive way based only on
"they are usually small" assumptions. In particular, probability modeling of movecs needs to be done not just based on the vector, but on the
content of what is pointed at. I mentioned long ago there is a lot of redundancy there when you have lots of movecs pointing at the same thing.
Also, the residual coding should be aware of what was pointed to by the movec. For example if the movec pointed at a hard edge, then the
residual will likely also have a similar hard edge because it's likely we missed by a little bit, so you could use a custom transform that handles
that better. etc.

ADDENDUM 2 : there's something else very subtle going on that I haven't seen discussed much. The normal way of sending {movec,residual} is
actually over-complete. Mostly that's bad, too much over-completeness means you are just wasting bits, but actually some amount of over-completeness
here is a good thing. In particular for each frame we are sending a little bit of extra side information that is useful for *later* frames.
That is, we are sending enough information to decode the current frame to some quality level, plus some extra that is not really worth it for
just the current frame, but is worth it because it helps later frames.

The problem is that the amount of extra information we are sending is not well understood. That is, in the current {movec,residual} schemes we
are just sending extra information without being in control and making a specific decision. We should be choosing how much extra information to
send by evaluating whether it is actually helpful on future frames. Obviously the last frames of the video (or a sequence before a cut) you
shouldn't send any extra information.

In the examples above I'm showing how to reduce the overcomplete information down to a minimal set, but sometimes you might not want to do
that. As a very coarse example, say the true motion at a given pixel is +3 : movec=3 gets to final pixel=7, but you can code the same result
smaller by using movec=1. Deciding whether to send the true motion or not should be done based on whether it actually helps in the future;
but more importantly the code stream could collapse {3,7} and {1,7} so there is no redundant way to code if the difference is not helpful.

This becomes more important of course if you have a more complex motion scheme, like per-pixel motion or trapezoidal motion or whatever.

5/20/2010

Since we're talking about VP8 I'd like to take this chance to briefly talk about some of the stuff coming in the future. H265 is being
developed now, though it's still a long ways away. Basically at this point people are throwing lots of shit at the wall
to see what sticks (and hope they get a patent in). It is interesting to see what kind of stuff we may have in
the future. Almost none of it is really a big improvement like "duh we need to have that in our current stuff",
it's mostly "do the same thing but use more CPU".

H265 is just another movec + residual coder, with block modes and quadtree-like partitions. I'll write another post about some ideas
that are outside of this kind of scheme. Some quick notes on the kind of things we may see :

Super-resolution mocomp. There are some semi-realtime super-resolution filters being developed these days. Super-resolution lets you take
a series of frames and create an output that's higher fidelity than any one source. In particular, given a few assumptions about the underlying
source material, it can reconstruct a good guess of the higher resolution original signal before sampling to the pixel grid. This lets you
do finer subpel mocomp. Imagine for example that you have some black and white text that is slowly translating. On any one given frame there
will be lots of gray edges due to the antialiased pixel sampling. Even if you perfectly know the subpixel location of that text on the target
frame, you have no single reference frame to mocomp from. Instead you create super-resolution reference frame of the original signal and subpel
mocomp from that.

Partitioned block transforms. One of the minor improvements in image coding lately, which is natural to move to video coding, is PBT with
more flexible sizes. This means 8x16, 4x8, 4x32, whatever - lots of partition sizes, and having block transforms for that size of partition.
This lets the block transform match the data better. Which also leads us to -

Directional transforms and trained transforms. Another big step is not always using an X & Y oriented orthogonal DCT. You can get a big
win by doing directional transforms. In particular, you find the directions of edges and construct a transform that has its bases aligned
along those edges. This greatly reduces ringing and improves energy compaction. The problem is how do you signal the direction or the
transform data? One option is to code the direction as extra side information, but that is probably prohibitive overhead. A better option
is to look at the local pixels (you already have decoded neighbors) and run edge detection on them and find the local edge directions and
use that to make your transform bases. Even more extreme would be to do a fully custom transform construction from local pixels (and the
same neighborhood in the last frame), either using competition (select from a set of transforms based on which one would have done best on
those areas) or training (build the KLT for those areas). Custom trained bases are especially useful for "weird" images like Barb.
These techniques can also be used for ...

Intra prediction. Like residual transforms, you want directional intra prediction that runs along the edges of your block, and ideally you
don't want to send bits to flag that direction, rather figure it out from neighbors & previous frame (at least to condition your probabilities).
Aside from finding direction, neighbors could be used to vote for or train fully custom intra predictors. One of the H265 proposals is
basically GLICBAWLS applied to intra prediction - that is, train a local linear predictor by doing weighted LSQR on the neighborhood. There
are some other equally insane intra prediction proposals - basically any texture synthesis or prediction paper over the last 10 years is fair
game for insane H265 intra prediction proposals, so for example you have suggestions like Markov 2x2 block matching intra prediction which builds
a context from the local pixel neighborhood and then predicts pixels that have been seen in similar contexts in the image so far.

Unblocking filters ("loop filtering" huh?) are an obvious area for improvement. The biggest area for improvement is deciding
when a block edge has been created by the codec and when it is in the source data. This can actually usually be figured out if the unblocking filter
has access to not just the pixels, but how they were coded and what they were mocomped from. In particular, it can see whether the code stream
was *trying* to send a smooth curve and just couldn't because of quantization, or whether the code stream intentionally didn't send a smooth curve
(eg. it could have but chose not to).

Subpel filters. There are a lot of proposals for improved sub-pixel filters. Obviously you can use more taps to get better (sharper) frequency
response, and you can add 1/8 pel or finer. The more dramatic proposals are to go to non-separable filters, non-axis aligned filters (eg.
oriented filters), and trained/adaptive filters, either with the filter coefficients transmitted per frame or again deduced from the previous
frame. The issue is that what you have is just a pixel sampled aliased previous frame; in order to do sub-pel filtering you need to make some
assumptions about the underlying image signal; eg. what is the energy in frequencies higher than the sampling limit? Different sub-pel filters
correspond to different assumptions about the beyond-nyquist frequency content. As usual orienting filters along edges helps.

Improved entropy coding. So far as I can tell there's nothing too interesting here. Current video coders (H264) use entropy coders from the 1980's
(very similar to the Q-coder stuff in JPEG-ari), and the proposals are to bring the entropy coding into the 1990's, on the level of ECECOW
or EZDCT.

5/19/2010

If it in fact becomes a clean open-source video standard with no major patent encumbrances, it might be well integrated in Firefox,
Windows Media, etc. etc. - eg. we might actually have a video format that actually just WORKS! I don't even care if the quality/size is
really competitive. How sweet would it be if there was a format that I knew I could download and it would just play back correctly and
not give me any headaches. Right now that does not exist at all. (it's a sad fact that animated GIF is probably the most portable video
format of the moment).

Now, you might well ask - why VP8 ? To that I have no good answer. VP8 seems like a messy cock-assed standard which has nothing in
particular going for it. The entropy encoder in particular (much like H264) seems badly designed and inefficient.
The basics are completely vanilla, in that it is block based, block modes, movecs, transforms, residual coding.
In that sense it is just like MPEG1 or H265. That is a perfectly fine thing to do, and in fact it's what I've wound up doing, but you
could pull a video standard like that out of your ass in about five minutes, there's no need to license code for that. If in fact VP8 does
dodge all the existing patents then that would be a reason that it has value.

The VP8 code stream is probably pretty weak (I really don't know enough of the details to say for sure).
However, what I have learned of late is that there is massive room for the encoder to make good output video even through a weak
code stream. In fact I think a very good encoder could make good output from an MPEG2 level of code stream.
Monty at Xiph has a nice page about work on Theora. There's nothing
really cutting edge in there but it's nicely written and it's a good demonstration of the improvement you can get on a fixed standard
code stream just with encoder improvements (and really their encoder is only up to "good but still basic" and not really into the realm
of wicked-aggressive).

The only question we need to ask about the VP8 code stream is : is it flexible enough that it's possible to write a good encoder for it
over the next few years? And it seems the answer is yes. (contrast this to VP3/Theora which has a fundamentally broken code stream which
has made it very hard to write a good encoder).

ADDENDUM 2 : Something major that's been missing from the web discussions and from the literature about video for a long
time is the separation of code stream from encoder. The code stream basically gives the encoder a language and framework to work in.
The things that Jason / Dark Shikari thinks are so great about x264 are almost entirely encoder-side things that could apply to almost any
code stream (eg. "psy rdo" , "AQ", "mbtree", etc.). The literature doesn't discuss this much because they are trapped in the pit of PSNR
comparisons, in which encoder side work is not that interesting. Encoder work for PSNR is not interesting because we generally know
how to optimize directly for MSE/SSD/L2 error - very simple ways like flat quantizers and DCT-space trellis quant, etc. What's more interesting
is perceptual quality optimization in the encoder. In order to achieve good perceptual optimization, what you need is a good way to
measure perceptual error (which we don't have), and the ability to try things in the code stream and see if they improve perceptual error
(hard due to non-local effects), and a code stream that is flexible enough for the encoder to make choices that create different kinds of
errors in the output. For example adding more block modes to your video coder with different types of coding is usually/often bad in a PSNR
sense because all they do is create redundancy and take away code space from the normal modes, but it can be very good in a perceptual sense
because it gives the encoder more choice.
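The mode decision itself is just the standard Lagrangian cost J = D + lambda*R ; a trivial sketch (the numbers in the test are made up) to show why the distortion measure matters : swap an MSE-based D for a perceptual D and a different mode can win, which is exactly why extra modes can lose on PSNR but win perceptually :

```cpp
#include <cassert>

// One candidate coding mode : some distortion D and some rate R (bits).
struct Mode { double distortion; double rate; };

// Standard rate-distortion mode decision : pick the mode minimizing
// J = D + lambda * R.  Whether D is SSD or some perceptual metric
// completely changes which modes are worth having in the code stream.
static int pick_mode(const Mode* modes, int count, double lambda)
{
    int best = 0;
    double bestJ = modes[0].distortion + lambda * modes[0].rate;
    for (int i = 1; i < count; ++i) {
        double J = modes[i].distortion + lambda * modes[i].rate;
        if (J < bestJ) { bestJ = J; best = i; }
    }
    return best;
}
```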

ADDENDUM 3 : Case in point , I finally have noticed some x264 encoded videos showing up on the torrent sites. Well, about 90% of them don't
play back on my media PC right. There's some glitching problem, or the audio & video get out of sync, or the framerate is off a tiny bit, or
some shit and it's fucking annoying.

ADDENDUM 4 : I should be more clear - the most exciting thing about VP8 is that it (hopefully) provides an open
patent-free standard that can then be played with and discussed openly by the development community. Hopefully
encoders and decoder will also be open source and we will be able to talk about the techniques that go into them,
and a whole new

5/13/2010

What this means is VC thinks you have no SCC connection at all, your files are just on your disk. You need to change the default NiftyPerforce
settings so that it checks out files for you when you edit/save etc.

Advantages of NiftyPerforce without P4SCC :

1. Much faster startup / project load, because it doesn't go and check the status of everything in the project with P4.

2. No clusterfuck when you start unconnected. This is one of the worst problems with P4SCC, for example if you want to work on some work projects
but can't VPN for some reason, P4SCC will have a total shit fit about working disconnected. With the NiftyPerforce setup you just attrib your
files and go on with your business.

3. No difficulties with changing binding/etc. This is another major disaster with P4SCC. It's rare, but if you change the P4 location of a
project or change your mappings or if you already have some files added to P4 but not the project, all these things give MSdev a complete
shit-fit. That all goes away.

Disadvantages of NiftyPerforce without P4SCC :

1. The first few keystrokes are lost. When you try to edit a checked-in file, you can just start typing and Nifty will go check it out,
but until the checkout is done your keystrokes go to never-never land. Mild suckitude. Alternatively you could let MSDev pop up the dialog
for "do you want to edit this read only file" which would make you more aware of what's going on but doesn't actually fix the issue.

2. No check marks and locks in project browser to let you know what's checked in / checked out. This is not a huge deal, but it is a nice
sanity check to make sure things are working the way they should be. Instead you have to keep an eye on your P4Win window which is a mild
productivity hit.

One note about making the changeover : for existing projects that have P4SCC bindings, if you load them up in VC and tell VC to remove the
binding, it also will be "helpful" and go attrib all your files to make them writeable (it also will be unhelpful and not check out your
projects to make the change to not have them bound). Then NiftyPerforce won't work because your files are
already writeable.
The easiest way to do this right is to just open your vcproj's and sln's in a text editor and rip out all the binding bits manually.

I'm not sure yet whether the pros/cons are worth it. P4SCC actually is pretty nice once it's set up, though the ass-pain it gives when trying
to make it do something it doesn't want to do (like source control something that's out of the binding root) is pretty severe.

ADDENDUM :

I found the real pro & con of each way.

Pro P4SCC : You can just start editing files in VC and not worry about it. It auto-checks out files from P4
and you don't lose key presses. The most important case here is that it correctly handles files that you have
not got the latest revision of - it will pop up "edit current or sync first" in that case. The best way to use Nifty
seems to be Jim's suggestion - put checkout on Save, do not checkout on Edit, and allow editing of read-only files in memory.
That works great if you are a single dev but is not super awesome in an actual shared environment with heavy contention.

Pro NiftyP4 : When you're working from home over an unreliable VPN, P4SCC is just unworkable. If you lose
connection it basically hangs MSDev. This is so bad that it pretty much completely dooms P4SCC.
ARG actually I take that back a bit, NiftyP4 also hangs MSDev when you lose connection, though it's not nearly as bad.

5/12/2010

(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way
to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another
env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have
some work code and some home code open at the same time they shit their pants.
I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something
like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).

But I'm having second thoughts, because putting little config shitlets in my source dirs is one of the things I
hate about CVS. Granted it would be much better in this case - I would only need a handful of them in my top
level dirs, but another disadvantage is my p4bydir app would need to scan up the dir tree all the time to find
config files.

And there's a better way. The thing is, the P4 Client specs already have the information of what dirs on my
local machine go with what depot mappings. The problem is the client spec is not actually associated with a
server. What you need is a "port client user" setting. These are stored as favorites in P4Win, but there is
no authoritative list of the valid/good "port client user" setups on a machine.

So, my new idea is that I store my own config file somewhere that lists the valid "port client user" sets that
I want to consider in p4bydir. I load that and then grab all the client specs. I use the client specs to
identify what dirs to map to where, and the "port client user" settings to tell what p4 environment to set
for that dir.

I then replace the global p4.exe with my own p4bydir so that all apps (like NiftyPerforce) will automatically
talk to the right connection whenever they do a p4 on a file.
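A sketch of what the p4bydir lookup could look like (all the names here are hypothetical, this is just the longest-matching-root idea, not real code) : take the table built from the client specs + my "port client user" config, and map a file path to a connection by longest prefix :

```cpp
#include <cassert>
#include <string>
#include <vector>

// One entry of the (hypothetical) p4bydir table : a local root dir
// derived from a client spec, plus the "port client user" triple that
// goes with it.
struct P4Env { std::string root, port, client, user; };

// Map a file path to a connection by longest matching root prefix,
// so c:/work/... and c:/home/... automatically get different servers.
// Returns nullptr if the path is under none of the known roots.
static const P4Env* env_for_path(const std::vector<P4Env>& envs,
                                 const std::string& path)
{
    const P4Env* best = nullptr;
    for (const P4Env& e : envs)
        if (path.compare(0, e.root.size(), e.root) == 0)
            if (!best || e.root.size() > best->root.size())
                best = &e;
    return best;
}
```

Then the p4.exe replacement just sets P4PORT/P4CLIENT/P4USER from the matched entry before shelling out to the real p4.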

Since I ranted about Cleartype I thought I'd go into a bit more detail.
This article on Cleartype in Win7 is interesting, though also willfully dense.

Another research question we've asked ourselves is why do some people
prefer bi-level rendering over ClearType? Is it due to hardware issues
or is there some other attribute that we don't understand about visual
systems that is playing a role. This is an issue that has piqued our
curiosity for some time. Our first attempt at looking further into this
involved doing an informal and small-scale preference study in a
community center near Microsoft.

Wait, this is a research question ? Gee, why would I prefer perfect black and white raster fonts to smudged
and color-fringed cleartype. I just can't imagine it! Better do some community user testing...

1. 35 participants.
2. Comments for bi-level rendering:
Washed out; jiggly; sketchy; if this were a printer, I'd say it needed a new cartridge; fading out - esp. the numbers, I have to squint to read this, is it my glasses or it is me?; I can't focus on this; broken up; have to strain to read; jointed.
3. Comments for ClearType:
More defined, Looks bold (several times), looks darker, clearer (4 times), looks like it's a better computer screen (user suggested he'd pay $500 more for the better screen on a $2000 laptop), sort of more blue, solid, much easier to read (3 times), clean, crisp, I like it, shows up better, and my favorite: from an elderly woman who was rather put out that the question wasn't harder: this seems so obvious (said with a sneer.)

Oh my god, LOL, holy crap. They are obviously comparing Cleartyped anti-aliased fonts to black-and-white
rendered TrueType fonts, NOT to raster fonts. They're probably doing big fonts on a high DPI screen too.
Try it again on a 24" LCD with an 8 point font please, and compare something that has an unhinted TrueType and
an actual hand-crafted raster font. Jesus. Oh, but I must be wrong because the community survey says 94%
prefer cleartype!

Anyway, as usual the annoying thing is that in pushing their fuck-tard agenda, they refuse to acknowledge
the actual pros and cons of each method and give you the controls you really want. What I would like is
a setting to make Windows always prefer bitmap fonts when they exist, but use ClearType if it is actually
drawing anti-aliased fonts. Even then I still might not use it because I fucking hate those color fringes,
but it would be way more reasonable. Beyond that obviously you could want even more control like switching
preference for cleartype vs. bitmap per font, or turning on and off hinting per font or per app, etc. but
just some more reasonable global default would get you 90% of the way there. I would want something like
"always prefer raster font for sizes <= 14 point" or something like that.

Text editors are a simple case because you just let the user set the font and get what they want, and it
doesn't matter what size the text is because it's not laid out. PDF's and such I guess you go ahead and use TT
all the time. The web is a weird hybrid which is semi-formatted. The problem with the web is that it doesn't
tell you when formatting is important or not important. I'd like to override the basic firefox font to be
my own choice of nice bitmap font *when formatting is not important* (eg. in blocks of text like I make). But if you
do that globally it hoses the layout of some pages. And then other pages will manually request fonts which are
blurry bollocks.

CodeProject has a nice font survey
with Cleartype/no-Cleartype screen caps.

GDI++ is an interesting hack to GDI32.dll to replace
the font rendering.

5/11/2010

Coded up some new goodies for myself today and released them in a new
cblib and chuksh .

RunOrActivate : useful with a hot key program, or from the CLI. Use RunOrActivate [program name]. If a running
process of that program exists, it will be activated and made foreground. If not, a new instance is started.
Similar to the Windows built-in "shortcut key" functionality but not horribly broken like that is.

(BTW for those that don't know, Windows "shortcut keys" have had huge bugs ever since Win 95 ; they sometimes work
great, basically doing RunOrActivate, but they use some weird mechanism which causes them to not work right with
some apps (maybe they use DDE?), they also have bizarre latency semi-randomly, usually they launch the app instantly
but occasionally they just decide to wait for 10 seconds or so).

RunOrActivate also has a bonus feature : if multiple instances of that process are running it will cycle you between
them. So for example my Win-E now starts an explorer, goes to existing one if there was one, and if there were a few
it cycles between explorers. Very nice. Also works with TCC windows and Firefox Windows. This actually solves a
long-time usability problem I've had with shortcut keys that I never thought about fixing before, so huzzah.

WinMove : I've been using this forever, lets you move and resize the active window in various ways, either by manual
coordinate or with some shorthands for "left half" etc. Anyway the new bit is I just added an option for "all windows"
so that I can reproduce the Win-M minimize all behavior and Win-Shift-M restore all.

I think that gives me all Win-Key functions I actually want.

ADDENDUM : One slightly fiddly bit is the question of *which* window of a process to activate in RunOrActivate.
Windows refuses to give you any concept of the "primary" window of a process, simply sticking to the assertion
that processes can have many windows. However we all know this is bullshit because Alt-Tab picks out an
isolated set of "primary" windows to switch between. So how do you get the list of alt-tab windows? You don't.
It's "undefined", so you have to make it up somehow.
Raymond Chen describes the
algorithm used in one version of Windows.
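The cycling part is easy once you've made up your window list (on Windows you'd gather it with EnumWindows + GetWindowThreadProcessId, filtering to visible unowned top-level windows, which is roughly the alt-tab heuristic). Just the selection logic, sketched with plain ints standing in for HWNDs :

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given the process's "primary" windows (in a stable order) and the
// current foreground window, return the index of the window to
// activate : the one after current, wrapping around.  If the current
// foreground window belongs to some other process (not in the list),
// just activate the first one.
static int next_window(const std::vector<int>& windows, int current)
{
    for (std::size_t i = 0; i < windows.size(); ++i)
        if (windows[i] == current)
            return (int)((i + 1) % windows.size());
    return 0;
}
```

So repeated presses of the hot key walk you through all the explorers (or TCC windows, or whatever), and the first press from anywhere else just brings one to front.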

5/09/2010

Perforce Server was being a pain in my ass to start up because the fucking P4S service doesn't get my P4ROOT
environment variable. Rather than try to figure out the fucking Win 7 per-user environment variable shite,
the easy solution is just to move your P4S.exe into your P4ROOT directory, that way when it sees no P4ROOT
setting it will just use current directory.

New P4 Installs don't include P4Win , but you can just copy it from your old install and keep using it.

This is not a Win7 problem so much as a "newer MS systems" problem, but non-antialiased / non-cleartype text
rendering is getting nerfed. Old stuff that uses GDI will still render good old bitmap fonts fine, but newer
stuff that uses WPF has NO BITMAP FONT SUPPORT. That is, they are always using antialiasing, which is
totally inappropriate for small text (especially without cleartype). (For example MSVC 2010 has no bitmap font
support (* yes I know there are some workarounds for this)).

This is a huge fucking LOSE for serious developers. MS used to actually have better small text than Apple,
Apple always did way better at smooth large blurry WYSIWYG text shit. Now MS is just worse all around because
they have intentionally nerfed the thing they were winning at. I'm very disappointed because I always run
no-cleartype, no-antialias because small bitmap fonts are so much better. A human
font craftsman carefully choosing which pixels should be on or off is so much better than some fucking algorithm
trying to approximate a smooth curve in 3 pixels and instead giving me fucking blue and red fringes.

Obviously anti-aliased text is the *future* of text rendering, but that future is still pretty far away.
My 24" 1920x1200 that I like to work on is 94 dpi (a 30" 2560x1600 is 100 dpi, almost the same). My 17" lappy at 1920x1200 has some of the highest pixel
density that you can get for a reasonable price, it's pretty awesome for photos, but it's still only 133 dpi
which is shit for text (*). To actually do good looking antialiased text you need at least 200 dpi, and 300 would
be better. This is 5-10 years away for consumer price points. (In fact the lappy screen is the unfortunate
uncanny valley; the 24" at 1920x1200 is the perfect res where non-antialiased stuff is the right size on screen
and has the right amount of detail. If you just go to slightly higher dpi, like 133, then everything is too
small. If you then scale it up in software to make it the right size for the eye, you don't actually have
enough pixels to do that scale up. The problem is that until you get above 200 dpi where you can do arbitrary
scaling of GUI elements, the physical size of the pixel is important, and the 100 dpi pixel is just about perfect).
(* = shit for anti-aliased text, obviously great for raster fonts at 14 pels or so).
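(The dpi numbers above are just diagonal pixels over diagonal inches :)

```cpp
#include <cassert>
#include <cmath>

// dpi of a screen : length of the pixel diagonal divided by the
// physical diagonal in inches.
static double dpi(int w, int h, double diag_inches)
{
    return std::sqrt((double)w * w + (double)h * h) / diag_inches;
}
// dpi(1920,1200,24) ~ 94 , dpi(2560,1600,30) ~ 100 , dpi(1920,1200,17) ~ 133
```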

(
ADDENDUM : Urg I keep trying to turn on Cleartype and be okay with it. No no no it's not okay. They should call it "Clear Chromatic Aberration"
or "Clearly the Developers who think this is okay are colorblind". Do they think our eyes only see luma !? WTF !? Introducing colors into
my black and white text is just such a huge visual artifact that no amount of improvement to the curve shapes can make up for that.
)

It's actually pretty sweet right now living in a world where our CPU's are nice and multi-core, but most apps are still single core. It means
I can control the load on my machine myself, which is damn nice. For example I can run 4 apps and know that they will all be pretty nice and
snappy. These days I am frequently keeping 3 copies of my video test app running various tests all the time, and since it's single core I know
I have one free core to still fuck around on the computer and it's full speed. The sad thing is that once apps actually all go multi-core
this is going to go away, because when you actually have to share cores, Windows goes to shit.

Christ why is the registry still so fucking broken? 1. If you are a developer, please please make your apps not
use the registry. Put config files in the same dir as your .exe. 2. The Registry is just a bunch of text strings,
why is it not fucking version controlled? I want a log of the changes and I want to know what app made the change
when. WTF.

The only decent way to get environment variables set is with TCC "set /S" or "set /U".

"C:\Program Files (x86)" is a huge fucking annoyance. Not only does it break by muscle memory and break a ton of batch files I had that
looked for program files, but now I have a fucking quandary every time I'm trying to hunt down a program.. err is it in x86 or not? I really
don't like that decision. I understand it's needed for if you actually have an x86 and x64 version of the same app installed, but that is
very rare, and you should have only bifurcated paths on apps that actually do have a dual install.
(also because lots of apps hard code to c:\program files , they have a horrible hack where they let 32 bit apps think they are actually in
c:\program files when they are in "C:\Program Files (x86)"). Blurg.

- how to move your perforce depot . Annoyingly I used a
different machine name for new lappy and thus a different clientview, so MSVC P4SCC fails to make the connection and wants to rebind every project.
The easiest way to fix this is just to not use P4SCC and kill all your bindings and just use NiftyPerforce without P4SCC.

(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way
to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another
env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have
some work code and some home code open at the same time they shit their pants.
I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something
like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).

allSnap - make all windows snap . AllSnap for x64/Win7 seems to be broken, but the old 32 bit one seems
to work just fine still. (ADDENDUM : nope, old allsnap randomly crashes in Win 7, do not use)