A quick post about the results from my first comparison of a 2-layer fully connected network vs. a DagNN.

I've removed most of the random variables for this example so that the comparison is pretty accurate. The only random variable left is the order in which things are trained due to SGD - however, as I removed more and more random variables, the differences shifted more in favor of DagNN, not less.

The conclusion of this test is that DagNN is better node-for-node per epoch than the standard 2-layer fully connected network - at least in this example.

I had an idea the other day while reading a paper about how they passed residuals around layers to keep the gradient going for really deep networks - to help alleviate the vanishing gradient issue. Then it occurred to me that perhaps this splitting of networks into layers is not the best way to go about it. After all, the brain isn't organized into strict layers of convolution, pooling, etc... So perhaps this is us humans trying to force structure onto an unstructured task. Thus the DagNN - Directed Acyclic Graph Neural Network - was born over the weekend.

First, a quick description of why/how many Deep Neural Networks are trained today as I understand it.

The vanishing gradient problem is a problem for neural networks that arises because of how back-propagation works. You take the difference between the output of the network and the desired output, take the derivative at the output node, and pass that back through the network weighted by the connections. Then repeat for the connections on the next layer up. So you are passing a derivative of a derivative for a 1-hidden-layer network, a derivative of a derivative of a derivative for a 2-layer network, and so on. These numbers get "vanishingly" small very quickly - so much so that you typically get *worse* results with a network of 3 or more layers than with just 1 or 2.
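To make that concrete, here's a tiny toy sketch (my own illustration, assuming sigmoid activations - not anything from a real network): the sigmoid derivative peaks at 0.25, and back-propagation multiplies roughly one such factor in per layer, so the surviving gradient shrinks geometrically with depth.

```
import numpy as np

# Toy illustration: one chain-rule factor per layer, each at most 0.25.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
gradient = 1.0  # pretend the error derivative at the output is 1.0
for layer in range(8):
    z = rng.normal()                              # pre-activation of some node
    gradient *= sigmoid(z) * (1.0 - sigmoid(z))   # chain rule factor for this layer
    print(f"after {layer + 1} layers: gradient factor ~ {gradient:.2e}")
```

After a handful of layers the factor is already down in the 1e-5 range, which is the whole problem.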

So, how do you train "deep" networks with many layers? Typically with unsupervised pre-training, usually via an auto-encoder: you train the network one layer at a time, stacking each new layer on top of the last, with no training goal other than to reproduce the input. Each time you add a layer, you lock the weights of the prior layers.
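Roughly, the layer-wise pre-training looks like the sketch below. The sizes, learning rate, and data are all made-up placeholders, and a real implementation would use proper mini-batching; this is just the shape of the idea.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(data, hidden, epochs=50, lr=0.1, seed=0):
    """Train one encode/decode pair to reproduce `data`, return the encoder weights."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    w_enc = rng.normal(scale=0.1, size=(d, hidden))
    w_dec = rng.normal(scale=0.1, size=(hidden, d))
    for _ in range(epochs):
        h = sigmoid(data @ w_enc)            # encode
        recon = h @ w_dec                    # decode (linear output)
        err = recon - data                   # reconstruction error
        grad_dec = h.T @ err / n             # gradient descent on squared error
        grad_h = err @ w_dec.T * h * (1.0 - h)
        grad_enc = data.T @ grad_h / n
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc

# Stack layers: train one layer at a time, lock it, feed its output upward.
rng = np.random.default_rng(1)
activations = rng.random((256, 64))          # stand-in for training images
stack = []
for hidden in (48, 32, 16):
    w = train_autoencoder_layer(activations, hidden)
    stack.append(w)                          # this layer's weights are now frozen
    activations = sigmoid(activations @ w)   # becomes the input to the next layer
```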

This means you're training a generic many-layer network just to "understand" images in general as a combination of layered patterns, rather than to solve any particular task. Which is better than nothing, but (intuitively) certainly not as good as if you could actually train the *entire* network to solve a specific task.

The solution: If you could somehow pass the gradient further down into the network, then you can train it "deeper" to solve specific tasks.

Back to DagNNs.

The basic premise follows the idea that if you pass the gradient further down the network, then you can train deeper networks to solve specific tasks. Win! But how?

Simple: remove the whole concept of layers and just connect every node to every prior node, allowing any computation to build on any other prior computation to solve the output. This means the gradient filters through the entire network from the output in fewer hops. The way I like to think about DagNNs is the small-world phenomenon - or the degrees of Kevin Bacon, if you prefer. You want your network to be able to reach useful information in 2-3 hops, or the gradient tends to vanish.

Pro tip: if you want to bound computational complexity, limit each node to a random N prior connections.
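Here's a hedged sketch of how I read the structure (forward pass only, and all the names and sizes are illustrative placeholders, not the actual implementation): nodes are kept in a fixed order, and each node's input is a weighted sum over a random subset of all earlier nodes, so there are no layers at all.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_inputs, n_nodes, fan_in = 8, 32, 6    # fan_in = random N prior connections per node

values = np.zeros(n_inputs + n_nodes)
values[:n_inputs] = rng.random(n_inputs)   # network inputs occupy the first slots

edges = []                                  # (node, predecessor indices, weights)
for node in range(n_inputs, n_inputs + n_nodes):
    k = min(fan_in, node)                   # can only connect to nodes before it
    preds = rng.choice(node, size=k, replace=False)
    weights = rng.normal(scale=0.5, size=k)
    edges.append((node, preds, weights))

for node, preds, weights in edges:          # evaluate in topological (index) order
    values[node] = sigmoid(values[preds] @ weights)

output = values[-1]                         # treat the last node as the output
```

Because every node can reach the output in a single edge (or a handful of them), the gradient at the output doesn't have to survive a long chain of layers to influence early computation.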

I'm trying out this idea now, and at least initially it is showing promise. I can now train far bigger fully connected networks than I could before. I'll release source when I have more proof in the pudding - and by proof, I mean proof for me too: I need to train it on MNIST and compare results.

Neural Networks offer great promise with their ability to "create" algorithms to solve problems - without the programmer knowing how to solve the problem in the first place. Example-based problem solving, if you will. I would expect that if you knew precisely how to solve a particular problem to the same degree, you could code the solution directly and get something perhaps many orders of magnitude faster and possibly higher quality -- however, it's not always easy or practical to know such a solution.

One opportunity with NNs that I find most interesting is that, no matter how slow they are, you can use NNs as a kind of existence proof -- does an algorithm exist to solve this problem at all?

Of course, when I'm talking about "problems" I'm referring to input-to-output mappings via some generic algorithm or something. Not everything cleanly fits into this definition, but many things do. Of course there are many different kinds of NNs for solving various different kinds of problems too.

After working with NNs for a while (as anybody who has will tell you), I can say that Neural Networks are asymmetric in complexity. That is, training a neural network to accomplish a task can take extreme amounts of time (days is common). However, executing a previously trained Neural Network is embarrassingly parallel and maps pretty well to GPUs. Running a NN can be done in real time if the network is simple enough!

I have spent considerable amounts of time figuring out how to train Neural Networks faster. The generally recommended practice these days is Stochastic Gradient Descent (SGD) with Back-Propagation (BP): you take a random piece of data out of your training set, train with it, and then repeat.
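In code, plain SGD is about as simple as it sounds. This is only a sketch: `net`, `net.backprop`, and `net.apply_gradients` are placeholders for whatever network and back-prop implementation you actually use.

```
import numpy as np

def sgd(net, samples, targets, epochs, learning_rate):
    """Plain SGD: one random sample, one back-prop update, repeat."""
    rng = np.random.default_rng(0)
    n = len(samples)
    for _ in range(epochs):
        for _ in range(n):
            i = rng.integers(n)                        # pick a random training sample
            grads = net.backprop(samples[i], targets[i])
            net.apply_gradients(grads, learning_rate)  # take one small step
```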

SGD works, but is *incredibly* slow at converging.

I endeavored to improve the training performance here (how could you not, you spend a *lot* of time waiting...)

There are many different techniques that improve upon BP (Adam, etc., etc.), however in my measurements each of them is slower. Regardless of the steeper descent they provide per step, they take more computation to provide it, so when you measure not by epoch but by wall clock time, it's actually slower.

So then came the theory that if you somehow knew the precise order in which to train the samples, you could train perfectly to the correct solution in some minimum amount of time. I don't know if there is a theorem about this or not, but if not, you've now heard of it. It seemed like common sense to me.

In any case, the question then becomes: is there a heuristic which can approximate this theoretical "perfect" ordering?

The first thing I tried turned out to be very hard to beat: calculate the error on all samples, sort the training order by decreasing error, and then train only on the worst 25%.
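A minimal sketch of that heuristic for one epoch (again, `net.error` and `net.train_one` are placeholders for whatever per-sample error and training calls your implementation exposes):

```
import numpy as np

def hardest_first_epoch(net, samples, targets, fraction=0.25, lr=0.01):
    """Sort by per-sample error, then train only on the worst `fraction`."""
    errors = np.array([net.error(s, t) for s, t in zip(samples, targets)])
    order = np.argsort(errors)[::-1]                   # worst error first
    keep = order[: max(1, int(len(order) * fraction))]
    for i in keep:
        net.train_one(samples[i], targets[i], lr)
```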

The speedup from this approach was pretty awesome, but again I got bored waiting, so I went further. Essentially you don't waste time training on the easy stuff and instead concentrate on the parts the network has problems with.

I then tried many variations on this, but the one that ended up working even better (30% improved training time) was taking the sorted order and splitting it into three sections: easy, medium, and hard. Then reorganize the training order into hard, medium, easy, hard, medium, easy, hard, medium, easy, etc...
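A small sketch of building that interleaved order from the per-sample errors (only the ordering step; training on the result works like the earlier sketch):

```
import numpy as np
from itertools import zip_longest

def interleaved_order(errors):
    """Sort hardest-first, split into thirds, then interleave hard/medium/easy."""
    order = np.argsort(errors)[::-1]                   # hardest first
    third = len(order) // 3
    hard, medium, easy = order[:third], order[third:2 * third], order[2 * third:]
    interleaved = []
    for group in zip_longest(hard, medium, easy):      # pads short thirds with None
        interleaved.extend(i for i in group if i is not None)
    return interleaved                                  # training order for one epoch
```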

Not only did this improve the training time - it also was able to train to an overall lower error than without.

Another option that works pretty well is to just take the 25% highest-error samples and randomize the order. It's easier to implement and also works really well. It should also be a better approach overall (vs. unrandomized), as it seems more robust to training situations where the error explodes (which does happen in some cases).
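That variant is tiny - a sketch, using the same placeholder error array as above:

```
import numpy as np

def shuffled_hardest(errors, fraction=0.25, seed=0):
    """Take the worst `fraction` of samples by error and shuffle their order."""
    rng = np.random.default_rng(seed)
    keep = np.argsort(errors)[::-1][: max(1, int(len(errors) * fraction))]
    return rng.permutation(keep)        # training order for this epoch
```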

That's generally how I would approach a finite and small-ish data set.

I am also developing a technique based on this that works for significantly larger data sets - ones that cannot possibly fit in memory (hundreds or thousands of images).

Thus far the setup is fairly similar, except you pick some small batch of images and do basically the same as above with that.

There are some interesting relationships between batch size (number of images) and training time/quality.

In my data set, increasing the batch size reduces the variance of the solution error across the training set and also appears, so far, to reduce the number of epochs required to converge. However, a bigger batch is also slower per epoch, so the jury is still out on whether bigger is better - but going too small certainly makes it harder to converge on a general solution.

ICtCp claims to provide an improved color representation designed for high dynamic range (HDR) and wide color gamut (WCG). It also claims that, in terms of CIEDE2000 color quantization errors, 10-bit ICtCp is equivalent to 11.5-bit YCbCr. Constant luminance is also improved: ICtCp has a luminance relationship of 0.998 between the luma and encoded brightness, while YCbCr has a relationship of 0.819. Improved constant luminance is an advantage for color processing operations such as chroma subsampling and gamut mapping, where only the color information is changed.

Note, that I haven't verified these claims yet...

Again, here is this color space converted to LDR via tone mapping.

I = 10,000L

Animating I from 0L to 10,000L

Evaluation of Color Spaces

To evaluate the color spaces, I am looking for a few different properties (that come to mind):

When interpolating between colors, we want the CIE Luma to be reasonably constant. This is especially important when subsampling to 4:2:2 or 4:2:0 - as well as important when decoding if we stretch the video beyond its 1:1 pixel size (for example when rendering to a texture, or rendering a lower resolution video to a higher resolution display).

Ringing artifacts must be evaluated which result naturally from a DCT transform common among video codecs. For example, if you have a really bright point on a darker background, how would the ringing artifacts appear?

Quantization artifacts must also be considered as a result of normal video encoding.

In this series of blog posts, I'm going to cover various parts of the research and development behind Bink 2 HDR.

So the first thing is deciding on an encoding, of which there are very many to choose from.

There is...

RGBM

RGBE

XYZE

CIE XYZ

LogLuv (both 24 & 32 bit)

ILM OpenEXR (EXR) (supposedly the best)

Microsoft scRGB

scRGB-nl

scYCC-nl

Just to name a few. Additionally, with video games we have extra constraints such as texture filtering and performance considerations, etc... For example, bi-linear filtering is a linear operation, so the luma representation would have to behave correctly under linear transforms (or at least be fast enough to decode that it wouldn't matter to first decode and then interpolate). On top of that, for a video format like Bink, we need to consider various compression artifacts and what those would look like.

With so many different formats to choose from, you have to take a step back and instead look at the actual encoding used by the output itself - as that really determines what is best (or can be used directly). Which leads me to the next topic: SMPTE 2084.

SMPTE-2084 is the format which Dolby Vision and HDR10 displays use - so it's basically the narrow part of the pipeline. Everything you want to display has to go through this non-linear encoding at some point before being shown on the TV (and decoded back to linear in the process).

The SMPTE-2084 spec is locked behind a paywall (yay) - I have purchased it and will boil it down to what I believe are the most important parts.

The format defines luma in absolute values from 0 to 10,000 cd/m^2 (candelas per square meter) - with the caveat that in real implementations of the spec, 10k luma won't actually be representable in anything but pure white. Additionally, actual displays deviate from the absolute curve due to output limitations and the effects of non-ideal viewing environments.

While the format supports 10, 12, 14, and 16-bit luma representations, as currently deployed Dolby Vision is 12-bit and HDR10 is 10-bit. 14 and 16-bit are not widely deployed - if deployed at all anywhere other than the reference monitor.

The pretty nice thing about the limited range here (10 to 12 bits) is that you can pre-generate a table that decodes into floats and store it in a texture or constant buffer or whatever. This makes decoding rather inexpensive!
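As a sketch of what that table build looks like, here is the commonly published SMPTE ST 2084 (PQ) EOTF evaluated once per code value - the constants below are the usual published PQ constants, but verify them against the spec before relying on this:

```
import numpy as np

# Commonly published PQ (ST 2084) constants.
M1 = 2610.0 / 16384.0            # 0.1593017578125
M2 = 2523.0 / 4096.0 * 128.0     # 78.84375
C1 = 3424.0 / 4096.0             # 0.8359375
C2 = 2413.0 / 4096.0 * 32.0      # 18.8515625
C3 = 2392.0 / 4096.0 * 32.0      # 18.6875

def pq_decode(code):
    """Non-linear code value in [0, 1] -> absolute luminance in cd/m^2."""
    e = np.power(code, 1.0 / M2)
    return 10000.0 * np.power(np.maximum(e - C1, 0.0) / (C2 - C3 * e), 1.0 / M1)

def build_decode_table(bits=10):
    """One float per code value; small enough to live in a texture or constant buffer."""
    codes = np.arange(2 ** bits) / float(2 ** bits - 1)
    return pq_decode(codes).astype(np.float32)

table = build_decode_table(10)   # table[1023] ~= 10000 cd/m^2, table[0] == 0
```

At 10 bits that's a 1024-entry float table, which is why the decode ends up so cheap.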

There are still some open questions here regarding its suitability as a video encoding format that I have. Namely...

Ringing artifacts from a DCT transform, how bad are they?

Quantization artifacts

What is the error from linear interpolating the encoded format vs decoding then linear interpolating values?
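For that last question, a minimal sketch of how one might start probing it (my own toy experiment, using the same commonly published PQ constants as above, and only a single pair of luminance values rather than real image data):

```
import numpy as np

M1, M2 = 2610.0 / 16384.0, 2523.0 / 4096.0 * 128.0
C1, C2, C3 = 3424.0 / 4096.0, 2413.0 / 4096.0 * 32.0, 2392.0 / 4096.0 * 32.0

def pq_encode(nits):
    """Absolute luminance in cd/m^2 -> non-linear code value in [0, 1]."""
    y = np.power(np.asarray(nits) / 10000.0, M1)
    return np.power((C1 + C2 * y) / (1.0 + C3 * y), M2)

def pq_decode(code):
    e = np.power(code, 1.0 / M2)
    return 10000.0 * np.power(np.maximum(e - C1, 0.0) / (C2 - C3 * e), 1.0 / M1)

a, b = 5.0, 2000.0                     # a dark pixel next to a bright one
linear_mid = 0.5 * (a + b)             # decode first, then interpolate
encoded_mid = pq_decode(0.5 * (pq_encode(a) + pq_encode(b)))   # interpolate the codes
print(linear_mid, encoded_mid)         # the gap between these is the filtering error
```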

Note that this writer takes advantage of the fact that MPEG-1/2 is designed so that you can literally concatenate files together to combine movies. Technically each frame is its own movie (until P-frames are implemented, at which point a set of frames would be its own movie).

Some video players mess this up and don't decode it correctly, but MPlayer, SMPlayer, FFmpeg, and others work correctly. So this is great as a quick intermediate format!
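The concatenation itself is just byte-level appending - a tiny sketch, with made-up file names:

```
# Build one playable MPEG-1/2 stream by appending per-frame (or per-GOP) chunks.
def concatenate_mpegs(chunk_paths, out_path):
    with open(out_path, "wb") as out:
        for path in chunk_paths:
            with open(path, "rb") as chunk:
                out.write(chunk.read())

concatenate_mpegs(["frame_000.mpg", "frame_001.mpg"], "combined.mpg")
```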

Many years ago I started work on translating a poorly written paper on a really good shadow mapping technique for publication in Game Engine Gems. I never finished the paper, but the technique is used in production in Firefall. Of all the shadow mapping techniques I've tried, it's the best for Firefall's use case. Rather than wait for perfection, I figured I'd just post it in case somebody finds it useful.

The primary features are:

1) 398 lines of code
2) No memory allocations
3) Public Domain
4) NeuQuant based quantizer

I think there is still a lot that could be done with it, but I feel it's a good version 1.0.

Namely, it's missing alpha support. Easy to do, but it needs to be done. Also some improvements to the color quantization: after going through NeuQuant I found quite a few things which I feel can be improved, which I hope I have time to get to.

The Crescent Bay demo was very good. The headset was light, latency was fantastic, the picture was solid and sharp, taking very good advantage of low persistence, and the content was beautifully rendered and very enjoyable. All told, a very solid improvement over DK2.

The Valve demo was better, and it's hard to explain why. Their tracking was just as good, their latency and/or persistence was *slightly* worse, I *think* the headset was heavier (I didn't have both at the same time to directly compare), and they had no HRTF audio - so it wasn't technically a win, but still it was better for some reason... Why?

I think for a few reasons. One is that the walkable area is 5 times bigger than what Oculus demoed, increasing immersion immensely; second, they had perfectly tracked controllers which gave you a way to interact with the virtual world in a very fun and personal way; and third (and most important), the content they showed was amazingly fun and really showed off the walkable area and controller interaction.

The first demo was a small controller introduction where you would press the right trigger and a balloon would blow up out of your hand and float away. It was physically simulated, so you could then interact with the balloon with the controllers. At one point I tried to catch the balloon by instinctively pressing it against myself, but it went right through me (which was a very weird sensation). This demo was so incredibly fun.

The second demo, IIRC, was one where I was on the bridge of a sunken ship under the ocean. Lots of creatures swam by, including a giant whale. Very peaceful.

Next, I think, was the VR painting demo, which showed a really cool 3D interface and some pretty awesome painting. Very beautiful and very fun!

Another demo was a tabletop game where miniature people were fighting each other. Pretty cool, but nothing to write home about. Though I can see a cool game being made with this kind of setup.

There was a surgeon simulator demo which was pretty darn awesome :) You are in space with an alien on a table and another table with various tools on it. Your controllers turned into hands which you could open and close, very similar to a prosthetic hand. It was a little awkward, but I laughed and had a lot of fun doing surgery, then taking the alien's organs and stuffing them into its mouth. Lol

There was another demo where you would walk around, and depending on where you walked, the room would change to a different room. I think this demo was there to show off a kind of transportation/travel method. It was fun, but a bit confusing.

The last demo was an Aperture Science demo where they had you open drawers, pull levers, and try to fix a broken robot. It was lots of fun. Then finally the walls were torn away to reveal that you were standing in a shipping crate. A giant robot came by. The floor started to tear away. It was just fantastic.

That's the end of the demos and then something funny happened that I can't fully explain.

When the headset came off, I had a very primal need to get back into VR. I didn't *want* to get back into VR, I *needed* to. Something was compelling me. I noticed it right away as foreign, and was a bit confused about how a VR experience could elicit a drug-like response.

I spend a lot of time in VR, so this is very unusual.

I think the primary cause is the content difference, but I can't be sure.

I have been told I am not the only one, and that is kinda cool and also kinda scary. It means awesome things if you are a VR dev, since it means VR will spread like an unstoppable virus. They will not be able to make VR headsets fast enough to meet demand - not by a long shot.

The downside is that there are some possibly serious and negative societal side effects of VR - beyond what I already worried about before I knew it was this addictive. We may see some government regulation of VR.

All told though, as a VR dev myself, I'm super excited and impressed that VR has come this far in such a short time. One thing is clear: the future is virtual.