New Ideas for Reinforcement Training a Neural Net

February 15, 2020

I want to reiterate, or just iterate, that I have almost no knowledge about software-based neural networks. I am guessing at things as I try to learn more.

In my last post, I mentioned my thoughts about how neural networks work. I also mentioned two types of training: evolutionary and reinforcement. I know exactly how the evolutionary process works and since it is straightforward, you probably do too. Reinforcement training is what had me stumped. But after a day of thought and a few minutes of writing code, I think I know why reinforcement training is impossible for some networks.

Think of a pet dog and how it learns to do things by getting treats when it succeeds at a task. There are pathways through the brain that become “stronger” after a success. If we investigate current OCR neural networks (for recognizing handwritten text), we can see that the typical network produces a strong output for the character it “thinks” is most likely and weak output for the other less-likely characters. There are paths through the network that essentially point to a result. If you are using reinforcement training, you can have the network strengthen all of the strong connections and weaken all of the weak connections. That simplification works pretty well and makes a bit of sense. When given the exact same input, the network should then produce a stronger output for the character it “thinks” is best.
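Here's a toy Python sketch of that strengthen-the-strong, weaken-the-weak idea. Everything in it (the function name, the compare-to-the-average trick, the 0.1 rate) is my own invention for illustration, not any real library's method:

```python
# Toy sketch: after a correct answer, strengthen the connections that
# carried strong signals and weaken the ones that carried weak signals.
# The averaging rule and the 0.1 rate are made-up choices.

def reinforce(weights, activations, rate=0.1):
    """Scale each connection up or down based on how active it was."""
    mean_act = sum(activations) / len(activations)
    new_weights = []
    for w, a in zip(weights, activations):
        if a > mean_act:                      # a "strong" connection on this run
            new_weights.append(w * (1 + rate))
        else:                                 # a "weak" connection
            new_weights.append(w * (1 - rate))
    return new_weights

weights = [0.5, 0.2, 0.8]
activations = [0.9, 0.1, 0.7]   # how much signal each connection carried
print(reinforce(weights, activations))
```

Run on the same input again, the strengthened connections contribute more, so the network's preferred answer gets even stronger, which is the dog-treat effect.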

There is a theory that the human brain does this sort of thing over and over during training but then during sleep, all connections are weakened. The result of the weakening is to “throw out” false positives. Some of our brain’s learned actions are wrong and those are usually influenced by weaker connections. This theoretical unlearning process could also be done with a software neural network and if the strengthening and weakening math is done just right, it should work well.
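The sleep-time weakening could be as simple as a uniform decay applied to every connection. Again a made-up sketch with a made-up decay factor:

```python
# Toy sketch of the theorized "sleep" step: weaken every connection a little.
# The 0.9 decay factor is my own made-up number.
def sleep_decay(weights, factor=0.9):
    return [w * factor for w in weights]

print(sleep_decay([0.55, 0.18, 0.88]))
```

Connections that get strengthened over and over survive many decay passes, while connections that were only weakly reinforced fade toward zero, which is the "throw out false positives" effect.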

How can this not work?

The answer is that some neural networks don’t use a system of outputs where the one with the highest strength is the answer. If you look at examples of networks for balancing a ball on a platform, you will see a network whose output is an angle for the platform. This simplified network doesn’t pick a result from a set of possibilities; it outputs a value that is used to directly manipulate the platform. After a success, how would the software go about reinforcing the output? It can’t strengthen the connections numerically because that changes the result. There is no obvious, simple way to feed a “reward” back into this network.
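To see the problem concretely, here's a toy two-input, one-output network (all the numbers are made up). A naive "reward" that scales the connections up doesn't reinforce the action at all; it just changes the angle the network outputs:

```python
# Toy two-input, one-output "platform angle" network, to show the problem:
# scaling the connections after a success changes the angle itself, so the
# network no longer reproduces the action we wanted to reward.
def platform_angle(inputs, weights):
    return sum(i * w for i, w in zip(inputs, weights))

inputs = [0.3, 0.6]        # e.g. ball position and velocity (made-up values)
weights = [2.0, -1.5]
before = platform_angle(inputs, weights)

strengthened = [w * 1.1 for w in weights]   # naive attempt at a "reward"
after = platform_angle(inputs, strengthened)

print(before, after)   # the "rewarded" network now outputs a different angle
```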

I did try to come up with a reinforcement training system for a single-output neural network and failed. I tried to think of how connections could have a strength value separate from the output value, but I didn’t get far enough to decide if it would work. In a simple two-input, three-intermediate, one-output balance-beam network, it isn’t clear what a weak connection carrying a strong value would do.

One solution to the single-output problem is to change the network to have many outputs. Having an output for every 10 degrees of desired beam rotation might work, and those outputs could feed into a single angle-computation function that produces one angle. If the beam is in a good spot, the connections are all strengthened and that should, if done well, make the desired angle more likely when given the same input later. The reason this works is that the connections form a strong path through the network to a result. The single-output network uses a strong value, not a path, to get a result. It’s easy to reward the multiple-output network for picking a strong path by making that path more likely in the future. And yes, that 10-degree idea won’t work too well since output strength could still end up representing an angle somewhere. It’s not perfect at all. Maybe a second set of outputs could add a correction to the selected 10-degree value, and another set could add to that result, giving even more resolution. An additive system where it’s all about the paths, but the selected results get added together to produce the final angle, could work very well… and take way more space than a simple six-node network with a single output.
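Here's the many-output idea as a toy Python sketch: one output per 10-degree bin, the strongest output wins, and the "reward" strengthens only the connections feeding the winning bin. The bin spacing, the random starting weights, and the 1.1 boost are all my own made-up numbers:

```python
# Sketch of the many-output idea: one output per 10-degree bin, pick the
# strongest output, and after a success strengthen only the winning path.
import random

random.seed(0)
BINS = list(range(-90, 91, 10))          # candidate platform angles

# one weight vector per output bin (2 inputs -> 19 outputs)
weights = [[random.uniform(-1, 1) for _ in range(2)] for _ in BINS]

def pick_angle(inputs):
    """Score every bin and return (winning index, winning angle)."""
    scores = [sum(i * w for i, w in zip(inputs, ws)) for ws in weights]
    best = scores.index(max(scores))
    return best, BINS[best]

inputs = [0.3, 0.6]
best, angle = pick_angle(inputs)

# reward: strengthen only the winning bin's connections
weights[best] = [w * 1.1 for w in weights[best]]

# the same input still selects the same bin, now with a stronger score
assert pick_angle(inputs)[0] == best
```

Because the reward only makes the winning path's score larger, the same input keeps selecting the same bin afterward, which is exactly the "make that path more likely in the future" behavior that the single-output network couldn't give us.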

That’s my interpretation of how reinforcement can work with a neural network: the network must be one that uses paths to get to results. Is this how the human brain works? It seems like the brain can’t really be using a single-output network for anything, or there would be no sensible way to train it… as far as I know :)