Month: October 2015

In my previous post I presented SDRRL, and in the one before that a demo of that algorithm. Since then, I have made many improvements to the algorithm, vastly increasing performance, both in terms of convergence rate and processing power used. I have another demo, but this time it is not a web demo, since it is something I used for internal testing that I just cleaned up a bit 🙂

SDRRL v2.0 Demo

I present to you a simple “Big Dog” style demo, where SDRRL must learn to move a robotic dog body to the right. Almost all of the processing time is taken up by the physics engine rather than the AI.

When running the demo, press T to speed up time, and K to reverse the walking direction.

So, the first feature is one I harp on a lot, and something backpropagation-based solutions lack: fully online learning without experience replay or stochastic sampling (which have horrible computational complexity).

The second feature is there because this is based on my PRSDR algorithm, which is basically a hierarchical LSTM replacement (for those interested, I have some performance benchmarks showing the upsides and downsides). It’s the usual HTM-like bidirectional predictive hierarchy thing.

Actions are selected by perturbing the predictions towards actions that lead to higher reward. Right now I am using a simple policy gradient method to do this.

Now, the last two points are sort of the same thing: this model has imagination. I’m serious! The basic idea is as follows: leak some of your own predictions into your input. This way, the model tries to run and predict not only off of the world, but also off of itself. It tries to predict its own predictions, which leads to a sort of sensory-implanting imagination similar to what humans have. Sure, this imagination isn’t really necessary for AGI, but I think it’s a good heuristic to speed up learning. It allows for simulation of situations ahead of time, and planning as a result.
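To make the "leak" idea concrete, here is a minimal Python sketch. It assumes the leak is a simple linear blend between the real observation and the model's own prediction of that observation; the function name `leak_predictions` and the `leak` parameter are hypothetical, and the actual mixing rule used in the model may well differ.

```python
def leak_predictions(observation, prediction, leak=0.2):
    """Blend the model's own prediction into its next input.

    ASSUMPTION: a linear mix. leak=0.0 means the model sees only the world,
    leak=1.0 means it sees only its own predictions (pure "imagination").
    """
    if not 0.0 <= leak <= 1.0:
        raise ValueError("leak must be in [0, 1]")
    return [(1.0 - leak) * o + leak * p
            for o, p in zip(observation, prediction)]


# Example: a quarter of the input comes from the model's own prediction.
mixed = leak_predictions([1.0, 0.0], [0.0, 1.0], leak=0.25)
```

With `leak` set to 1.0 the model would run entirely off its own predictions, which is one way to read "simulation of situations ahead of time".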

Other than that the model uses good ol’ SARSA for reward prediction and temporal difference error generation.

I am working on some demos for it now, and am trying to train it on the ALE. Let’s see how that goes!

In the last post I provided an online demo for a new reinforcement learning algorithm. However, I did not provide many details on how it actually works, certainly not nearly enough for anyone else to code it without looking at the JavaScript themselves.

So, here is a tutorial on how to get your own super-processing-power-efficient reinforcement learner!

SDRRL

Overview

SDRRL (sparse distributed representation reinforcement learning) was created in order to produce a simple reinforcement learning agent that can handle continuous states and actions, and can assign credit back in time. However, what sets it apart from other reinforcement learners with similar capabilities is the computational efficiency of the algorithm.

An important characteristic of SDRRL is also that it doesn’t use backpropagation or the experience replay/rehearsal that is often associated with backpropagation-based reinforcement learners. Instead, it avoids catastrophic interference by using SDRs (sparse distributed representations). These are also a source of computational efficiency, since as their name suggests, we often only have to update a handful of nodes at a time.

SDRRL is an actor-critic algorithm. Both the actor and critic feed off of the sparse distributed representations for input, which in turn are learned in an unsupervised fashion. Q values and the policy are updated using an algorithm similar to CACLA (read) that has been extended with eligibility traces.

Below is an image describing the overall network architecture. Connections are only shown between active nodes. Note that the hidden layer is binary, a by-product of the particular SDR learning algorithm used in this case. All other nodes are continuous.

Here, each node in the hidden layer is called a cell. This naming comes from HTM terminology (hierarchical temporal memory). This reinforcement learning agent is meant as a possible explanation of cortical microcolumn functionality (this will be the subject of a future post).

The inhibitory sheath surrounds the cells, and ensures that on average the cells have some desired sparsity level. The sheath can be viewed as a set of inhibitory connections between the cells, with every cell connected to every other cell.

In this tutorial, we will use a very simple sparse distributed representation learning algorithm, inspired by more complex models such as SAILnet (read) which use spiking neurons.

SDR formation

An iteration of the algorithm starts by computing the sparse distributed representation of the input:

Where A is the activation value of the cell, I is the inhibition, and S is the resulting state. W, B, and N are weights (B is a per-cell bias). In is the input to the agent.

The idea behind this is as follows: We compute a linear combination of the inputs for each cell based on their “feed forward” weights, and then inhibit them to be sparse using their “lateral” weights. The lateral connections do not provide any inhibition if the activation of the current cell is greater than that of the neighbor. If it is less, however, it receives the inhibition weight N. This is a sort of comparison-based inhibition algorithm. The advantage of this over iterative algorithms that solve for the SDRs is that it is blazing fast.
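The description above can be sketched in plain Python as follows. This is only my reading of the prose, not the demo's actual code: I am assuming that the bias B is added to the linear combination, that the inhibition a cell receives is the sum of the lateral weights N from every neighbor with a strictly larger activation, and that the binary state S is 1 when the activation exceeds the inhibition. The function name `sdr_forward` is hypothetical.

```python
def sdr_forward(inputs, W, B, N):
    """One comparison-based SDR pass. Returns (A, I, S).

    W[i][k]: feed-forward weight from input k to cell i
    B[i]:    per-cell bias (ASSUMED to be added to the activation)
    N[i][j]: lateral inhibitory weight from cell j onto cell i
    """
    n = len(W)
    # Activation: linear combination of the inputs plus bias.
    A = [sum(w * x for w, x in zip(W[i], inputs)) + B[i] for i in range(n)]
    # Inhibition: only neighbors that are MORE active than cell i inhibit it.
    I = [sum(N[i][j] for j in range(n) if j != i and A[j] > A[i])
         for i in range(n)]
    # State: binary, on only if the activation survives the inhibition (ASSUMED).
    S = [1.0 if A[i] - I[i] > 0.0 else 0.0 for i in range(n)]
    return A, I, S


# Three cells, two inputs; cell 0 matches the input best and wins.
W = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0]]
B = [0.0, 0.0, 0.0]
N = [[0.0, 0.6, 0.6], [0.6, 0.0, 0.6], [0.6, 0.6, 0.0]]
A, I, S = sdr_forward([1.0, 0.0], W, B, N)
print(S)  # [1.0, 0.0, 0.0]
```

Note that there is no iterative settling loop here: each cell is compared against its neighbors exactly once, which is where the speed comes from.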

Now, we compute a reconstruction of the input based on this SDR:

This is basically just a reverse-activation of the “feed forward” weights. There are alternatives to this reconstruction method, such as Oja’s rule (Hebbian learning), but I find that this produces better results.
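As a sketch of what "reverse-activation" could mean, the snippet below sends the binary states back through the same feed-forward weights, so each input is reconstructed as the sum of the weights of the active cells. The function name `reconstruct` and the `W[i][k]` layout are my own assumptions.

```python
def reconstruct(S, W):
    """Reconstruct the input from the SDR by running W backwards.

    S[i]:    binary state of cell i
    W[i][k]: feed-forward weight from input k to cell i (ASSUMED layout)
    """
    n_inputs = len(W[0])
    return [sum(W[i][k] * S[i] for i in range(len(W)))
            for k in range(n_inputs)]


W = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0]]
rec = reconstruct([1.0, 0.0, 0.0], W)
print(rec)  # [1.0, 0.0]
```

Since S is sparse and binary, a real implementation would only need to visit the handful of active cells.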

To learn the SDR weights, we use the following updates:

Where alpha, beta, and omega are learning rates. Rho is the target sparsity level.

The intuition here is that we update the feed-forward weights to minimize the reconstruction error. We only update on states (S) that are 1, the rest can be ignored. The “lateral” weights are updated such that the covariance of two cells is equal to the covariance of two ideally sparse cells. Finally, the bias (B) is adjusted to simply maintain sparsity levels.
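Here is one way the three updates could look in Python. Everything about the exact form is an assumption on my part, modeled on SAILnet-style rules: alpha is taken as the feed-forward rate, beta as the lateral rate, and omega as the bias rate; the lateral target is rho squared (the co-activation of two independent cells that are each on with probability rho); and lateral weights are clamped at zero so they stay inhibitory. The name `sdr_learn` is hypothetical.

```python
def sdr_learn(inputs, S, rec, W, B, N,
              alpha=0.1, beta=0.05, omega=0.01, rho=0.1):
    """Update W, B and N in place after one SDR pass (ASSUMED update forms)."""
    n = len(W)
    for i in range(n):
        # Feed-forward: reduce reconstruction error, active cells only.
        if S[i] > 0.0:
            for k in range(len(inputs)):
                W[i][k] += alpha * (inputs[k] - rec[k])
        # Lateral: push co-activation towards that of ideally sparse cells.
        for j in range(n):
            if j != i:
                N[i][j] = max(0.0, N[i][j] + beta * (S[i] * S[j] - rho * rho))
        # Bias: raise it if the cell fires too rarely, lower it if too often.
        B[i] += omega * (rho - S[i])


W = [[0.5, 0.0], [0.0, 0.0]]
B = [0.0, 0.0]
N = [[0.0, 0.5], [0.5, 0.0]]
sdr_learn([1.0, 0.0], [1.0, 0.0], [0.5, 0.0], W, B, N)
```

In this example only cell 0 is active, so only its feed-forward weights move (towards the input), while both biases drift to pull each cell's firing rate towards rho.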

Now we have a way of learning SDRs. This method is fully online; it doesn’t need any sort of stochastic sampling due to its sparseness.

Q and Actions

Q values and actions are trained on the SDRs we now have. Both are updated with different forms of eligibility traces.

First, we compute the actions Act and Q value from the SDR:

Where V and P are weights, and f is the logistic sigmoid function.
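A small Python sketch of this readout step follows. I am assuming that V holds the Q weights and P holds the action weights, that Q is a plain linear readout of the SDR, and that the sigmoid f is applied only to the actions (so every action lands in the range 0 to 1). The name `q_and_actions` is hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid f, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def q_and_actions(S, V, P):
    """Read Q and the actions off the SDR.

    V[i]:    weight from cell i to Q (ASSUMED linear, no squashing)
    P[j][i]: weight from cell i to action j (ASSUMED squashed by f)
    """
    q = sum(v * s for v, s in zip(V, S))
    act = [sigmoid(sum(p * s for p, s in zip(P[j], S)))
           for j in range(len(P))]
    return q, act


S = [1.0, 0.0, 0.0]
V = [0.5, 0.2, 0.1]
P = [[0.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
q, act = q_and_actions(S, V, P)
```

Because S is binary and sparse, both sums reduce to adding up the weights of the few active cells.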

An exploratory version of the action (Act) must be generated; we will call this Actexp. The exploration can be done with various methods; I simply used a combination of epsilon-greedy and normal distribution perturbations.
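One possible way to combine the two kinds of exploration is sketched below in Python. The specifics are assumptions: with probability `epsilon` an action component is replaced by a uniform random value, otherwise it gets Gaussian noise with standard deviation `sigma`, and the result is clamped to the 0 to 1 range of a sigmoid output. The name `explore` and the default values are hypothetical.

```python
import random


def explore(act, epsilon=0.1, sigma=0.05, rng=random):
    """Return an exploratory copy of the action vector (Actexp).

    ASSUMPTION: epsilon-greedy and Gaussian noise are applied per component.
    """
    act_exp = []
    for a in act:
        if rng.random() < epsilon:
            # Epsilon-greedy branch: a completely random action value.
            act_exp.append(rng.random())
        else:
            # Otherwise: small normal perturbation, clamped to [0, 1].
            act_exp.append(min(1.0, max(0.0, a + rng.gauss(0.0, sigma))))
    return act_exp


# A seeded generator makes a run repeatable.
act_exp = explore([0.3, 0.7], rng=random.Random(0))
```

Passing a seeded `random.Random` instance is handy when comparing runs, while the default uses the module-level generator.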

Once we have Q, we can compute the temporal difference error using SARSA:
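For reference, the textbook form of the SARSA temporal difference error is given below. Here r is the reward, gamma is the discount factor, and s and a are the state and the action actually taken; the exact time indexing used in the demo is not shown in this post, so treat the subscripts as the standard convention rather than the demo's.

```latex
\delta_t = r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
```

In an online agent this is usually computed one step late: the Q value from the previous timestep is kept around, and the error is formed once the new reward and the new Q value are available.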

We are then ready to update the Q and action weights:

Where T and E are eligibility traces, and phi and theta are learning rates. Lambda is a trace decay factor, which is piled on top of gamma.

Here, Q uses replacing traces, hence the max function. The traces for the actions function like momentum. They can also be modified to standard accumulating traces if desired, but I find that these work better.
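To tie the pieces together, here is a hedged Python sketch of the update step. The exact forms are my assumptions: phi is taken as the Q learning rate and theta as the action learning rate, the Q trace T is a replacing trace built with max, the action trace E is a decayed running sum of the difference between the exploratory and the greedy action, and, in CACLA fashion, the action weights only move when the TD error is positive. The name `update_weights_and_traces` is hypothetical.

```python
def update_weights_and_traces(delta, S, act, act_exp, V, P, T, E,
                              phi=0.1, theta=0.1, gamma=0.99, lam=0.9):
    """Apply the TD error through the traces, then refresh the traces.

    All arguments except delta are lists that are modified in place.
    """
    decay = gamma * lam  # lambda is piled on top of gamma
    # 1) Credit earlier steps: the traces still describe the previous steps.
    for i in range(len(V)):
        V[i] += phi * delta * T[i]
    if delta > 0.0:  # CACLA-style gate (ASSUMED): learn only from good surprises
        for j in range(len(P)):
            for i in range(len(S)):
                P[j][i] += theta * E[j][i]
    # 2) Decay the traces and mark what happened on the current step.
    for i in range(len(S)):
        T[i] = max(decay * T[i], S[i])  # replacing trace
        for j in range(len(P)):
            # Momentum-like trace of the exploration direction (ASSUMED form).
            E[j][i] = decay * E[j][i] + (act_exp[j] - act[j]) * S[i]


V, T = [0.0, 0.0], [1.0, 0.0]
P, E = [[0.0, 0.0]], [[0.2, 0.0]]
update_weights_and_traces(1.0, [0.0, 1.0], [0.5], [0.7], V, P, T, E,
                          gamma=0.9, lam=0.5)
```

Swapping the `max` for a plain sum would turn the Q trace into a standard accumulating trace.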

Conclusion

So, if all goes well, you should have a fun little reinforcement learning agent that is fully online and uses no backpropagation or stochastic sampling/experience replay/rehearsal.

I created a new reinforcement learning algorithm, and thanks to this new website, I have a three.js demo for it in this post!

The reinforcement learning algorithm is a combination of my one-iteration sparse distributed representation unsupervised learning algorithm and a version of the continuous actor-critic learning automaton with eligibility traces.

It works entirely without backpropagation! It also doesn’t use stochastic sampling from a replay buffer. The SDRs ensure that there is little to no catastrophic interference. Everything is updated in one go over the weights per timestep.

This algorithm is still a bit of a prototype, but I think it works well enough to warrant a demo!

When running the demo, you can speed up time by dragging the slider in the controls menu.

The bits at the top left represent the current SDR.

The agent should learn to crawl within a few seconds with the speed turned up to max.
It may get stuck at times; if this is the case, just refresh the page!