Here is the preprint of my MS thesis with the title:
“Hierarchical Temporal Memory Based Autonomous Agent For Partially Observable Video Game Environments”

I defended it a couple of days ago. It proposes a non-player character architecture that combines HTM and TD(lambda), and presents results on a 3D video game learning task. The architecture incorporates abstracted functionalities of layer 5 and the basal ganglia pathways in order to produce voluntary actions, and the thesis presents the relevant neurobiological research behind these choices. Below is the supplementary video for the thesis. I recommend reading the thesis first so that the video makes sense. Enjoy!

@bela.berde Anything specific you want to know? You can see that 6 HTM layers with 800 columns and 8 neurons per column learn in real time (around 20ms in total) on a midrange laptop. The exact timings are presented in the results section.

Do you plan to set up a Planning Phase in this environment for multi-agent formation?

A multi-agent formation was the idea behind the gameplay. If by Planning Phase you mean communication among agents to accomplish a task, I do not think a conventional planning phase would be suitable for HTM. The main problem is that every agent has its own representation of the same environment state, and these are fuzzy neural activations. So the communication would require a higher-level mechanism to associate these representations with each other.

Rather than a planning phase, my plan is to make it so that the agents demonstrate their learned behavior sequences to others once in a while. You can teach the agent as a player at the moment. This would hopefully result in learning and improving the behavior of the others.

xentnex:

Is HTM a good fit for this kind of video game play from the perspective of memory, CPU, etc.?

On the computational costs, an agent with 10 million synapses takes up around 200 MB in memory and 50 MB when serialized to disk. Each iteration of the architecture takes around 20 ms per agent with 800x8 HTM layers. So if you want to pull this off in real time (30 fps, or roughly 33 ms per frame), you can’t update, for example, 10 agents in a single iteration. You need some sort of interleaved computation; a single agent update per game loop iteration would work too. Otherwise you would have to go for smaller HTM layers (for example 512x8), which would decrease the agent’s capability. Of course, a GPU or parallel implementation could work wonders. So I would say HTM is not the best choice in terms of resources, but it has potential.
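As a rough illustration of the interleaving idea, here is a minimal round-robin sketch. It assumes the only cost in a frame is the agent update (rendering and game logic are ignored), and the function names and the 10-agent setup are hypothetical, not taken from the thesis code.

```python
from itertools import cycle

FRAME_BUDGET_MS = 1000.0 / 30.0   # about 33.3 ms per frame at 30 fps
AGENT_UPDATE_MS = 20.0            # per-agent cost quoted in the post (800x8 layers)

def agents_per_frame(frame_budget_ms=FRAME_BUDGET_MS, update_ms=AGENT_UPDATE_MS):
    """Number of full agent updates that fit into one frame (at least one)."""
    return max(1, int(frame_budget_ms // update_ms))

def interleaved_schedule(agent_ids, num_frames, per_frame):
    """Round-robin: each frame updates the next `per_frame` agents in turn."""
    order = cycle(agent_ids)
    return [[next(order) for _ in range(per_frame)] for _ in range(num_frames)]

per_frame = agents_per_frame()
schedule = interleaved_schedule(list(range(10)), num_frames=10, per_frame=per_frame)

for frame, agents in enumerate(schedule):
    print(f"frame {frame}: update agents {agents}")
```

Under these assumptions only one update fits per frame, so with 10 agents each agent is updated once every 10 frames, about three times per second.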

This is really cool, and quite amazing that you found time to write the game engine, 3D rendering system and all, alongside the research. I’m not embarrassed to admit it will take me quite a few read-throughs before I can follow it completely.

Hey @sunguralikaan, congratulations on this awesome work! And thanks for the shout-out.

First of all, excellent visualizations. But the technical contribution is even more intriguing. I’m deeply involved in reinforcement learning at the moment, so I’ll be digesting this quite carefully.

I especially like that you’ve modelled a visual sensor, so in principle this architecture could be applied to the standard battery of benchmarks in video-game-based reinforcement learning. I know you’ve just defended, so I’m sorry for making you answer more of these, but as an academic I have a few necessary questions:

Have you compared this method to the successful contemporary approaches (Q-learning, advantage actor-critic) on the environment you created?

Since you have a visual interface, have you tried applying your system to the standard Atari or VizDoom benchmark environments? They’re easy to get up and running (pip install gym; pip install vizdoom) so it might be worth the time.

Comparing to the existing techniques on your custom environment and on established task environments will make it easier for people to engage with your contributions. Once again, very cool work that I’ll be giving my full attention!

So the scope of the thesis was to propose a real-time HTM based autonomous agent directed by neurobiological research. The task itself was daunting enough given the aim. I would have loved to have comparisons but they were low priority in the end.

jakebruce:

Have you compared this method to the successful contemporary approaches (Q-learning, advantage actor-critic) on the environment you created?

Do you mean an HTM-QL system? I have not compared HTM-TD(lambda) to HTM-QL because it did not make sense in terms of neurobiology. QL decouples actions from states, which we (the HTM community) believe is not true: layer 5 output is both the state and the action. Also, what the agent actually does does not affect its state values in QL (it is off-policy). So there is that too. In addition, none of the computational models of the basal ganglia utilize QL, but there are ones imitating TD(lambda) because of its correlation with dopamine secretion in the striatum.
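To make the on-policy versus off-policy distinction concrete, here is a minimal tabular sketch of the two one-step learning targets. It uses action values and SARSA as the on-policy stand-in purely for illustration; the numbers are made up, and this is not the HTM-TD(lambda) mechanism from the thesis.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + gamma * q_next[next_action]

def q_learning_target(reward, gamma, q_next):
    """Off-policy: bootstrap from the greedy action, whatever the agent does."""
    return reward + gamma * max(q_next.values())

# Made-up action values for the next state.
q_next = {"left": 1.0, "right": 3.0}
reward, gamma = 0.0, 0.9

on_policy_explore = sarsa_target(reward, gamma, q_next, "left")
on_policy_greedy = sarsa_target(reward, gamma, q_next, "right")
off_policy = q_learning_target(reward, gamma, q_next)

print(on_policy_explore, on_policy_greedy, off_policy)
```

The on-policy target changes with the agent's next action, while the Q-learning target is the same whichever action is taken.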

If you are asking about just applying QL or advantage actor-critic on the task without HTM, I guess the agent state would be the 2D image. No I have not tried that but I am almost sure that a basic QL would beat the architecture by quite a margin IF you could represent all the possible visual data as different states and map it on the memory. I think this would also run faster than the proposed architecture given the complexity of the learning task, a basic POMDP.

jakebruce:

Since you have a visual interface, have you tried applying your system to the standard Atari or VizDoom benchmark environments? They’re easy to get up and running (pip install gym; pip install vizdoom) so it might be worth the time.

I am definitely interested in this, especially in VizDoom, since I was at the conference where it was first presented, CIG 2016. They also host some other video game AI benchmarks like GVGAI (General Video Game AI). On the other hand, I argued at that same conference that our current benchmark environments are limiting us severely on the path to general intelligence. We are only evaluating the output of the underlying intelligence through these benchmarks, not the functionality of intelligence. This is why I am interested in HTM.

jakebruce:

Comparing to the existing techniques on your custom environment and on established task environments will make it easier for people to engage with your contributions.

I totally agree with you on that, and the study is surely missing some form of comparison (other than random walk) at this point. Then again, I am sure HTM in its current state, and by extension this architecture, would get butchered by the other state-of-the-art approaches in these benchmarks, and I think you know why. One could only present that it learns in real-time, online and continuously as advantages if you leave out the neurobiology part. Now if there was another approach that claimed neurobiological plausibility that also had these benchmarks, then a comparison would be meaningful. The closest one is Nengo, Spaun and they are understandably not interested in these benchmarks. So I guess what I am trying to say is, this sort of an approach misses the point of HTM.

There was another AI benchmark proposed a year ago, the GoodAI General AI Challenge. I think this would be a better candidate. Having read their evaluation metrics, I am not sure they were able to come up with a proper benchmark design, but it looks better than what we have for evaluating general intelligence. It is hard to evaluate GI after all.

Goals of the Round

o To get working examples of agents that can acquire skills in a gradual manner and use learned skills to learn new skills (increasing the efficiency of learning).

o We are not optimizing for agent’s performance of existing skills (how good an agent is at delivering solutions for problems it knows). Instead, we are optimizing for agent’s performance on solving new/unseen problems.

Example:
if an agent is presented with a new/unseen problem, how fast (i.e. in how many simulation steps) will it deliver a good solution? This also includes a question of how fast the agent will be at discovering this new solution. If the agent has already learned to find solutions for similar problems, it should use existing skills in order to discover the new skill.

o Agents must provably use gradual learning and will be evaluated on how fast they are (how many simulation steps do they need) at discovering acceptable solutions to new tasks.

This looks like a much better fit for the architecture I proposed. I just spawn the agent in and watch it learn. I could take the same agent, put it in a new task and watch it learn again. Of course, it is not very good at it at this point. I would love it if they came up with a way to evaluate spatiotemporal abstractions where the agent gradually works on higher-level abstractions (union/temporal pooling, as you also know).

TL;DR: Thanks for the crucial questions. A VizDoom benchmark would certainly be helpful but it had a lower priority compared to presenting a better architecture. Hopefully in the future.

No problem! Very understandable; that’s what I expected.

sunguralikaan:

If you are asking about just applying QL or advantage actor-critic on the task without HTM, I guess the agent state would be the 2D image. No I have not tried that but I am almost sure that a basic QL would beat the architecture by quite a margin IF you could represent all the possible visual data as different states and map it on the memory. I think this would also run faster than the proposed architecture given the complexity of the learning task, a basic POMDP.

That’s what I meant, yeah. Most of the big recent successes in RL have been using the RGB pixel array as the observation, and the convolutional layers learn adequate features using the end-to-end gradient updates. If you do end up making some comparisons, check out OpenAI universe’s starter agent for an easy system to get up and running. It’s just a few feedforward convolutional layers with a recurrent layer on top that outputs the policy. It’s advantage actor-critic, which is an on-policy approach that has a significant degree of overlap with biology (reward prediction error, eligibility traces, online learning).

I have an intuition that HTM-based systems could in principle do more aggressive weight updates because of their sparsity, and therefore learn from fewer samples than either Q-learning or actor-critic can, but I haven’t attempted to demonstrate this (vision is hard enough as it is without trying to put one-shot learning on top).

sunguralikaan:

One could only present that it learns in real-time, online and continuously as advantages if you leave out the neurobiology part. Now if there was another approach that claimed neurobiological plausibility that also had these benchmarks, then a comparison would be meaningful.

True, but even if you approached a fraction of the performance of the engineered solutions, I think that would be compelling to many people. And it may be closer than you think. I also would suspect that many people consider actor-critic architectures to be quite biologically plausible at a high level, since the critic maps directly onto dopaminergic activity, but of course it depends on your personal tolerance of backpropagation as a learning rule.

sunguralikaan:

There was another AI benchmark proposed a year ago, the GoodAI General AI Challenge. I think this would be a better candidate. Having read their evaluation metrics, I am not sure they were able to come up with a proper benchmark design, but it looks better than what we have for evaluating general intelligence. It is hard to evaluate GI after all.

A lot of people are looking at similar things under the umbrella of transfer learning, and hierarchical reinforcement learning, and the field of artificial curiosity is similar. An agent should be able to learn much more quickly by leveraging its previous knowledge to apply those skills to new problems and domains. Learning on one level of Doom and then transferring to a new level, for example.

However, we can’t yet even train an agent to navigate a simple maze without giving it months’ worth of simulated experience, so maybe a simple benchmark isn’t a bad place to start.

It’s advantage actor-critic, which is an on-policy approach that has a significant degree of overlap with biology (reward prediction error, eligibility traces, online learning).

jakebruce:

I also would suspect that many people consider actor-critic architectures to be quite biologically plausible at a high level, since the critic maps directly onto dopaminergic activity, but of course it depends on your personal tolerance of backpropagation as a learning rule.

I am confused about one thing though: isn’t the architecture I proposed an actor-critic architecture [HTM-TD(lambda)]? According to my understanding, TD(lambda) is the essence of actor-critic architectures, along with LSTD (least-squares temporal difference), and I presented relevant research on its biological plausibility. So yes, I agree with its dopaminergic correlation. But I also presented research on the biological plausibility of backpropagation, which has been known to be implausible since the 1980s. Many people have proposed biologically more plausible learning alternatives since then, HTM being one of them.

From what you wrote, it is as if actor-critic models were using something different than TD(lambda). Are we on the same page or am I missing something?

Actor-critic algorithms are TD-based, but so is Q-learning. Basically every approach to reinforcement learning uses the concept of TD, where you bootstrap from your prediction rather than the reward signal itself. I was just talking about how the contemporary approaches may be more biologically plausible than one might expect.

sunguralikaan:

I also presented research on the biological plausibility of backpropagation, which has been known to be implausible since the 1980s.

That is indeed the current consensus, but it’s definitely not a settled issue. For example, a fair amount of work has proposed variations of backpropagation that could feasibly be implemented in the brain. One recent example, called feedback alignment [1], proposes that fixed random projections can implement a variation on backpropagation that can work equally well under certain circumstances, even across multiple layers. Backpropagation-through-time definitely seems harder to integrate with biology, but it’s not necessarily out of the question.

In any case, backpropagation-based convolutional networks for actor-critic may be more directly bio-plausible than those using Q-learning (correspondences between experience buffers and hippocampal replay notwithstanding), so it could be a good target for comparison. And the most successful current reinforcement learning system (A3C) is an actor-critic architecture, so it’s definitely a reasonable choice.

Hi @sunguralikaan,
Please let me add my congratulations and thanks for the work you have done. The analysis and SW tools you created are super impressive! I am looking forward to reading your thesis soon (need to set aside some time for that!). You made my day/week/…

From what you wrote, it is as if actor-critic models were using something different than TD(lambda). Are we on the same page or am I missing something?

After reading your thesis in detail, I see the source of the confusion. Most contemporary reinforcement learning operates in the forward view with gamma-discounted trajectory rollouts, whereas I see now you’re operating in the backward view with eligibility traces. My original point just amounts to a suggestion to compare against A3C, which although it’s a forward-view approach, is still broadly biologically plausible (I view the forward and backward approaches as essentially just an implementation detail).
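For readers who want to see why the two views can be treated as interchangeable, here is a minimal tabular sketch. It assumes offline updates (values held fixed during the episode) and accumulating traces, which is the textbook setting where the forward-view lambda-return and the backward-view eligibility traces give the same total update. The episode, values and parameters are made up, and none of this is code from the thesis.

```python
def forward_view_updates(states, rewards, values, alpha, gamma, lam):
    """Offline lambda-return updates, summed per state over one episode."""
    T = len(states)
    totals = {s: 0.0 for s in values}
    for t in range(T):
        # n-step returns for n = 1 .. T - t; the last one is the full return.
        n_step = []
        discounted_sum = 0.0
        for n in range(1, T - t + 1):
            discounted_sum += gamma ** (n - 1) * rewards[t + n - 1]
            bootstrap = values[states[t + n]] if t + n < T else 0.0
            n_step.append(discounted_sum + gamma ** n * bootstrap)
        lam_return = sum((1 - lam) * lam ** (n - 1) * g
                         for n, g in enumerate(n_step[:-1], start=1))
        lam_return += lam ** (T - t - 1) * n_step[-1]
        totals[states[t]] += alpha * (lam_return - values[states[t]])
    return totals

def backward_view_updates(states, rewards, values, alpha, gamma, lam):
    """Offline TD(lambda) with accumulating eligibility traces."""
    T = len(states)
    totals = {s: 0.0 for s in values}
    trace = {s: 0.0 for s in values}
    for t in range(T):
        next_value = values[states[t + 1]] if t + 1 < T else 0.0
        td_error = rewards[t] + gamma * next_value - values[states[t]]
        for s in trace:
            trace[s] *= gamma * lam      # every trace decays each step
        trace[states[t]] += 1.0          # the visited state becomes eligible
        for s in trace:
            totals[s] += alpha * td_error * trace[s]
    return totals

# rewards[t] is the reward received after leaving states[t]; the episode then ends.
states = ["A", "B", "A", "C"]
rewards = [0.0, 1.0, 0.0, 2.0]
values = {"A": 0.5, "B": -0.2, "C": 1.0}

forward = forward_view_updates(states, rewards, values, alpha=0.1, gamma=0.9, lam=0.8)
backward = backward_view_updates(states, rewards, values, alpha=0.1, gamma=0.9, lam=0.8)
print(forward)
print(backward)
```

With online updates the two views agree only approximately, so the "implementation detail" reading is exact only under the offline assumption used here.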

Wow! These 3D visualizations are insane!!! Congratulations!! They gave me new ideas for my new robotics-related tool, which will be integrated into NuPIC Studio (which now looks like a stone-age tool compared to yours… Maybe it’s time to replace the PyQtGraph library with some 3D game engine to show neurons…).

I view the forward and backward approaches as essentially just an implementation detail

I am having some difficulty visualizing how a forward RL strategy might be implemented with HTM-like neurons. Forgetting about pooling, layers, and other brain structures for a minute, at the most basic level, it is easy to visualize how future rewards can be learned and predicted by visualizing a cell representing reward growing connections with other cells that represent motor/context over time. Something like:

This is of course depicting a backward RL strategy. The motor/context cells are essentially an eligibility trace which a current reward is able to connect with when it happens some time after that motor/context occurred, allowing it to be predicted in advance when a semantically similar motor/context occurs again.
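The backward strategy described here can be sketched as a toy model. Everything in it is a hypothetical illustration: the `RewardCell` class, the decay, increment and threshold values, and the cell names are assumptions, not an HTM or NuPIC API. A reward cell keeps a fading eligibility trace of recently active motor/context cells and strengthens its synapses to them when a reward arrives.

```python
class RewardCell:
    """Toy reward-predicting cell (hypothetical; not an HTM or NuPIC API)."""

    def __init__(self, decay=0.5, increment=0.1, threshold=0.2):
        self.decay = decay            # how fast eligibility fades each step
        self.increment = increment    # permanence gain per unit of eligibility
        self.threshold = threshold    # permanence needed for a connected synapse
        self.eligibility = {}         # context cell -> fading trace of recent activity
        self.permanence = {}          # context cell -> synapse strength to this cell

    def step(self, active_cells, reward):
        # Fade old traces, then mark the currently active context cells.
        for cell in self.eligibility:
            self.eligibility[cell] *= self.decay
        for cell in active_cells:
            self.eligibility[cell] = 1.0
        # A reward strengthens synapses from every still-eligible cell.
        if reward > 0:
            for cell, trace in self.eligibility.items():
                gain = self.increment * reward * trace
                self.permanence[cell] = self.permanence.get(cell, 0.0) + gain

    def predicts_reward(self, active_cells):
        # Predictive when any active context cell has a connected synapse.
        return any(self.permanence.get(c, 0.0) >= self.threshold
                   for c in active_cells)


cell = RewardCell()
predicted_before = cell.predicts_reward({"wait"})

# Five repetitions of a made-up sequence that ends in a reward.
for _ in range(5):
    cell.step({"press_button"}, reward=0.0)
    cell.step({"wait"}, reward=0.0)
    cell.step({"door_open"}, reward=1.0)

predicted_from_wait = cell.predicts_reward({"wait"})
predicted_from_press = cell.predicts_reward({"press_button"})
```

In this toy setup, context closer in time to the reward becomes predictive first; earlier context needs more repetitions because its trace has decayed further by the time the reward arrives.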

What would be the basic connections between neurons (and what would those neurons represent) in a forward RL strategy? I know this is getting off topic from sunguralikaan’s paper, so we can break this into a new topic if the answer isn’t simple.