Well, the Facebook group is a very lively community for discussing AI-related topics, but in terms of sustainability this topic fits much better on this public forum. First of all, here is the reference to my post inside the Facebook group.

For my bachelor thesis, I'm examining the process of applying artificial neural networks to the behavior of single combat units. So far I've implemented a complete training flow, and now it's time to examine the whole matter in order to produce some proper results. This is basically my first serious project dealing with AI. Before this, I experimented with steering behaviors and made some attempts at developing AI for my game "Beastly Rivals Onslaught". So I sometimes notice my lack of experience in the field of AI, which is basically why I intend to discuss this whole matter on this forum. Of course it has nice side effects for everybody who's interested, but I'm definitely not intending to let you guys do the work of my thesis. With that said, time for some details and questions:

My repository is located on GitHub. Right now the training environment is a 10-marines-vs-10-marines fight on a blank map. Stimpack is available. I attached an image showing the flowchart of my app design.

flowchart.png (62.6 KiB)

So the very first thing I'd like to talk about is the output actions. Right now I'm employing a state machine which executes the decisions made by the neural net (during training the action is chosen randomly). These are the current output actions:

So far, each action is limited to a certain number of frames. An attack takes 4 frames to execute, moving currently takes 7 frames, and activating the Stimpack ability is a one-frame action.
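To make the frame-limited scheme concrete, here is a minimal sketch of how such a per-action frame lock could look. The action names and frame counts are taken from the post; the class and everything else are hypothetical illustrations, not the actual implementation:

```python
# Hypothetical sketch of frame-limited output actions, assuming a simple
# "lock the unit for N frames per action" scheme as described in the post.
ACTION_DURATIONS = {
    "Attack": 4,       # frames until an attack counts as executed
    "Move": 7,         # frames a move command is allowed to run
    "UseStimpack": 1,  # one-frame action
}

class ActionState:
    """Tracks which action a unit is executing and for how many more frames."""
    def __init__(self):
        self.current = None
        self.frames_left = 0

    def start(self, action):
        self.current = action
        self.frames_left = ACTION_DURATIONS[action]

    def tick(self):
        """Advance one frame; returns True while the action is still running."""
        if self.frames_left > 0:
            self.frames_left -= 1
        return self.frames_left > 0
```

While `tick()` returns True, the unit would simply not be issued a new command.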

I'm not particularly sure if that's the way to go. Of course an action needs some amount of time to be executed; that sounds alright for moving. But it gets more complex for the execution of an attack. The attack animation itself probably needs just 3 frames until damage is dealt. But what about the case where the target is not in range? And what about the weapon cooldown? Is the action completed after one shot has landed? How would you guys define an action?

Stimpack is another interesting example. My training is based on a fitness function which measures the utility of one action: basically, the states before and after the action are compared. Since using Stimpack is a one-frame action, the state always came out better. Let me show this with an example, which also demonstrates how the fitness function currently works.

Given the scenario of 10 vs 10 marines on the very first frame of the match, where one unit chooses the action UseStimpack:

So the state after Stimpack has been enabled has slightly improved. The ratio of the enemy state equals one, since the state observation occurred within a single frame. So the fitness function would keep recommending the use of the action UseStimpack.

After that ratio math, the actual output of the neural net is multiplied by 1 + (friendRatio - enemyRatio), and then that output vector is backpropagated (trained).
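The ratio adjustment described above could be sketched like this. The `1 + (friendRatio - enemyRatio)` factor is taken from the post; the function names, the HP-based ratio definition, and the example numbers are my own assumptions for illustration:

```python
# Hypothetical sketch of the ratio-based fitness adjustment described above.
# friend_ratio / enemy_ratio compare a side's total HP after the action to
# before it; a value > 1 means that side improved.

def state_ratio(hp_before, hp_after):
    """Ratio of a side's total hit points after vs. before an action."""
    return sum(hp_after) / sum(hp_before)

def adjust_output(output, friend_ratio, enemy_ratio):
    """Scale the net's output vector by 1 + (friendRatio - enemyRatio)
    to produce the training target, as in the post."""
    factor = 1 + (friend_ratio - enemy_ratio)
    return [o * factor for o in output]

# Example: friends kept all their HP, enemies lost some -> factor > 1,
# so the taken action looks good.
friend_ratio = state_ratio([40, 40], [40, 40])   # 1.0
enemy_ratio = state_ratio([40, 40], [34, 40])    # 74/80 = 0.925
target = adjust_output([0.5, 0.5, 0.5], friend_ratio, enemy_ratio)
```

Note that this also exposes the Stimpack problem: for a one-frame action the enemy ratio is always 1, so any self-improvement makes the factor exceed 1.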

That's basically all for now. It would be really great if some people could contribute to this matter with their experience in game AI. Especially the definition of an action would be highly valuable to discuss!

Hey! I have been interested in the same topic for a while (training ANNs to command the micro behaviour of individual units), and I like how you've laid out the architecture and explained your approach so far.

IMO, the current list of output actions that you make use of is sufficient. For the time being, I'd pay more attention to the inputs to the first layer, though, as those might have a crucial impact on efficient training. I haven't yet had the time to take a look at your source code, but do you consider things like the HPs of the current unit, the distance to each enemy unit, whether it is positioned on high or low ground, etc.?

Regarding the expiration of each command, I think you're on the right track in considering each command's duration. If I were you, I'd simply not touch the unit for X frames after an issued command, where X varies per command type, e.g. 4 frames for AttackEnemyUnitInRange, 12 frames for Retreat, etc.

Also, I'm not sure if you already do this, but perhaps it makes sense to only train the network once the whole battle has finished (or a timeout of Y frames has occurred), and to have the fitness function consider something like the total frames passed (a lower count is better if our side has won; more is better if the enemy side has won), our own units' remaining HPs, the enemy units' remaining HPs, and so on.
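The end-of-battle fitness idea above could be sketched roughly as follows. This is only an illustration under my own assumptions; the weighting, the timeout value, and the [0, 1] scaling are made up, not something from the thread:

```python
# Hypothetical end-of-battle fitness: fast wins and slow losses score
# higher on the time component, and remaining HP difference is rewarded.
# HP values are assumed to be normalized to [0, 1]; weights are made up.
def battle_fitness(won, frames_passed, own_hp_left, enemy_hp_left,
                   timeout_frames=5000):
    time_fraction = min(frames_passed / timeout_frames, 1.0)
    # A fast win is better; if we lost, surviving longer is better.
    time_score = (1.0 - time_fraction) if won else time_fraction
    hp_score = (own_hp_left - enemy_hp_left + 1.0) / 2.0  # map [-1,1] -> [0,1]
    return 0.5 * time_score + 0.5 * hp_score
```

A quick win with full HP would then score near 1, while a slow loss with no HP left would land around 0.5 or below, depending on how long the side held out.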

Keep us updated on any progress. I'll be very interested to follow the project (you've already got the star on GitHub).

Right now I'm planning to simplify the current scenario. I had to debug and trace down some issues regarding the fitness function and related parts. So far my fitness function hasn't been able to generate proper training examples. After training, most output vectors look like [1,1,1,1,1], meaning that any action would be great. So for some reason most of the actions taken are considered good.

Regarding Supreme Commander 2, the sources authored by Michael Robbins are somewhat inconsistent. Since deciphering this information is pretty challenging, I believe that getting a simple case to work is a good base for increasing complexity later.

By simplification, I'm thinking about scrapping Stimpack usage and maybe using a 1 vs 1 combat situation. In the blog post, Robbins mentioned that during training each individual unit would only know about itself and the enemies, whereas in execution mode it would gather information about friendly units as well. It is also confusing that Robbins talks about delta values, because he wrote about ratios in his chapter, which is the most detailed source. Well, ratios make the most sense, since overall HP or a single unit's HP always works as a percentage (a value between 0 and 1), which is also a good approach for data normalization.

So in the end I could reduce the inputs to local information. This makes some sense regarding my earlier issue, where one unit would consider a stupid action to be great because its mates went all out and dealt tons of damage while that particular unit was useless. I guess this tends to drive the training towards that [1,1,1,1,1] output.

There is another piece of confusing information from Robbins. He first states that data is generated by comparing the state before and after an action. Further down, in a comment, he talks about encounters instead.

And lastly, concerning my current training flow: at the end of each match, all the gathered training examples are backpropagated just once.

I also modified the Attack action: as soon as one shot has been fired, the attack is complete. So moving into range is part of that action.

The real issue is probably within my fitness function, because the computed output keeps converging towards [1,1,1]. It could be some stupid logical flaw, or the whole approach might be missing some crucial point.

Always converging towards [1, 1, 1, ...] sounds like the NN might not be getting enough meaningful data (hence a random action is considered as good as any other), or the training process is somehow broken...

How about feeding it a bit more data, like:
1. the current unit's HPs
2. the HPs of the enemy unit with the lowest HPs in range
3. the total number of enemy units that have the current unit in range
4. the total damage that those units could cause to the current unit
5. the total number of enemy units that do NOT yet have the current unit in range, but are just one move away (say, their attack radius + 2 * 32 pixels)
6. the total HPs of allied units in range of the enemy units from point 3
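A feature vector along those lines could be extracted roughly like this. The `Unit` class and every name here are hypothetical stand-ins; the real data would come from the game API, and point 6 is simplified to a single pre-computed value:

```python
# Hypothetical feature extraction for the suggested inputs. HP features
# are normalized to [0, 1]; counts and damage are left as raw numbers.
from dataclasses import dataclass

@dataclass
class Unit:
    hp: float
    max_hp: float
    distance: float      # distance to the current unit, in pixels
    attack_range: float  # attack radius, in pixels
    damage: float

def features(current_hp, current_max_hp, enemies, allies_hp_in_range):
    in_range = [e for e in enemies if e.distance <= e.attack_range]
    almost = [e for e in enemies
              if e.attack_range < e.distance <= e.attack_range + 2 * 32]
    weakest = min((e.hp / e.max_hp for e in in_range), default=1.0)
    return [
        current_hp / current_max_hp,      # 1. current unit HP (normalized)
        weakest,                          # 2. weakest enemy HP among those in range
        len(in_range),                    # 3. enemies that have us in range
        sum(e.damage for e in in_range),  # 4. potential damage from them
        len(almost),                      # 5. enemies one move away
        allies_hp_in_range,               # 6. allied HP near those enemies
    ]
```

Counts and damage would probably also need normalization before being fed into the net, but that depends on the chosen activation functions.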

In Broodwar it doesn't seem logical to attack the strongest (highest-HP) enemy unit; it rather makes sense to eliminate enemy units as quickly as possible. So I'd say have AttackWeakest instead of AttackClosest, or have them both.

Besides, have you tested your ANN implementation on a much simpler problem like XOR or tic-tac-toe just to verify that training works in general?

Kinda off topic, but since I won't be able to compile your project any time soon, how about streaming the training process? I'd be interested in watching the gameplay evolve.

BTW, you could message me at krasi000 at gmail or on FB messenger for faster communication if you like

I simplified the scenario to a 1 vs 1 fight between marines without Stimpack. Only the health of the friendly and the enemy unit is taken as input. The outputs are attacking, moving towards, and moving back.

I logged in detail how the fitness function worked up to the point where the outputs converge to 0 or 1. Think of the numbers between 0 and 1 as percentages, e.g. 0.85 means 85% hit points left (that's basically the already-normalized value).

That's one complete log of one fitness function execution. The action with index 0 is Attack. It is already a flaw that the value of Final Enemy is still 1 after the attack action; it should have been lower after dealing 6 damage to the enemy's health. In contrast, the following execution does register the loss of hit points:

So action 2 (Move Back) registered the loss of hit points on both sides. In most cases, actions are measured with wrong pre-action and post-action information.
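One way to avoid such mismatched pre-/post-action states is to snapshot the state exactly when the action is issued and again exactly when it completes, and tie both snapshots to the training example. A minimal sketch, with all names hypothetical:

```python
# Hypothetical bookkeeping that ties each training example to the exact
# pre- and post-action state, so the fitness ratio is computed over the
# frames the action actually covered. States are (friendHP, enemyHP)
# tuples, already normalized to [0, 1].
class ActionRecord:
    def __init__(self, action, pre_state):
        self.action = action
        self.pre_state = pre_state   # snapshot taken when the action is issued
        self.post_state = None       # filled in when the action completes

    def complete(self, post_state):
        self.post_state = post_state

    def ratios(self):
        """(friendRatio, enemyRatio) over the action's own time window."""
        pre_f, pre_e = self.pre_state
        post_f, post_e = self.post_state
        return post_f / pre_f, post_e / pre_e
```

With this, an Attack that deals damage cannot end up with an enemy ratio of 1, because the post-state is only read once the shot has actually landed.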

These mistakes cause tons of noise within the training data. Mike Robbins was interviewed by AiGameDev.com back then, and he talked about noise. But rather than noise caused by such logical bugs, he talked about noise created by units knowing about the others. That's pretty much the scenario I described previously: units dealing damage while others run around will make those others feel good as well.

edit:

Actually, Supreme Commander 2's output actions are all some kind of attack action: attack weakest, most valuable, strongest, structure, ACU, ... So that kind of fitness function can tell whether an attack was good or not. It is much harder to reward a unit for defensive or evasive actions. Supreme Commander 2 handles this by looking at all the outputs: if all outputs are below 50%, the AI decides to retreat. So I guess my approach is now to move more in that direction. I probably have to add more unit types in order to produce a better result. Otherwise, the current simplification would learn to attack in every situation, since defense will never be considered good (> 1).
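The retreat-by-threshold rule described here is easy to express. A sketch assuming the 50% threshold mentioned above; everything else (names, the output layout) is illustrative:

```python
# Hypothetical decision step mirroring the described Supreme Commander 2
# behaviour: pick the highest-scoring attack output, but retreat if no
# attack option reaches the threshold.
def decide(attack_outputs, threshold=0.5):
    best = max(range(len(attack_outputs)), key=lambda i: attack_outputs[i])
    if attack_outputs[best] < threshold:
        return "Retreat"
    return best  # index of the chosen attack action
```

This way the fitness function only ever has to judge attacks, and defensive behaviour emerges from the threshold rather than from a trained output.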

Thinking about all of this led me to the insight that moving can't ever be rated as a good decision. The current information compares health: while not being hit, I also can't hit the enemy, so the adjustment ratio will always be one, or worse if the unit got hit.

Regarding the noise while training, I would still re-introduce an idea that I mentioned in my first post on this topic: "perhaps it makes sense to only train the network once the whole battle has finished (or a timeout of Y frames has occurred) and for the fitness function then consider something like total frames passed (a lower count is better if our side has won...)". So basically this is a form of *offline* learning: during the battle, all units react using the same ANN; once the battle finishes, the ANN's weights are adjusted according to the outcome of the battle. Then a new battle is launched with the adjusted ANN weights, and so on. I think it's worth giving it a shot. Please report back if you do!

The neural net doesn't make the decisions during training; the output action is selected at random. Otherwise a fresh neural net would most likely map every input to the same output. Still, before the random action is selected, the net has to be fed forward in order to generate training examples. Making random decisions ensures that a broad range of training examples is covered.
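That training-time selection step could be sketched as follows. The function and parameter names are hypothetical; `net_forward` stands in for whatever the actual feed-forward call is:

```python
# Hypothetical training-time action selection: feed forward to record the
# net's current output (the base for the training example), but pick the
# action to actually execute at random so all actions get explored.
import random

def choose_training_action(net_forward, inputs, num_actions, rng=random):
    output = net_forward(inputs)         # recorded for the training example
    action = rng.randrange(num_actions)  # executed action chosen at random
    return action, output
```

A common refinement would be epsilon-greedy selection (random only with some probability, otherwise the net's best action), but pure random exploration matches what is described here.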

All the training examples gathered during one match are iterated over once for training at the conclusion of the match.

By noise I mean conflicting data: data pairs which have the same inputs but totally different outputs. So there is always a huge training error.

The current state of my project is that I'm getting close to finishing the implementation of my feedback logic. So next week the training procedure should generate clean training examples.