Unreal Tournament bots appear more human than humans

One bot was tagged as human by 52 percent of judges.

Two programming teams have created intelligent virtual gamers—or "bots"—that have not only beaten the Turing test, but managed to appear more human than human gamers.

The UT^2 bot, programmed by a team from the University of Texas, and MirrorBot, programmed by Romanian computer scientist Mihai Polceanu, split a top prize of $7,000 (£4,300) at The 2K BotPrize—a contest that has been challenging programmers since 2008 to create game bots that appear to be as human as possible, playing like fallible human gamers rather than near-perfect computer AI.

In the competition, computer-controlled bots created by programming teams from all over the world face off alongside human players, who act as judges, in the virtual battle zone of Unreal Tournament 2004. Any combatant a judge meets whom they believe to be human is tagged with a "judging gun." After several rounds of combat, the bot that has received the most human tags wins the contest.

While the human players managed to gain an average "humanness" rating of 40 percent, the UT^2 bot and MirrorBot both achieved a rating of 52 percent. This is the first time in the contest's history that a bot has achieved the target score of 50 percent "humanness."
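The scoring works out to a simple ratio. As a minimal sketch (the function and the example tally are illustrative, not taken from the competition's actual software), a player's humanness rating is just the fraction of judgements in which a judge tagged them as human:

```python
# Sketch of the BotPrize "humanness" metric: the fraction of
# judgements in which a judge tagged this player as human.
# (Illustrative only; not the competition's actual code.)

def humanness(human_tags: int, total_judgements: int) -> float:
    """Fraction of judgements that tagged this player as human."""
    if total_judgements == 0:
        raise ValueError("player was never judged")
    return human_tags / total_judgements

# For example, 13 human tags out of 25 judgements would yield the
# winning bots' reported 52 percent.
print(round(humanness(13, 25) * 100))  # 52
```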

"A great deal of the challenge is in defining what 'human-like' is, and then setting constraints upon the neural networks so that they evolve toward that behavior," University of Texas doctoral student Jacob Schrum told his department website.

"If we just set the goal as eliminating one's enemies, a bot will evolve toward having perfect aim, which is not very human-like. So we impose constraints on the bot's aim, such that rapid movements and long distances decrease accuracy. By evolving for good performance under such behavioural constraints, the bot's skill is optimised within human limitations, resulting in behaviour that is good but still human-like."
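Schrum's description of constrained aim can be sketched as a simple error model in which inaccuracy grows with target distance and with how fast the bot is turning. This is an illustrative toy, not the UT^2 team's actual code, and all the coefficients are made-up tuning constants:

```python
# Illustrative sketch of an aim constraint like the one Schrum
# describes: angular aim error grows with target distance and with
# the bot's turn rate, so a bot evolved under this model cannot
# have perfect aim. Coefficients are invented for illustration.

def aim_error_degrees(distance: float, turn_rate: float,
                      base_error: float = 0.5,
                      distance_factor: float = 0.002,
                      turn_factor: float = 0.05) -> float:
    """Angular aim error, in degrees, for one simulated shot."""
    return (base_error
            + distance_factor * distance
            + turn_factor * abs(turn_rate))

# A stationary bot aiming at a close target is nearly perfect...
print(round(aim_error_degrees(distance=100, turn_rate=0), 2))    # 0.7
# ...while a fast-turning bot at long range misses by far more.
print(round(aim_error_degrees(distance=2000, turn_rate=90), 2))  # 9.0
```

A fitness function that rewards kills under this error model would then select for human-plausible tactics (closing distance, steadying aim) rather than superhuman snap-shots.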

Fittingly, the completion of The 2K BotPrize's challenge comes 100 years after the birth of Alan Turing, who posited his famous test. Now that two bots have successfully achieved a 50 percent humanness rating, the 2K BotPrize team hopes to create a new challenge for bot programmers.

Should any Unreal Tournament 2004 gamers wish to take on the UT^2 bot, the team have made their prize-winning player available here.

The more interesting question here is "Why did humans only score 40 percent?" The judges appear to have been using criteria that were worse than random. Why, I wonder. And how many samples were there? What's the signal-to-noise ratio in this data? Are gamers so used to seeing computer AI that they're rendered worse judges of humanness? Or is there something else at play?

This actually really excites me. Imagine if we could accomplish this level of human-ish-ness in other games. No longer would we be tied to the tyranny of rubber-band difficulty or bots that cheat (I'm looking at you, infinite resource RTS bots). Really that's what annoys me the most. Not bad AI but cheating AI.

Games like this restrict humans to very simple behavior - attacking and surviving inside a maze. Whatever humans or programs do there can hardly serve as a measure of "humanness". So this is more like "humans restricted to a subhuman level judge a program to be better at it".

Ugh. Seriously? Yes, these bots are an awesome achievement. But this has absolutely nothing to do with passing a Turing test. It's not just a catch-all term for anything that simulates humans in any context. It specifically means having conversations that are indistinguishable from those of normal-thinking humans. Conversation was picked because it's so complex that it's hard to simulate, and yet we instantly notice errors. Gaming has none of this.

What this has to do with the Turing test is just that the game, by limiting interactions, makes it easier for the bot to pretend to be human, and thus gradually removing limits and then building up to meet them is a possible way to make a machine that passes a Turing test.

Which, BTW, does not mean a 50% humanness rating. The entire concept is that it has to reach 100%. If every other human can tell it's a bot, it isn't as smart as a human.

Also, I'm really getting tired of this buggy comment time detection system that will let me make comments easily, say, 5 minutes apart, and then all of a sudden won't let me make a post for like 30 minutes.

It wouldn't be so bad if they'd just use a captcha, but, no, I have to sit around and wait on buggy software.

Ugh. Seriously? Yes, these bots are an awesome achievement. But this has absolutely nothing to do with passing a Turing test. It's not just a catch-all term for anything that simulates humans in any context. It specifically means having conversations that are indistinguishable from that of normal-thinking humans. Conversation was picked because it's so complex that it's hard to simulate, and yet we instantly notice errors. Gaming has none of this.

This is true, but you don't know which of us commenters are bots, do you?

I would like to note that any player moderately familiar with UT2004 will correctly identify the built-in bots as bots. The fact that these got a 50 percent humanness rating doesn't exactly inspire confidence in the results.

Spectating bots will also give them away immediately, as their aim pitch doesn't change.

EDIT: Fun fact: according to their criteria, Epic's built-in bots (at least some of them, especially the bad ones) passed the test and came in third behind the two competitors winning the competition.

Also, one human judge was apparently guessing (a 50% chance of judging correctly), and yet another seemingly made an effort to be wrong (a less than 50% chance of being right when judging). A few bots judged better than these two humans did.

I'm guessing nobody was allowed to chat, because otherwise all the judges would have to do is tag anyone using incoherent racial or sexual slurs as human.

That was my thought as well... just jump up and down randomly and call anyone that kills you a "fag" and you'll fit right in. Bonus points if the bots can be trained to "teabag" fallen opponents.

A good player of UT2004 will not only be able to tell bot from human, but good player from bad player. Jumping randomly is about the best indicator of a bad player. Calling others "fag" or teabagging is also a really good indicator of idiots.

Does not really surprise me... When "professional" gamers (I don't think this is a real profession, by the way) see players better than they are, they often accuse them of cheating or using bots... This is just human nature... The thing is, being a human player gives you many advantages over bots, as right now bots really do not learn; they are programmed... When I play TF2, for instance, after playing with the same people for 10-15 minutes you start to see patterns in their gaming, and thus can read them and know with some certainty what they will do next... A really good player will change their playing habits based on the people they are playing with. Thus many really good human players will often play better/cleaner than bots...

I'm guessing nobody was allowed to chat, because otherwise all the judges would have to do is tag anyone using incoherent racial or sexual slurs as human.

That's actually pretty easy to script a bot to do. You almost don't even need to take the usual step of making sure you don't repeat yourself too much...
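The "replay recorded chat without repeating yourself" trick this comment describes is indeed only a few lines. A toy sketch (the canned lines are placeholders; a real bot would draw from actual recorded game chat):

```python
import random

# Toy sketch of replaying canned chat lines in random order without
# repeating a line until the whole pool has been used.

class ChatReplayer:
    """Serves canned chat lines in shuffled order; a line never
    repeats until every line in the pool has been used once."""

    def __init__(self, lines, rng=None):
        self._pool = list(lines)
        self._rng = rng or random.Random()
        self._queue = []

    def next_line(self) -> str:
        if not self._queue:              # pool exhausted: reshuffle
            self._queue = self._pool[:]
            self._rng.shuffle(self._queue)
        return self._queue.pop()

bot = ChatReplayer(["gg", "nice shot", "camper!", "lag..."])
seen = [bot.next_line() for _ in range(4)]
# Four draws use each of the four lines exactly once.
assert sorted(seen) == ["camper!", "gg", "lag...", "nice shot"]
```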

I am seriously not impressed with all this, because over 20 years ago two of my college housemates wrote a bot for an early online game (netrek) that used:

* "imperfect" aiming and navigation;
* appropriate behavioral responses to in-game chat slang (like "focus fire on this guy!" or "protect planet X");
* a library of randomly recorded chat logs that were filtered and replayed. (Mostly your "slurs", truth be told.)

There were also amusing touches like the bot dropping all targeting limitations during its last 1% of life... which in practice was never noticed, but meant it always went down with an impact.

Still, the main point is that this is not only not new, it is so not new that some of these programming students were probably still in diapers when it was already solved, using technology several orders of magnitude slower and with several orders of magnitude fewer resources.

The more interesting question here is "Why did humans only score 40%?" The judges appear to have been using criteria that were worse than random. Why, I wonder. And how many samples were there? What's the signal to noise in this data? Are gamers so used to seeing computer AI that they're rendered worse judges of human-ness? Or is there something else at play.

Each player received around 25 judgements. Because they were trying to find bots, the judges tended to be biased that way. What's new here is that in all previous competitions, humans were judged more human than bots.
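With only around 25 judgements per player, the ratings are noisy. A quick binomial standard-error calculation (textbook statistics, not a figure from the competition write-up) shows how wide the error bars are:

```python
import math

# Standard error of an observed proportion p over n binary
# judgements: sqrt(p * (1 - p) / n). Textbook binomial statistics,
# not a figure from the competition itself.

def standard_error(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

# At 25 judgements, a 52 percent rating carries roughly a 10-point
# standard error, so the bots' 52 percent and the humans' 40
# percent are barely one standard error apart.
print(round(standard_error(0.52, 25) * 100, 1))  # 10.0
```

That goes some way toward answering the signal-to-noise question above: with samples this small, individual ratings are only loosely pinned down.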

However, this time Epic's bots were also judged to be more human than actual human players (outdoing all but two new bots submitted to the competition).

EDIT: Dear god, I just watched the videos of judges playing. If you're this bad at the game, anything can be a bot or a human. You don't have enough experience to tell humans and bots apart, which, again, is evidenced by the fact that these judges had difficulty identifying Epic's bots correctly.