Alan Turing came up with a thought-experiment in 1950. It was a way to test a machine's ability to demonstrate intelligent behavior. The Turing Test has since become one of the hallmarks of artificial intelligence (AI) research, as well as a source of continual debate. Turing had been exploring the question of whether machines can think. To avoid the difficulty of defining "intelligence", he proposed taking a behaviorist stance: can machines do what we humans do? [1] In the Turing Test, a human observer engages in a conversation (using text-chat only) with two hidden agents: one is a real human and the other is an AI program. Both the hidden human and the AI program try to appear convincingly human. If the observer concludes that the AI program is the real human, the AI program passes the Turing Test.

Verbal language may be the ultimate indicator of human intelligence, but it may not be the most representative indicator of intelligence in the broadest sense. Intelligence may be better understood, not as something based on a system of abstract symbols, grammars, and logical decisions, but as something that emerges within an embodied, situated agent that must adapt within a complex, dynamic environment.

Referring to the Turing Test, N. Katherine Hayles, in How We Became Posthuman, writes, "Here, at the inaugural moment of the computer age, the erasure of embodiment is performed so that 'intelligence' becomes a property of the formal manipulation of symbols rather than enaction in the human life-world" [2].
Hubert Dreyfus was one of the earliest and most forceful critics of AI (or what some now call "Good Old-Fashioned AI"), claiming that a machine cannot achieve intelligence without a body [3]. This was also the perspective of MIT robot-master Rodney Brooks, who famously said, "Elephants don't play chess" [4]. Intelligence emerged out of a long evolution of brains evolving within bodies, in complex ecologies. Written language is a recent invention in the long evolutionary history of communicative behavior in the biosphere. Considering that intelligence (and therefore communication) arose from the deep evolution of life on Earth, email, instant messaging, and virtual worlds came into existence in the blink of an eye. The classic Turing Test therefore addresses only a thin veneer of human behavior.

Regarding the creation of AI programs, if we can simulate at least some basic aspects of the embodied foundations of intelligence, we may be better prepared to understand higher intelligence. Such simulation might also be useful for creating believable behaviors in computer games and collaborative virtual environments. According to Justine Cassell, while there is a trend towards ambient computing, in which the computer becomes more invisible and background environments (like rooms in a house) become more intelligent, we still have a natural need to identify the physical source of an intelligence: "We need to locate intelligence, and this need poses problems for the invisible computer. The best example of located intelligence, of course, is the body" [5].

Instead of using disembodied text as the primary messaging alphabet for a Turing Test, what if we used something more primal, more fundamental? My answer is the Gestural Turing Test, a variation that uses a gestural alphabet. Here is how it works: a human subject sits in a chair to the right of a large room divider. In front of the subject is a large screen showing two sets of three moving dots. A human controller sits silently in a chair to the left of the divider, also facing the screen. The movements of the three dots on the left of the screen are created either by the hidden human (who is wearing motion capture markers on his head and hands) or by an AI agent - a computer program animating the dots algorithmically. A random decision determines whether the motions will be controlled by the human or by the AI, and the subject must decide which it is. The subject wears the same motion capture markers as the hidden human, and understands that he is controlling the dots on the right of the screen. The subject is told that the hidden agent "sees" his motions and will try to interact gesturally.
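The trial logic just described can be sketched in a few lines of code. This is a hypothetical reconstruction, not the experiment's actual software; all of the names here are my own:

```python
import random

AGENTS = ["human", "ai"]

def run_trial(get_subject_guess):
    """One trial of the test: a random decision assigns the hidden agent,
    the subject declares a guess, and the guess is scored."""
    hidden = random.choice(AGENTS)   # coin flip: human or AI drives the left dots
    guess = get_subject_guess()      # subject declares "human" or "ai"
    return {"hidden": hidden, "guess": guess, "correct": guess == hidden}

# Example: a subject who always guesses "human" will be correct only
# when the hidden agent actually is the human.
result = run_trial(lambda: "human")
```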

Now, you may ask: what is there to discuss if you only have a few points to wave around in the air? In the classic Turing Test, you can bring up any subject and discuss it endlessly. There is just no way to discuss your grandmother's arthritis or the Civil War with three moving dots. So, what is the activity? What "game" does the subject play with dots? Well, the goal of the Turing Test is simply to fool a human subject into believing that an AI program is a human. Period. How that is accomplished is up to the subject and the AI program. Turing chose the medium of text chat, which is devoid of any visual or audible cues; body language was thus not an option for Turing. In contrast, I propose a small set of moving dots and no verbal communication. Moving dots are abstract visual elements (like the alphabet of written language); however, they are situated in time, and more intimately tied to the original energy of natural language.

One may argue that since we don't normally communicate with moving dots, there's not much basis to the idea of having "successful" communication using only dots. I would counter that the written word is just as abstract, and just as arbitrary. And don't be fooled by the number three: the fact that these dots move means that they are capable of an essentially infinite range of expressive motions and visual phrasings. The difficulty with considering three moving dots a viable "alphabet" stems from the fact that we grew up with a very different alphabet: the alphabet of written language. It has permeated every corner of society, and so we don't question whether it is natural - in fact, we may not be capable of questioning it, because verbal literacy has become our primary virtual reality - a virtual reality by which other realities are referenced. And that includes me writing - and you reading - these words.

A Turing Test that uses body language creates a different dimension of possibilities. For instance, an AI program could be made to appear like a real person trying to communicate gesturally, as if to say, "Indeed, I am alive!" or, "Yes, I see you are gesturing at me". Perhaps the AI program could be programmed to spontaneously start playing an imitation game, which might turn into something like a geometrical question-and-answer activity. Once the personality and mood of the two agents (one human, one not) were mutually detected, the activity would become more nuanced, and certain expectations would come into effect. After extended time, it might be necessary for the AI program to interpret certain gestures semantically, to accommodate the spontaneous language genesis that the human subject would be forming. Whether or not these gestures were conscious signals given by the subject, the AI program might be designed to attribute meaning to them if they tended to occur often and in response to certain situations. Incidental motions could be used as well - the effects of scratching, stretching, shifting one's weight, and other byproducts of a real human anatomy behind the dots - to lend some realism and casual believability to the experience. In fact, I would even suggest giving the AI program the option to stand up and walk off the screen, as if to say, "enough of this silly game - I'm bored". And why not? As far as I'm concerned, that is within the realm of body language allowed in the Gestural Turing Test.

Behavioral Realism
Since I am taking the Turing Test into a visual realm here, why am I proposing to use such a simple graphical representation? Why not render fully realistic virtual humans? One reason is that I want to avoid the uncanny valley. The goal is not visual realism - it is behavioral realism. We have plenty of examples of image-based Turing Tests: depictions of virtual humans on the covers of computer graphics journals and computer game magazines. These images are starting to become indistinguishable from photographs of real people. More and more of them are passing the "Image Turing Test".

But human eyes and brains don't use images to apprehend a Living Other - because images don't include time. There is another reason I have proposed to use only dots: any attempt at visual realism would be a distraction from the goal of the test, which is about pure motion. How much visual detail can be removed from a signal while still leaving the essence of expressive motion? This is one of the questions asked by researchers who use point light displays, a visualization technique that is often used to study biological motion.

Research shows that people are remarkably good at detecting when something is being moved by a living thing, as opposed to being moved by a machine or by the wind.

Experiments using point light displays have tested subjects' perception of various kinds of motion. These experiments demonstrate that very few points of light are needed for people to recognize what is going on. Researchers have even located neural structures in the brain that respond to the motions of living things, as opposed to non-living things.

Believability
Variations of the Turing Test have been used as a testing methodology by several researchers to measure believability in virtual character motion [6][7][8]. "Believability" is an interesting concept. Film critics refer to the viewer's suspension of disbelief. The concept is familiar to makers of human-like robots and to game AI designers. It even plays into the world of political campaigning.
Believability is in fact a fuzzy metric, and it might be better to measure it not as a binary yes or no, but as a matter of degree. Some critics of the classic Turing Test have suggested that its all-or-nothing criterion may be problematic, and that a graded assessment might be more appropriate [9].
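A graded assessment could be as simple as reporting the fraction of judgments in which an agent was taken for human, rather than a single pass/fail verdict. A minimal sketch (this metric is my own illustration, not one taken from [9]):

```python
def believability_score(judged_human):
    """Graded believability: the fraction of trials in which observers
    judged the agent to be human (1 = judged human, 0 = judged artificial).
    Returns a degree between 0.0 and 1.0 instead of a binary verdict."""
    if not judged_human:
        return 0.0
    return sum(judged_human) / len(judged_human)

# An agent judged human in 3 of 4 trials scores 0.75.
score = believability_score([1, 0, 1, 1])
```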

Consider the amount of visual detail used. More dots (dozens, perhaps) would make it easier for the human observer to discern between artificial and human, but this would require more sophisticated physical modeling of the human body, as well as more sophisticated AI. I propose three points because the head and hands are the most motion-expressive parts of the body: they are highly mobile, and most gestural emblems originate in these regions.

The graph shown here is similar to the uncanny valley graph. It illustrates the following hypothesis: as the number of points used to reveal motion increases, the subject's ability to detect whether the mover is real increases. Believability goes up for real humans, and it goes down for artificial humans. Or, to put it another way, humans become more identifiable as human, and artificial humans become more identifiable as artificial.

The implication is that the Gestural Turing Test could be deployed at many possible levels of visual realism; the more visual realism used, the more sophisticated the AI algorithms need to be to accommodate the added detail. Believability is not an absolute value or a universal constant - it varies among types of media, observers, and contexts.

The Experiment
Enough with all this pontificating, hand-waving, and thought-experimenting! In the fall of 2009, I was doing research and teaching a class at the School for Interactive Art and Technology (SIAT) at Simon Fraser University in Vancouver, BC.

I suggested to Magy Seif El-Nasr, a professor and researcher at SIAT with whom I was working at the time, that we implement a Gestural Turing Test, and she said, "Let's do it". Magy does research on believability and nonverbal behavior in avatars and game characters [10].

With my collaborator, graduate student Bardia Aghabeigi, I put together a Gestural Turing Test using the Vicon motion capture studio at Emily Carr University of Art and Design in Vancouver, managed by Rick Overington [11]. We designed a handful of AI algorithms that mimic the motions of human head and hand positions. Bardia implemented a network-based architecture that allowed the AI to send motion data to the Vicon system, where it could be picked up by a rendering engine and animated on a large screen, just as described earlier.

The modern motion-capture studio is an impressive set-up. The studio we used is large, windowless, and painted completely black. Twenty cameras are lined up at the tops of the four walls, near the ceiling. These cameras act like compound eyes that look down upon the human subjects from several viewpoints as they move about.

The Vicon system pieces together the data streams from the cameras - some of which may not see all the points at a given moment, because markers can be hidden from view - and reconstructs the 3D geometry. We attached motion capture markers (highly reflective little balls) to a couple of soft hats and gloves, and configured the Vicon system to expect these markers to appear on the heads and hands of two different people, separated by several meters. We grabbed one of those large soft pads that motion capture stunt actors fall onto without breaking bones, and propped it up on its edge as a makeshift room divider. (It served as an acoustic divider as well.) We set up two chairs, each facing the huge screen that projected the computer display of the markers.

Gestural AI
There were three hidden agents used to drive the dots on the left side of the screen: one human and two experimental AI algorithms. The AI algorithms were quite simple as far as AI programs go. To make them look as authentic as possible, we used a "background" layer of continual movement. This was meant to give the impression of a person sitting with hands hanging by the sides of the chair, or shifting to put the hands in the lap, or having one hand move over to the other hand as if to adjust a motion capture marker. Some of these background movements included scratching an itch in some nonspecific part of the body. A barely-perceptible amount of random motion was added to make the points appear as if they were attached to a living, breathing person. This is similar to a technique called Perlin Noise [12], named after Ken Perlin, who devised various techniques for adding random noise to graphical textures and synthetic motion.
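The effect can be approximated with smooth value noise, a simplified cousin of Perlin's gradient noise. In the sketch below (the amplitude and all names are my own assumptions, not values from the experiment), random values at integer lattice points are interpolated with a smoothstep curve, and a barely-perceptible offset is added to a marker's resting position:

```python
import math
import random

def value_noise(t, lattice):
    """Smooth 1D value noise: interpolate random lattice values with a
    smoothstep fade curve. A simplified stand-in for Perlin noise."""
    i = int(math.floor(t))
    f = t - i
    s = f * f * (3.0 - 2.0 * f)          # smoothstep fade
    a = lattice[i % len(lattice)]
    b = lattice[(i + 1) % len(lattice)]
    return a + s * (b - a)

random.seed(1)
lattice = [random.uniform(-1.0, 1.0) for _ in range(64)]

def jittered_position(base_xyz, t, amplitude=0.002):
    """Add barely-perceptible noise to a marker's resting position,
    suggesting a living, breathing body behind the dot."""
    return tuple(base + amplitude * value_noise(t + 10.0 * k, lattice)
                 for k, base in enumerate(base_xyz))
```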

The first AI algorithm used a set of pre-recorded motions as its vocabulary of gestures, with a simple blending scheme (the transitions between gestures were not very smooth). This was deliberate: one of the AI programs would perform less reliably than the other, in order to give variation in believability. The other AI algorithm used an imitation scheme. When it detected movement in the subject above a specific threshold, it would start "watching" the gestures of the subject; after a few seconds of this, it would play back what it "saw", blending smoothly between its background motions and its imitative gestures.
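The imitation scheme amounts to a small state machine: idle on background motion, record the subject's poses once their movement crosses a threshold, then play the recording back. The sketch below is my reconstruction from the description above; the class name, threshold, and frame count are invented, and the smooth blending is omitted for brevity:

```python
from collections import deque

class ImitationAgent:
    """Watch-then-replay gestural agent (a reconstruction, not the original code)."""

    def __init__(self, threshold=0.05, record_frames=120):
        self.threshold = threshold          # motion magnitude that triggers "watching"
        self.record_frames = record_frames  # how many frames to record (a few seconds)
        self.state = "background"
        self.buffer = deque()               # poses being recorded
        self.playback = deque()             # poses queued for replay

    def step(self, motion_magnitude, subject_pose, background_pose):
        """Advance one frame; return the pose the agent's dots should show."""
        if self.state == "background":
            if motion_magnitude > self.threshold:
                self.state = "recording"
            return background_pose
        if self.state == "recording":
            self.buffer.append(subject_pose)
            if len(self.buffer) >= self.record_frames:
                self.playback = self.buffer
                self.buffer = deque()
                self.state = "playback"
            return background_pose
        # playback: replay what was "seen", then fall back to background motion
        if self.playback:
            return self.playback.popleft()
        self.state = "background"
        return background_pose
```

Note that, just as in the experiment, a subject whose motion never crosses the threshold never triggers the recording state, so the agent stays shyly in its background layer.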

The Subjects
On the day of the experiment, I went out and hunted for students wandering the halls of Emily Carr University, asking them if they had ever been in a motion capture studio and whether they would like to be subjects in a psychological experiment. Most of them were eager and excited. I collected a total of 17 subjects. Each subject was invited to sit in the chair, put on the hat and gloves, and become accustomed to thinking of him or herself as a primitive avatar in the form of three dots. Each subject was then asked to declare whether the other three dots were moved by a real human (which happened to be yours truly, sitting quietly on the other side of the divider) or by one of the two AI programs, which sat even more quietly inside Bardia's laptop, waiting for his command to start.

Each subject was given about six to twelve tests, and was told to take as long as he or she wanted to decide (we wanted to see whether reaction time would be correlated with believability; it turned out that there was no correlation). Some subjects made their decisions within seconds (these were no more accurate than average) and some took longer, with a few going past a minute. As for the gesturing by the human subjects, some launched into bold symphonic gestures, while others timidly shifted their hands and head, barely creating any motion. Some subjects drew shapes in the air to get some visual language interaction going, and others just wobbled in place. Subjects who displayed small, shy motions were met with equally shy AI behavior: our AI algorithms were designed to detect a specific threshold of motion in the subjects, and these motions were not energetic enough to trigger a response. It was a bit like two shy people who are unable to get a conversation going.
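A standard way to check for a relationship between a binary outcome (right vs. wrong guess) and a continuous one (response time) is the point-biserial correlation. A self-contained sketch of the computation (any data fed to it here would be hypothetical, not the experiment's):

```python
import math

def point_biserial(binary, continuous):
    """Point-biserial correlation between a 0/1 variable (e.g. correct
    vs. wrong guess) and a continuous variable (e.g. response time)."""
    n = len(binary)
    mean_all = sum(continuous) / n
    group1 = [c for b, c in zip(binary, continuous) if b]
    group0 = [c for b, c in zip(binary, continuous) if not b]
    sd = math.sqrt(sum((c - mean_all) ** 2 for c in continuous) / n)
    if sd == 0.0 or not group1 or not group0:
        return 0.0                # undefined: no variance, or one group is empty
    p = len(group1) / n
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    return (m1 - m0) / sd * math.sqrt(p * (1.0 - p))
```

A value near zero, as we found, indicates that taking longer to answer was not associated with answering correctly.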

Data
The following graphs show the results of these experiments; they are explained in more detail in the paper. We ran 6 to 12 tests on each of the 17 subjects, for a total of 168 tests. The graph at left shows the results in chronological order from top to bottom. In this graph, the set of tests per subject is delineated by a gray horizontal line. The length of each line is proportional to the time it took the subject to make a response; the longest duration was just over 95 seconds. If the response was "false", a black dot is shown at the right end of the line. Wrong guesses are indicated by black rectangles at the right side of the graph.

The graph at right shows the percentages of wrong vs. right responses for each of the three agents (the hidden human and the two hidden AI programs - again, this is explained in the paper).

Mirroring
As expected, most of the subjects could tell when the dots were being moved by the hidden human gesturer (me). Scientifically speaking, it is a no-no for me to serve as the hidden human gesturer, since I am biased towards the results of the experiment. At any rate, I tried as best I could to look human (which was one of Turing's original rules). Also as expected, the imitative AI algorithm scored better than the first one at fooling the subjects into thinking it was human. But we were surprised by how much it fooled them: almost half declared that it was human. Even more interesting is the fact that none of the subjects made any comment suggesting that they knew they were being aped.

Our human ability for self-reflection is considerably more advanced, and effortless. Mirroring is part of how we become acculturated. Mirror neurons are activated when we imitate each other, or when we witness another person having an experience that we have had. Mirror neurons are even activated when we imagine a person (or ourselves) having an experience. Our mirror neuron system is so eager and willing to do its job that we might be easily duped by even the simplest imitation.

Even though simple imitation in a virtual agent is effective, it can lose its effect over time, once the human subject notices the slightest bit of unjustified repetition, or once it becomes apparent that there is actually no communication going on - which is bound to happen eventually.

The Point
I had originally suggested that the Gestural Turing Test be done with one single point (I love a challenge). It's hard to detect any indication of human anatomy from a single point. My hypothesis is this: given enough time to interact, two humans who want to convince each other that they are alive will eventually succeed, even if all they have is a single point to move around. The existence of a human mind would eventually be revealed, and this would require a very human-like process of constructing, and then climbing, the ladder of semiosis: spontaneously generating the visuo-dynamic symbols with which to communicate. Whether an AI can achieve this is another matter entirely. But in either case, I believe that the apprehension of a living other could be achieved using pure motion. This is the essence of the Gestural Turing Test, as well as the basis for designing motion algorithms that make virtual characters more believable.