Only a decade ago eye- and gaze-tracking technologies
using cumbersome and expensive equipment were confined to university research
labs. However, rapid technological advancements (increased processor speed,
advanced digital video processing) and mass production have both lowered
the cost and dramatically increased the efficacy of eye- and gaze-tracking
equipment. This opens up a whole new area of interaction mechanisms with
museum content. In this paper I will describe a conceptual framework for
an interface, designed for use in museums and galleries, which is based
on non-invasive tracking of a viewer's gaze direction. Following the simple
premise that prolonged visual fixation is an indication of a viewer's
interest, I have dubbed this approach the intention-based interface.

Keywords: eye tracking, gaze tracking, intention-based
interface

Introduction

In humans, gaze direction is probably the oldest and earliest
means of communication at a distance. Parents of young infants
often try to 'decode' from an infant's gaze direction the
needs and interests of their child. Thus, gaze direction can be viewed
as a first instance of pointing. A number of developmental studies
(Scaife & Bruner, 1975; Corkum & Moore, 1988; Moore, 1999) show that
even very young infants actively follow and respond to the gaze direction
of their caregivers. The biological significance of eye movements and
gaze direction in humans is illustrated by the fact that humans, unlike
other primates, have a visible white area (the sclera) around the pigmented
part of the eye (the iris, covered by the transparent cornea; see Figure 1). This
makes even discreet shifts of gaze direction very noticeable (as is painfully
obvious in cases of 'lazy eye').

Figure 1. Comparison of the human and non-human (chimpanzee) eye.
Although many animals have pigmentation that accentuates the eyes, the
visible white area of the human eye makes it easier to interpret gaze
direction.

Eye contact is one of the first behaviors to develop in
young infants. Within the first few days of life, infants are capable
of focusing on their caregiver's eyes. (Infants are physiologically shortsighted,
with an ideal focusing distance of 25-40 cm. This distance corresponds
to the distance between the mother's and infant's eyes when the baby is
held at breast level. Everything else is conveniently a blur.) Within
the first few weeks, establishing eye contact with the caregiver produces
a smiling reaction (Stewart & Logan, 1998). Eye contact and gaze direction
continue to play a significant role in social communication throughout
life. Examples include:

regulating conversation flow;

regulating intimacy levels;

indicating interest or disinterest;

seeking feedback;

expressing emotions;

influencing;

signaling and regulating social
hierarchy;

indicating submissiveness or
dominance.

Thus, it is safe to assume that humans have a large number
of behaviors associated with eye movements and gaze direction. Some
of these are innate (orientation reflex, social regulation), and some
are learned (extracting information from printed text, interpreting
traffic signs).

Our relationship with works of art is essentially a social
and intimate one. In the context of designing a gaze tracking-based
interface with cultural heritage information, innate visual behaviors
may play a significant role precisely because they are social and emotional
in nature and have the potential to elicit a reaction external
to the viewer. In this paper I will provide a conceptual framework for
the design of gaze‑based interactions with cultural heritage information
using the digital medium. Before we proceed, it is necessary to clarify
some of the basic physiological and technological terms related to eye-
and gaze-tracking.

Eye Movements and Visual Perception

While we are observing the world, our subjective experience
is that of a smooth, uninterrupted flow of information and a sense of
the wholeness of the visual field. This, however, contrasts sharply
with what actually happens during visual perception. Our eyes are stable
only for brief periods of time (200-300 milliseconds) called fixations.
Fixations are interspersed with rapid, jerky movements called saccades,
during which no new visual information is acquired. Furthermore,
the information gained during the periods of fixations is clear and
detailed only in a small area of the visual field - about 2° of
visual angle. Practically, this corresponds to the area covered by one's
thumb at arm's length. The rest of the visual field is fuzzy but provides
enough information for the brain to plan the location of the next fixation
point. The problems that arise because of the discrepancy between our
subjective experience and the data gained by using eye-tracking techniques
can be illustrated by the following example:

[Figure: a 'garden path' sentence annotated with the sequence of fixations (numbered 1 to 7) and their durations in milliseconds]

The sentence above is a classical example of a "garden
path" sentence that (as you probably have experienced) initially leads
the reader to a wrong interpretation (Bever, 1970). The eye-tracking
data provide information about the sequence of fixations (numbered 1
to 7) and their duration in milliseconds. The data above provide some
clues about the relationship between visual analysis during reading
and eye movements. For example, notice the two retrograde
saccades (numbered 6 and 7) that occurred after the initial reading
of the sentence. They more than double the total fixation time
on the part of the sentence necessary for disambiguating its meaning.
Nowadays there is a general consensus in the eye-tracking community
that the number and the duration of fixations are related to the cognitive
load imposed during visual analysis.

Figure 2. Path (1) corresponds to free exploration. Path (2) was
obtained when subjects were asked to judge the material status of the
family, and path (3) when they were asked to guess the ages of the
individuals. Partially reproduced from Yarbus, A. L. (1967).

Eye-tracking studies of reading are very complex but have
the advantage of allowing fine control of different aspects of the visual
stimuli (complexity, length, exposure time, etc.). Interpretation of
eye movement data during scene analysis is more complicated because
visual exploration strategy is heavily dependent on the context of
exploration. Data (Figure 2) from an often-cited study by Yarbus
(1967) illustrate differences in visual exploration paths during interpretation
of Ilya Repin's painting "They Did Not Expect Him" (also known as "The
Unexpected Guest").

Brief History of Eye- and Gaze-Tracking

The history of documented eye- and gaze-tracking studies
is over a hundred years old (Javal, 1878). It is a history of technological
and theoretical advances where progress in either area would influence
the other, often producing a burst of research activity that would subsequently
subside due to the uncovering of a host of new problems associated with
the practical uses of eye-tracking.

Not surprisingly, the first eye-tracking
studies used other humans as tracking instruments by utilizing strategically
positioned mirrors to infer gaze direction. Experienced psychotherapists
(and socially adept individuals) still use this technique, which, however
imperfect it may seem, may yield a surprising amount of useful information.
Advancements in photography led to the development of a technique based
on capturing the light reflected from the cornea on a photographic plate
(Dodge & Cline, 1901). Some of these techniques were fairly
invasive, requiring the placement of a reflective white dot directly onto
the eye of the viewer (Judd, McAllister & Steele, 1905) or of a tiny
mirror attached to the eye with a small suction cup (Yarbus, 1967).
In the field of medicine, a technique was developed (electro-oculography,
still in use for certain diagnostic procedures) that allowed the registration
of eyeball movements by means of electrodes positioned around
the eye. Most of the described techniques required the viewer's head
to be motionless during eye tracking and used a variety of devices like
chin rests, head straps and bite-bars to constrain the head movements.
The major innovation in eye tracking was the invention of a head-mounted
eye tracker (Hartridge & Thompson, 1948). With technological advances
that reduced the weight and size of an eye tracker to that of a laptop
computer, this technique is still widely used.

Most eye tracking techniques developed before the 1970s
were further constrained by the fact that data analysis was possible
only after the act of viewing. It was the advent of mini- and
microcomputers that made real-time eye tracking possible. Although eye
tracking was widely used in studies of perceptual and cognitive processes, it was only with
the proliferation of personal computers in the 1980s that it
was applied as an instrument for the evaluation of human-computer interaction
(Card, 1984). Around the same time, the first proposals for the use
of eye tracking as a means for user-computer communication appeared,
focusing mostly on users with special needs (Hutchinson, 1989; Levine,
1981). Propelled by rapid technological advancements, this trend continued,
and in the past decade a substantial amount of effort and money was
devoted to the development of eye- and gaze-tracking mechanisms for
human-computer interaction (Vertegaal, 1999; Jacob, 1991; Zhai, Morimoto
& Ihde, 1999). A detailed analysis of these studies is beyond the
scope of this paper, and I will refer to them only insofar as they provide
reference points for my proposed design. Interested readers are encouraged
to consult several excellent publications that deal with the topic in
much greater detail (Duchowski, 2002; Jacob & Karn, 2003, in press).

Eye and Gaze Tracking in a Museum Context

The use of eye and gaze tracking in a museum context extends
beyond interactions with the digital medium. Eye tracking data can prove
to be extremely useful in revealing how humans observe real artifacts
in a museum setting. The sample data and the methodology from a recent
experiment conducted in the National Gallery in London (in conjunction
with the Institute for Behavioural Studies) can be seen on the Web. Although some of
my proposed gaze-based interaction solutions can be applied to the viewing
of real artifacts (for example, to get more information about a particular
detail that a viewer is interested in), the main focus of my discussion
will be on the development of affordable and intuitive gaze-based interaction
mechanisms with(in) the digital medium. The main reason for this decision
is the issue of accessibility to cultural heritage information. Although
an impressive 4,000 people participated in the National Gallery experiment,
they all had to be there at a certain time. I am not disputing
the value of experiencing the real artifact, but the introduction of
the digital medium has dramatically shifted the role of museums from
collection & preservation to dissemination & exploration.
Recent advancements in Web-based technologies make it possible for museums
to develop tools (and social contexts) that allow them to serve as centers
of knowledge transfer for both local and virtual communities. My proposal
will focus on three issues:

problems associated with the use of gaze-tracking
data as an interaction mechanism;

a conceptual framework for the development
of a gaze-based interface; and

currently existing (and affordable) technologies
that could support non-intrusive eye and gaze tracking in a museum context.

Problems associated with gaze tracking input as
an interaction mechanism

The main problem associated with the use of eye movements
and gaze direction as an interaction mechanism is known in the literature
as the "Midas touch" or "clutch" problem (Jacob, 1993). In simple terms,
if merely looking at something triggers an action, that action will be
triggered even when one is just observing a particular
element on the display (or projection). The problem has been addressed
numerous times in the literature, and there are many proposed technical
solutions. A detailed analysis and overview of these solutions is beyond
the scope of this paper. I will present here only a few illustrative
examples.

One solution to the Midas touch problem,
developed by the Risø National Research Laboratory, was to separate
the gaze-responsive area from the observed object. The switch (aptly
named an EyeCon) is a square button placed next to the object that one
wants to interact with. When the button is fixated (ordinarily for half
a second), it 'acknowledges' the viewer's intent to interact by playing
an animated sequence depicting a gradually closing eye. The completely
closed eye is equivalent to pressing the button (see Figure 3).
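As a rough sketch (not the Risø implementation), such a dwell-activated button can be reduced to a timer that accumulates the time the gaze spends over the button and fires when the half-second threshold mentioned above is reached; the class and parameter names below are hypothetical.

```python
import time

class DwellButton:
    """Sketch of an EyeCon-like button: looking at it for dwell_s seconds
    'presses' it; looking away resets the closing-eye animation."""

    def __init__(self, rect, action, dwell_s=0.5):
        self.rect = rect          # (x, y, width, height) of the button
        self.action = action      # callable executed when the dwell completes
        self.dwell_s = dwell_s
        self._enter_time = None   # when the gaze first landed on the button

    def contains(self, x, y):
        bx, by, bw, bh = self.rect
        return bx <= x <= bx + bw and by <= y <= by + bh

    def update(self, gaze_x, gaze_y, now=None):
        """Call once per gaze sample; returns progress 0.0-1.0 for the
        'closing eye' animation and triggers the action at 1.0."""
        now = time.monotonic() if now is None else now
        if not self.contains(gaze_x, gaze_y):
            self._enter_time = None           # gaze left the button: reset
            return 0.0
        if self._enter_time is None:
            self._enter_time = now
        progress = min((now - self._enter_time) / self.dwell_s, 1.0)
        if progress >= 1.0:
            self._enter_time = None           # re-arm after firing
            self.action()
        return progress
```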

One problem with this technique stems from the solution
itself: the separation of selection and action.
Another problem is the interruption of the flow of interaction:
in order to select (interact with) an object, the user has to
fixate the action button for a period of time. This undermines the
unique quality of gaze direction as the fastest and most natural way of pointing
and selecting (focusing).

Another solution to the same problem (with very promising
results) was to provide the 'clutch' for interaction through another
modality: voice (Glenn, Iavecchia, Ross, Stokes, Weiland, Weiss & Zakland,
1986) or manual (Zhai, Morimoto & Ihde, 1999) input.

The second major problem with eye movement input is the
sheer volume of data collected during eye tracking and its meaningful
analysis. Since individual fixations carry very little meaning on their
own, a wide range of eye-tracking metrics has been developed over the
past 50 years. An excellent and very detailed overview of these metrics
can be found in Jacob and Karn (2003, in press). Here, I will mention only a few
that may be used to infer a viewer's interest or intent (a small code
sketch following the list illustrates how they can be computed from fixation data):

number of fixations:
a concentration of a large number of fixations in a certain area may
be related to a user's interest in the object or detail presented in
that area when viewing a scene (or a painting). Repeated, retrograde
fixations on a certain word while reading text are taken to be indicators
of increased processing load (Just & Carpenter, 1976).

gaze duration: gaze is
defined as a number of consecutive fixations in an area of interest.
Gaze duration is the total of fixation durations in a particular area.

number of gazes: this
is probably a more meaningful metric than the number of fixations. Combined
with gaze duration, it may be indicative of a viewer's interest.

scan path: the scan path
is a line connecting consecutive fixations (see Figure 2, for example).
It can reveal a viewer's visual exploration strategies and
is often very different for experts and novices.
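A minimal sketch of how these metrics can be computed from a fixation list follows; the rectangular area-of-interest representation and the function names are assumptions made for illustration.

```python
# Sketch: computing the metrics above from a fixation list, where each
# fixation is (start_ms, duration_ms, x, y) and an area of interest (AOI)
# is a rectangle (x, y, width, height).

def in_aoi(fix, aoi):
    _, _, x, y = fix
    ax, ay, aw, ah = aoi
    return ax <= x <= ax + aw and ay <= y <= ay + ah

def fixation_count(fixations, aoi):
    """Number of fixations landing inside the AOI."""
    return sum(1 for f in fixations if in_aoi(f, aoi))

def gazes(fixations, aoi):
    """Split the fixation sequence into gazes: runs of consecutive
    fixations that fall inside the AOI."""
    runs, current = [], []
    for f in fixations:
        if in_aoi(f, aoi):
            current.append(f)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs

def gaze_durations(fixations, aoi):
    """Total fixation time (ms) of each gaze in the AOI."""
    return [sum(f[1] for f in run) for run in gazes(fixations, aoi)]

def scan_path(fixations):
    """The polyline connecting consecutive fixation locations."""
    return [(f[2], f[3]) for f in fixations]
```

In practice, the areas of interest would correspond to details of a painting or regions of the projection, and the metrics would be recomputed continuously as new fixations arrive.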

The problem of finding the right metric for interpretation
of eye movements in a gallery/museum setting is more difficult than
in a conventional research setting because of the complexity of the
visual stimuli and the wide individual differences of users. However,
the problem may be made easier to solve by dramatically constraining
the number of interactions offered by a particular application and making
them correspond to the user's expectations. For example, one of the
applications of the interface I will propose is a simple gaze-based
browsing mechanism that allows the viewer to quickly and effortlessly
leaf through a museum collection (even if he/she is a quadriplegic and
has retained only the ability to move the eyes).

Gaze-based interface for museum content

Needless to say, even a gaze-based interface that is specifically
designed for museum use has to provide a solution for general problems
associated with the use of eye movement-based interactions. I will approach
this issue by analyzing three different strategies that may lead to
a solution of the Midas touch problem. These strategies differ
in terms of the component of the interaction mechanism on which they primarily rely:

time

location, and

user action

It is clear that any interaction involves time,
space and actions, so the above classification should be taken to refer
to the key component of the interface solution. Each of these solutions
has to accommodate two modes of operation:

the observation mode, and

the action (command) mode

The viewer should have a clear indication as to which
mode is currently active, and the interaction mechanism should provide
a way to switch between the modes quickly and effortlessly.

Time-based interfaces

At first glance, a time-based interface seems like a
good choice (a bias evident even in my own choice of a title for this
paper). An ideal setup (for which I will provide more details in the
following sections) for this type of interface would be a high-resolution
projection of a painting on the screen with an eye-tracking system
concealed in a small barrier in front of the user. An illustration of
a time-based interaction mechanism is provided in Figure 4. The gaze
location is indicated by a traditional cursor as long as it remains
in a non-active (in this case, outside of the painting) area. When the
user shifts the gaze to the gaze-sensitive object (painting), the cursor
changes its shape to a faint circle, indicating that the observed object
is aware of the user's attention. I have chosen the circle shape
because it does not interfere with the viewer's observation, even though
it clearly indicates potential interaction. As long as the viewer continues
visual exploration of the painting there is no change in status. However,
if the viewer decides to focus on a certain area for a predetermined
period of time (600 ms), the cursor/circle starts to shrink (zoom),
indicating the beginning of the focusing procedure.

Figure 4. The cursor changes at position (A) into a
focus area, indicating that the object is 'hot'. Position (B) marks a
period of relative immobility of the cursor and the beginning of the
focusing procedure. The relative change in the size of the focus area (C)
indicates that focusing is taking place. The appearance of concentric
circles at time (D) indicates imminent action. The viewer can exit the
focusing sequence at any time by moving the point of observation outside
of the current focus area.

If the viewer continues to fixate
on the area of interest, the focusing procedure continues for the next
400 milliseconds, ending with a 200 millisecond long signal of imminent
action. At any time during the focusing sequence (including the
imminent action signal), the viewer can return to observation mode by
moving the gaze away from the current fixation point. In the scenario
depicted above (and in general, for time-based interactions) it is desirable
to have only one pre-specified action relevant to the context
of viewing. For example, the action can be that of zooming-in to the
observed detail of the painting (see Figure 6), or proceeding to the
next item in the museum collection. The drawbacks of time-based interaction
solutions triggered by focusing on the object or area of interest
are as follows:

the problem of going back
to observation mode. This means that the action triggered by focusing
on a certain area has to be either self-terminating (as is the
case with the 'display the next artifact' action, where the application
switches automatically back to the observation mode), or one has to
provide a simple mechanism that would allow the viewer to return to
the observation mode (for example, by moving the gaze focus outside
of the object boundary);

the problem of choice between
multiple actions. Using the time-based mechanism, it is possible
to trigger different actions. By changing the cursor/focus shape, one
can also indicate to the viewer which action is going to take place.
However, since the actions are tied to the objects themselves, the viewer
essentially has no choice but to accept the pre-specified action. This
may not be a problem in a context where pre-specified actions are meaningful
and correspond to the viewer's expectations. However, it does limit
the number of actions one can 'pack' into an application and can create
confusion in cases where two instances of focusing on the same object
trigger different actions.

the problem of interrupted
flow or waiting. Inherent to time-based solutions is the
problem that the viewer always has to wait for an action to be executed.
In my experience, after getting acquainted with the interaction mechanism,
the waiting time becomes subjectively longer (because the users know
what to expect) and often leads to frustration. The problem can be diminished
to some extent by progressively shortening the duration of focusing
necessary to trigger the action. However, at some point it can lead
to another source of frustration since the viewer may be forced to constantly
shift the gaze around in order to stay in the observation mode.

In spite of the above-mentioned problems, time-based gaze
interactions can be an effective solution for museum use where longer
observation of an area of interest provides the viewer with more information.
Another useful approach is to use the gaze direction as input for the
delivery of additional information through another modality. In this
case, the viewer does not need to get visual feedback related to his/her
eye movements (which can be distracting in its own right). Instead, focusing
on an area of interest may trigger voice narration related to the viewer's
interest. For an example of this technique in the creation of a gaze-guided
interactive narrative, see Starker & Bolt (1990).
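In code, the time-based mechanism described in this section reduces to a small dwell state machine. The sketch below uses the 600/400/200 ms intervals from the text; the stability radius, the mode names and the callback are hypothetical.

```python
import math

class TimeBasedFocus:
    """Sketch of the time-based interaction: 600 ms of relative gaze
    stability starts the focusing (shrinking circle), 400 ms more of
    focusing, then a 200 ms 'imminent action' signal before the single
    pre-specified action fires.  Moving the gaze away at any point
    returns to observation mode."""

    DWELL_MS, FOCUS_MS, IMMINENT_MS = 600, 400, 200

    def __init__(self, on_action, stability_radius_px=40):
        self.on_action = on_action        # e.g. zoom in, or show next artifact
        self.radius = stability_radius_px
        self._anchor = None               # (x, y) where the dwell started
        self._elapsed = 0.0               # ms spent near the anchor

    def update(self, x, y, dt_ms, over_object):
        """Feed one gaze sample; returns the current mode as a string."""
        if not over_object:
            self._anchor, self._elapsed = None, 0.0
            return "outside"                      # ordinary cursor
        if self._anchor is None or math.dist(self._anchor, (x, y)) > self.radius:
            self._anchor, self._elapsed = (x, y), 0.0
            return "observing"                    # faint circle, no countdown
        self._elapsed += dt_ms
        if self._elapsed < self.DWELL_MS:
            return "observing"
        if self._elapsed < self.DWELL_MS + self.FOCUS_MS:
            return "focusing"                     # circle shrinks
        if self._elapsed < self.DWELL_MS + self.FOCUS_MS + self.IMMINENT_MS:
            return "imminent"                     # concentric circles
        self._anchor, self._elapsed = None, 0.0
        self.on_action(x, y)                      # the one pre-specified action
        return "observing"
```

The single pre-specified action (for example, zooming in or displaying the next artifact) is passed in as on_action, in keeping with the recommendation above to tie one meaningful action to the context of viewing.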

Location-based interfaces

Another traditional way of solving the "clutch" problem
in gaze-based interfaces is to separate the modes of observation and
action by using controls that are in the proximity of the area of interest
but do not interfere with visual inspection. I have already described
the EyeCons (Figure 3) designed by the Risø National Research Laboratory
in Denmark (for a detailed description see Glenstrup and Engell-Nielsen,
1995). In the following section I will first expand on the EyeCon design
and then propose another location-based interaction mechanism. The first
approach is illustrated in Figure 5.

Figure 5. Movement of the cursor (A) into the gaze-sensitive
area (B) slides the action palette (C) into view.

Fixating any of the buttons is equivalent to a
button press and chooses the specified action, which is executed without
delay when the gaze returns to the object of interest. The viewer can
also return to observation mode by choosing the 'no action' button. The action
palette slides out of view as soon as the gaze moves out of the area
(B).

The observation area (the drawing) and the controls (buttons)
are separated. At first glance, the design seems very similar to that
of the EyeCons, but there are some enhancements that make the interactions
more efficient. First, the controls (buttons) are located on a configurable
'sliding palette', a mechanism that was adopted by the most widely used
operating system (Windows) in order to provide users with more 'screen
real estate'. The reason for doing this in a museum context is also
to minimize the level of distraction while observing the artifact. Shifting
the gaze to the side of the projection space (B) slides the action palette
into view. The button that is currently fixated becomes immediately
active (D), signaling the change of mode by displaying a focus ring
and changing color. This is a significant difference compared to
the EyeCon design, which combines both location- and time-based mechanisms
to initiate action. Moving the gaze back to the object leads to the
execution of the specified action (selection, moving, etc.). Figure 6 illustrates
the outcome of choosing the 'zoom' action from the palette. The eye-guided
cursor becomes a magnifying glass allowing close inspection of the artifact.
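A minimal sketch of this location-based mechanism is given below: the palette is visible while the gaze is inside an edge zone, the fixated button is armed immediately (no dwell), and the armed action executes as soon as the gaze returns to the artifact. The rectangle representation and all names are assumptions made for illustration.

```python
class ActionPalette:
    """Sketch of the location-based mechanism: a sliding palette of
    gaze-sensitive buttons at the edge of the projection."""

    def __init__(self, edge_zone, buttons, none_button="no action"):
        self.edge_zone = edge_zone            # rectangle at the display edge
        self.buttons = buttons                # {name: rectangle}
        self.none_button = none_button
        self.visible = False
        self.armed = None                     # action chosen on the palette

    @staticmethod
    def _inside(x, y, rect):
        rx, ry, rw, rh = rect
        return rx <= x <= rx + rw and ry <= y <= ry + rh

    def update(self, x, y, over_artifact):
        """Feed one gaze position; returns the action to execute now, if any."""
        if self._inside(x, y, self.edge_zone):
            self.visible = True               # palette slides into view
            for name, rect in self.buttons.items():
                if self._inside(x, y, rect):
                    # The fixated button is armed at once (focus ring, color change).
                    self.armed = None if name == self.none_button else name
            return None
        self.visible = False                  # palette slides out of view
        if over_artifact and self.armed:
            action, self.armed = self.armed, None
            return action                     # e.g. 'zoom' executed without delay
        return None
```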

Figure 6. After choosing the desired action (see Figure
5), returning the gaze to the object executes the action without delay.
The detail above shows the 'zoom-in' tool, which becomes 'tied' to the
viewer's gaze and allows close inspection of the artifact.

One can conceptually expand location-based interactions
by introducing the concept of an active surface. Buttons can
be viewed as being essentially single-action locations (switches). It
really does not matter which part of the button one is focusing on (or
physically pressing) – the outcome is always the same. In contrast,
a surface affords assigning meaning to a series of locations (fixations)
and makes possible incremental manipulation of an object.

Figure 7 provides an example of a surface-based interaction
mechanism. Interactive surfaces are discreetly marked on the area surrounding
the object. For the purpose of illustration, a viewer's scan path (A)
is shown superimposed over the object and indicates gaze movement towards
the interactive surface. Entering the active area is marked by the appearance
of a cursor in a shape that is indicative of the possible action (D).
The appearance of the cursor is followed by a brief latency period (200-300
ms) during which the viewer can return to the observation mode by moving
the gaze outside of the active area. If the focus remains in the active
area (see Figure 8), any movement of the cursor along the longest axis
of the area will be incrementally mapped onto an action sequence
– in this case, rotation of the object.
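A sketch of how such an active surface might map gaze movement onto incremental rotation is given below; the latency period is the 200-300 ms mentioned above, while the degrees-per-pixel gain and the names are hypothetical.

```python
class RotationSurface:
    """Sketch of an active surface: once the gaze has stayed inside the
    surface past a short latency, movement along the surface's long axis
    is mapped incrementally onto object rotation."""

    def __init__(self, rect, degrees_per_px=0.5, latency_ms=250):
        self.rect = rect                  # (x, y, width, height); width is the long axis
        self.degrees_per_px = degrees_per_px
        self.latency_ms = latency_ms
        self._inside_ms = 0.0
        self._last_x = None

    def update(self, x, y, dt_ms):
        """Returns the rotation increment (degrees) for this gaze sample."""
        rx, ry, rw, rh = self.rect
        if not (rx <= x <= rx + rw and ry <= y <= ry + rh):
            self._inside_ms, self._last_x = 0.0, None
            return 0.0                    # back to observation mode
        self._inside_ms += dt_ms
        if self._inside_ms < self.latency_ms:
            return 0.0                    # latency period: viewer may still back out
        if self._last_x is None:
            self._last_x = x
            return 0.0
        delta, self._last_x = x - self._last_x, x
        return delta * self.degrees_per_px   # incremental rotation of the object
```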

Figure 7. Surface-based interaction mechanism. The viewer's
scan path is visible at (A). Two interactive surfaces (B and C) are discreetly
marked on the projection. Moving the gaze into the area of an interactive
surface is marked by the appearance of a cursor with a shape indicative
of the possible action (D).

Figure 8. If the viewer's gaze (as indicated by the cursor
position at A) remains within the interactive surface (B), any gaze movement
within the surface will lead to incremental action – in this case,
rotation of the object (C).

The advantages of surface-based interaction mechanisms
are the introduction of more complex, incremental action sequences into
eye movement input and the possibility of rapid shifts between the observation
and action modes. The drawback is that the number of actions is limited
and that the surfaces, although visually non-intrusive, still claim
a substantial portion of the display.

Action-based interfaces

Building on the previous two models, one can further expand
the conceptual framework for gaze-based interfaces. This time I will
focus on the gaze action as a mechanism for switching between
the observation and the active (command) mode. Analysis of the previously
described surface-based model reveals that it can be viewed as an
intermediary step between the location-based and action-based interfaces.
In this model, although the shift between the observation and action
mode is dependent on the location of gaze focus, the control
of interaction is based on gaze action (moving the focus/cursor over
gaze-sensitive surface). Thus, the last step in our analysis is to explore
the possibility of using predominantly gaze-based actions as
a control mechanism. This may seem like slippery ground because physiologically
our visual behavior is mostly geared towards collecting information
and not acting upon the world. The exception of a kind is in the domain
of sexual and social behaviors where gaze direction and duration may
literally have physical consequences by signaling attraction, dominance,
submissiveness, etc. Fine literature abounds with examples describing
gazes as having a tangible effect ("his piercing gaze," "he felt her
gaze boring two little holes into the back of his neck…," "her angry
gaze whipped across the room, trying to find out who had done this to
her…," to mention a few). Our ability to transfer knowledge from one
sensory domain to another modality will be the key component in the
proposed outline of an action-based gaze interface.

In eye-tracking literature, a gaze is most often defined
as a number of consecutive fixations in a certain area. This metric
emphasizes the location and the duration characteristics
of the gaze and can be extremely useful in inferring the viewer's interest
or gauging the complexity of the stimulus. However, in my proposal I
would like to focus on two, often neglected, characteristics of a moving
gaze that can be consciously used by a viewer to indicate his/her
intention. These are:

the direction of gaze movement,
and

the speed of gaze movement.

For technical purposes a moving gaze can be defined as
a number of consecutive fixations progressing in the same direction.
It corresponds roughly to longer, straight parts of a scan path and
is occasionally referred to as a sweep (Aaltonen et al., 1998).
The reason for choosing these characteristics is twofold. First, the eyes
can move much faster than the hand (and there is evidence in the literature
that eye-pointing is significantly faster than mouse pointing; see Sibert
and Jacob, 2000). Second, as mentioned before, directional gaze movement
is often used in social communication. For example, we often indicate
in a conversation exactly 'who' we are talking about by repeatedly shifting
the gaze in the direction of the person in question.
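For illustration, a moving gaze (sweep) of this kind can be detected from the fixation list by checking that the last few fixations progress in roughly the same direction; the sketch below returns the sweep's direction, amplitude and speed, with all thresholds hypothetical.

```python
import math

def detect_sweep(fixations, min_fixations=3, max_angle_spread_deg=20):
    """Sketch: decide whether the most recent fixations form a 'sweep',
    i.e. a run of consecutive fixations progressing in the same direction.
    Each fixation is (start_ms, duration_ms, x, y).
    Returns (direction_deg, amplitude_px, speed_px_per_ms) or None."""
    if len(fixations) < min_fixations:
        return None
    recent = fixations[-min_fixations:]

    def step_angle(a, b):
        return math.degrees(math.atan2(b[3] - a[3], b[2] - a[2]))

    def angle_diff(a, b):
        # Smallest absolute difference between two angles, wrap-around safe.
        return abs((a - b + 180.0) % 360.0 - 180.0)

    angles = [step_angle(a, b) for a, b in zip(recent, recent[1:])]
    # Every step must point roughly the same way for this to count as a sweep.
    if any(angle_diff(a, angles[0]) > max_angle_spread_deg for a in angles):
        return None
    dx = recent[-1][2] - recent[0][2]
    dy = recent[-1][3] - recent[0][3]
    amplitude = math.hypot(dx, dy)
    elapsed_ms = (recent[-1][0] + recent[-1][1]) - recent[0][0]
    speed = amplitude / elapsed_ms if elapsed_ms else 0.0
    return math.degrees(math.atan2(dy, dx)), amplitude, speed
```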

In order to create an efficient gaze-based interface,
one has to be able to replicate the basic mouse-based actions used in
the traditional graphical user interface (GUI). These are: pointing
(cursor over), selection (mouse down), dragging (mouse
down + move) and dropping (mouse up). I will also propose the
inclusion of yet another non-traditional action, which I introduced
in interface design a while ago (Milekic, 2000) and which proved to
work extremely well as an intuitive browsing mechanism. This is the
action of throwing, which depends on the speed of movement
of a selected object. Compared to the traditional interface, the throwing
action is an expansion of the action of dragging an object. As long
as the speed of dragging remains within a certain limit, one can move
an object anywhere on the screen and drop it at desired location. However,
if one 'flicks' the object in any direction, the object is released
and literally 'flies away' (most often, to be replaced by another object).
I have implemented this mechanism in a variety of mouse-, touchscreen-
and gesture-based installations in museums and it has been successfully
used by widely diverse audiences, including very young children. Subjectively,
the action is very intuitive and natural, and the feeling can be best
compared to that of sliding a glass on a polished surface (a skill that
many bartenders hone to perfection). In the following sections I will
describe each of the gaze-based actions.

Gaze-pointing (Figure 9) is the easiest function
to replicate in a gaze-based interface. It essentially consists of a
visual cue that indicates to the viewer which area of the display is
currently observed. Although one can use the traditional cursor for
this purpose, it is desirable to design a cursor that will not interfere
with observation. Dynamic change of cursor shape when moving over different
objects can also be used to indicate whether an object is gaze-sensitive
and to specify the type of action one can initiate (this technique is
used in surface-based interface, described above; see Figure 4, for
example). I have chosen a simple dashed circle as an indicator of the
current gaze location. The pointing action is maintained as long as there
are no sudden, substantial gaze shifts in a specific direction. If such
a shift occurs, the tracking algorithm determines the direction
of gaze movement and, if necessary, initiates the appropriate action.

Figure 9. Gaze-pointing. The viewer can observe the
artifact with the pointing cursor (dashed circle) indicating the current
gaze location. Sweeping gazes across the scene are possible as long
as they are not in upward direction and end in the 30° angle strip.

This does not mean that the viewer is limited to slow
(and unnatural) observation. In fact, switching from observation to
action mode (selection) occurs only if movement of sufficient
amplitude occurs in an upward direction and ends up in a fairly narrow
area spanning approximately 30° above the current focus area. This means
that viewers can, more or less, maintain a normal observation pattern,
even if it includes sweeping gaze shifts, as long as they don't end
up in the critical area.

Gaze-selection (Figure 10) is an action initiated
by a sudden upward gaze shift. The action is best described as (and subjectively
feels like) the act of upward stabbing, or 'hooking', of the object.
In a mouse-based interface the selection is a separate action –
that is, one can just select an object, or select-drag-drop it somewhere
else, or de-select it. In a gaze-based interface, what happens after
the selection of an object will depend on the context of viewing. When
multiple objects are displayed, the selection mechanism can act as a
self-terminating action, making it possible for the viewer to select
a subset of objects. In this case, highlighting the object would indicate
the selection. However, in the museum context (assuming that the viewers
will most often engage in observation of a single artifact) object selection
may just be a prelude to the action of moving (dragging). In this case
the object becomes, figuratively speaking, 'hooked' to the end of the
viewer's gaze, as indicated by a change of the cursor's shape to that
of a target.

Figure 10. Gaze-selection. Shifting the gaze rapidly
upwards within the 30° strip triggers the selection process. The cursor
changes the shape to that of a target and positions itself at the center
of the object as a prelude to the action of gaze-dragging.

Gaze-dragging (Figure 11). Once the object has
been selected ('hooked' to the viewer's gaze), it will follow the viewer's
gaze until it is 'dropped' at another location. This action is meaningful
in cases when the activity involves the repositioning of multiple objects
(for example, assembling a puzzle). In the scenario depicted above,
the viewer can 'throw away' the current object and get a new one.

Figure 11. Gaze-dragging. The painting is 'hooked'
to the viewer's gaze and follows its direction. At this stage the viewer
can decide either to 'drop' the painting at another location (see Figure
12) or to 'throw' away the current one and get a new artifact.

Figure 12. Gaze-dropping. The action of dropping an
object is the opposite of 'hooking' it. A quick downward gaze movement
releases the object and switches the application into observation mode.

Gaze-throwing (Figure 13) is a new interaction
mechanism that allows efficient browsing of visual databases with a
variety of input devices, including gaze input. An object that has been
previously selected ("hooked") will follow the viewer's gaze as long
as the speed of movement does not exceed a certain threshold. A quick
glance to the left or the right will release the object and it will
'fly away' from the display to be replaced by a new artifact.

Figure 13. Gaze-throwing. 'Throwing' an object away
is accomplished by moving the gaze rapidly to the left or to the right.
Once the object reaches threshold speed it is released and 'flies away'.
A new artifact floats to the center of display.

The objects appear in a sequential order, so if a viewer
accidentally throws an object away, it can be recovered by throwing
the next object in the opposite direction.
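Taken together, the gaze actions above can be driven by a single classification step applied to each detected sweep (for instance, the output of the detect_sweep sketch earlier). The 30° upward strip comes from the description of gaze-selection; the amplitude and throw-speed thresholds, the coordinate convention (90° meaning straight up) and the mode names are hypothetical.

```python
def classify_gaze_action(direction_deg, amplitude_px, speed_px_per_ms,
                         holding_object,
                         min_amplitude_px=120, throw_speed_px_per_ms=1.5):
    """Sketch: map one detected sweep onto the gaze actions described above.
    Convention: direction_deg = 90 is straight up, -90 is down, 0/180 is
    horizontal (flip the y component first if your tracker reports screen
    coordinates with y growing downwards)."""
    if amplitude_px < min_amplitude_px:
        return "observe"                    # ordinary visual exploration
    if not holding_object:
        # Gaze-selection ('hooking'): a sweep ending within the 30-degree
        # strip above the current focus area.
        return "select" if abs(direction_deg - 90) <= 15 else "observe"
    # An object is already hooked to the viewer's gaze.
    if abs(direction_deg + 90) <= 15:
        return "drop"                       # quick downward movement releases it
    horizontal = abs(direction_deg) <= 15 or abs(direction_deg) >= 165
    if horizontal and speed_px_per_ms >= throw_speed_px_per_ms:
        return "throw"                      # fast left/right flick: next artifact
    return "drag"                           # otherwise the object follows the gaze
```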

To summarize, action-based gaze input mechanisms have
the advantage of allowing the viewer to act upon the object at
will, without time or location constraints. The mechanism is simple
and intuitive because it is analogous to natural actions in other modalities.
The best way to think about action-based gaze input is as a kind of
eye-graffiti. The vocabulary of suggested gaze-gestures
for eye input is presented in Figure 14. It is similar to the text input
mechanism used for Palm personal organizers where the letters of the
alphabet are reduced to corresponding simplified gestures. The fact
that millions of users were able to adopt this quick and efficient text
input mechanism is an indication that the development of eye-graffiti
has significant potential for gaze based interfaces.

Figure 14. Eye-graffiti. The top row presents the graffiti
used for text input (letters A, B, C, D, E, and F, respectively) in Palm OS-based
personal organizers. The bottom row outlines suggested gaze-gestures that
trigger different actions once an object has been selected.

The dashed circle in the illustration above is not the
visual representation of the cursor, but rather the area used by the
tracking algorithm to calculate the direction and velocity of gaze
movement. The heavy dot indicates the starting point of a gesture.
However, while an action-based gaze input mechanism may seem best suited
for museum applications, the ideal interface is probably a measured
combination of all three approaches.
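One way to think about such an eye-graffiti vocabulary in code is as a table mapping short sequences of quantized strokes to actions, in the spirit of the Palm gestures above; the particular stroke table below is purely illustrative.

```python
def quantize(direction_deg):
    """Reduce a sweep direction (90 = up) to one of four symbolic strokes."""
    d = direction_deg % 360
    if 45 <= d < 135:
        return "U"          # up
    if 135 <= d < 225:
        return "L"          # left
    if 225 <= d < 315:
        return "D"          # down
    return "R"              # right

# Hypothetical gaze-graffiti vocabulary: short stroke sequences performed
# after an object has been selected.
GAZE_GRAFFITI = {
    ("D",):      "drop",
    ("R", "R"):  "next artifact",
    ("L", "L"):  "previous artifact",
    ("U", "R"):  "zoom in",
    ("U", "L"):  "zoom out",
}

def match_gesture(recent_strokes):
    """Return the action whose stroke sequence ends the recent stroke list,
    preferring longer (more specific) gestures."""
    for pattern, action in sorted(GAZE_GRAFFITI.items(),
                                  key=lambda kv: -len(kv[0])):
        if tuple(recent_strokes[-len(pattern):]) == pattern:
            return action
    return None
```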

Current Technologies for Non-Intrusive Eye Tracking

Unlike in laboratory environments, the eye-tracking
technology used in a museum setting has to meet additional specific
requirements. Some of the most obvious ones are:

it should be non-intrusive.
This excludes all eye-tracking devices that use goggles, head-straps,
chin-rests or such.

it should allow natural head
movements that occur during viewing.

it should not require individual
calibration.

it should be able to perform
with a wide variety of eye shapes, contact lenses or glasses.

it should be portable.

it should be affordable.

With the increasing processor speeds of currently available
personal computers, it seems that the most promising eye-tracking technology
is that based on digital video analysis of eye movements. The most commonly
used approach in video-based eye tracking is to calculate the angle
of the visual axis (and the location of the fixation point on the display
surface) by tracking the relative position of the pupil and a speck
of light reflected from the cornea, technically known as the "glint"
(see Figure 15). The accuracy of the system can be further enhanced
by illuminating the eye(s) with low-level infra-red light to produce
the "bright pupil" effect and make the video image easier to process
(B in Figure 15). Infrared light is harmless and invisible to the user.
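As an illustration of the pupil-glint approach (a common formulation, not any particular manufacturer's algorithm), the gaze point can be estimated by mapping the pupil-to-glint vector through a low-order polynomial whose coefficients are obtained during calibration; the function name and coefficient layout below are assumptions.

```python
# Sketch of the pupil-glint approach: the gaze point on the display is
# estimated from the vector between the pupil centre and the corneal
# reflection ('glint') in the camera image.  A second-order polynomial
# mapping, fitted during calibration, is a common choice; the coefficients
# below would come from such a calibration procedure.

def gaze_point(pupil_xy, glint_xy, coeffs_x, coeffs_y):
    """Map the pupil-glint vector (camera pixels) to display coordinates.

    coeffs_x and coeffs_y are 6-element tuples (a0..a5) so that
    screen = a0 + a1*vx + a2*vy + a3*vx*vy + a4*vx**2 + a5*vy**2.
    """
    vx = pupil_xy[0] - glint_xy[0]
    vy = pupil_xy[1] - glint_xy[1]
    terms = (1.0, vx, vy, vx * vy, vx * vx, vy * vy)
    x = sum(a * t for a, t in zip(coeffs_x, terms))
    y = sum(a * t for a, t in zip(coeffs_y, terms))
    return x, y
```

Note that this standard formulation still presupposes a per-user calibration, which is exactly the limitation discussed later in this section.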

Figure 15. Gaze direction can be calculated by comparing
the relative position and the relationship between the pupil (A) and
corneal reflection – the glint (C). Infra-red illumination of
the eye produces the 'bright pupil' effect (B) and makes the tracking
easier.

Figure 16. Several manufacturers produce portable
eye-tracking systems similar to the one depicted above. While the camera
position is most often below eye level (eyelids interfere with
tracking from above), the shape and position of infrared illuminators
vary from manufacturer to manufacturer.

A typical and portable eye-tracking system similar to
the ones commercially available is depicted in Figure 16. Since the
purpose of this paper is not to endorse any particular manufacturer,
I urge interested readers to consult the large Eye Movement Equipment
Database (EMED) available on the World Wide Web.
Keeping in mind that many museums and galleries have very modest budgets,
I will specifically address the issue of affordable eye-tracking
systems.

The price range of most commercially available eye-trackers
is between $5,000 and $60,000, often with additional costs for custom
software development, setup, etc. Although there are some exceptions,
the quality and the precision of the system tend to correlate with the
price. However, with the increasing speed of computer processors, greater
availability of cheap digital video cameras (like the ones used for
Web-based video conferencing) and, most importantly, the development
of sophisticated software for video signal analysis, it is becoming
possible to build eye-trackers within a price range comparable to that
of a new personal computer. Even though the cheaper systems have lower
spatial and temporal resolution when compared to the research equipment,
in a museum/gallery setting they may be used for different applications;
for example, for browsing a museum collection with additional information
provided by voice-overs. A more significant use would be providing access
to the museum content to visitors with special needs. An example of
a cost-effective solution based on a personal computer and a Web-cam
for eye-gaze assistive technology was recently described (Corno, Farinetti
and Signorile, 2002).

Most commercially available eye-tracking systems (including
the high-end ones) have two characteristics that make them less than
ideal for use in museums. These are:

the system has to be calibrated
for each individual user

even remote eye-trackers have
very low tolerance for head movements and require the viewer to hold
the head unnaturally still, or to use external support like head- or
chin-rests.

The solution lies in the development of software able
to perform eye-tracking data analysis in more natural viewing circumstances.
A recent report by Qiang and Zhiwei (2002) seems to be a step in the
right direction. Instead of using conventional approaches to gaze calibration,
they introduced a procedure based on neural networks that incorporates
natural head movements into gaze estimation and eliminates the need
for individual calibration.

The emergence of eye-tracking technologies based on a
personal computer equipped with a Web-cam and the development of software
that allows gaze tracking in natural circumstances open up a whole new
area for museum applications. The described technologies make Web-based
delivery of gaze-sensitive applications possible. This not only presents
an opportunity for a novel method of content delivery (and reaching
different groups of users with special needs) but also offers an incredible
possibility to collect, on a massive scale, data related to visual analysis
of museum artifacts. However, a word of caution is in order here. One
cannot overemphasize the importance of context in an eye-tracking
application (or, for that matter, in any application). In an appropriate
context, even a fairly simple setup can produce magical results, and
the use of the most expensive equipment can lead to viewer frustration
in a flawed application.

Conclusion

I have outlined a conceptual framework for the development
of a gaze-based interface for use in a museum context. The major component
of this interface is the introduction of gaze gestures
as a mechanism for performing intentional actions on observed objects.
In conjunction, an overview of suitable eye-tracking technologies was
presented, with an emphasis on low-cost solutions. The proposed mechanism
allows the development of novel and creative ways for content delivery
both in a museum setting and via the World Wide Web. An important benefit
of this approach is that it makes museum content (and not just
the building or the restrooms) accessible to a wide variety of populations
with special needs. It also offers the possibility of data-logging related
to visual observation on a massive scale. These records can be used
to further refine the content delivery mechanism and to advance our
understanding of both the psychological and the neurophysiological underpinnings
of our relationship with art.