Automatic hand gesture recognition using manifold learning

Bob de Graaf
Thesis submitted in partial fulﬁllment of the requirements for the degree of
Master of Science in Artiﬁcial Intelligence
Maastricht University
Faculty of Humanities and Sciences
December 3rd, 2008
Acknowledgments
This thesis started several months ago, and has been successfully ﬁnished
due to several people I would like to thank for their support.
First of all, I would like to thank my supervisor Eric Postma for his guidance
during this thesis, even during the times his schedule was not allowing any
free time. It was always a pleasure to go to our meetings, and I generally left
laughing and with many inspiring ideas for further research. I would
also like to thank Laurens van der Maaten for his input on the general ap-
proach of my thesis. He was always very quick to answer with challenging
questions. Many thanks also go to Ronald Westra, for his full guidance dur-
ing the bachelor thesis. He always encouraged me to search further than I
would normally have, and helped me on several parts in this master thesis.
I would also like to thank several friends of mine: Koen, Pieter, Rob, Niels,
Roy, Michiel, Francois and Willemijn, who helped and supported me during
the demanding times of my study and various other times. They ensured I
had the most enjoyable study experience over the last four years and I am
much indebted to them for that. Several of them I would also like to thank
for their help in creating and developing the dataset that was used in this
research.
In particular, I would like to thank my brothers and parents, who have always
supported me throughout my life and whose intelligent remarks and scientific
discussions always made every family member strive for excellence. I wish
my brothers good luck in their future scientiﬁc endeavours, being absolutely
sure they will succeed.
Abstract
Human-computer interaction is still limited today by an unnatural way
of communication, as users interact with their computer through an inter-
mediary system. The promising Perceptual User Interface strives to let
humans communicate with computers similarly to how they interact with
other humans, by including the implicit messages humans send using their
facial emotions and body language. Hand gestures are highly relevant in
communication through these non-verbal channels, and have therefore been
researched by several scientists over the last few decades. Currently, state-
of-the-art techniques are able to recognize hand gestures very well using
a vision-based system, analyzing the static frames to identify the diﬀerent
hand postures. However, evaluating only images limits their recognition
on several levels. Background objects, lighting conditions and the distance
of the hand in the frames aﬀect the recognition rate negatively. There-
fore, this thesis attempts to recognize hand gestures in videos by focusing
purely on the dynamics of gestures, by proposing a new technique called the
Gesture-Manifold method (GM-method). Considering only the motion of
hand gestures makes the approach largely invariant to distance, non-moving
background objects and lighting conditions.
A dataset of ﬁve diﬀerent gestures, generated by ﬁve diﬀerent persons,
was created through the use of a standard webcam. Focusing purely on
motion was realised by employing the non-linear dimensionality reduction
techniques Isometric Feature Mapping (Isomap) and t-Distributed Stochas-
tic Neighbor Embedding (t-SNE), to construct manifolds of videos. Man-
ifold alignment was enhanced by exploiting Fourier Descriptors and Pro-
crustes Analysis to solve rotation, translation, scaling and reﬂection of low-
dimensional mappings. Experiments demonstrated that t-SNE was unsuc-
cessful in recognizing gestures due to the non-convexity of its cost function.
However, by combining Isomap and Fourier descriptors, the GM-method
successfully recognizes the dynamics of hand gestures in videos while over-
coming the limitations of techniques that focus on frame analysis.
Contents

Contents
List of figures

1 Introduction
1.1 The challenge of human-computer interaction
1.2 Hand gestures for human-computer interaction
1.3 Previous work
1.4 The Gesture-Manifold method
1.5 Problem statement and research questions
1.6 Outline of this thesis

2 The Gesture-Manifold method
2.1 Preprocessing
2.2 Manifold learning
2.2.1 Isometric Feature Mapping
2.2.2 t-Distributed Stochastic Neighbor Embedding
2.2.3 Procrustes Analysis
2.2.4 Elliptic Fourier Descriptors
2.3 Classification

3 Methodology
3.1 Creation of the dataset
3.2 Preprocessing
3.2.1 Raw input
3.2.2 Binary difference-frames
3.2.3 Change-dependent difference-frames
3.2.4 Extracting skin color
3.3 Manifold learning
3.3.1 Isomap
3.3.2 t-SNE
3.3.3 Procrustes analysis
3.3.4 Elliptic Fourier Descriptors
3.4 Evaluation criteria

4 Experimental results
4.1 Classification results
4.2 Incorrectly classified gestures
4.3 Discussion

5 Conclusions and future research
5.1 Conclusions
5.2 Future research

Bibliography
List of figures

1.1 The three steps of the GM-method
2.1 "Isomap correctly detects the dimensionality and separates out the true underlying factors" [20].
2.2 "The original isomap algorithm gives a qualitative organization of images of gestures into axes of wrist rotation and finger extension" [15].
2.3 Plots of four techniques (t-SNE, Sammon Mapping, Isomap and LLE) clustering and visualizing a set of 6,000 handwritten digits [10].
2.4 The left plot shows two datasets, one depicted by red squares and one by blue circles. The right plot shows an additional dataset, depicted by black x's, representing the blue dataset after applying Procrustes Analysis.
3.1 Two frames of each of the gestures, in descending order: 'click', 'cut', 'grab', 'paste' and 'move'
3.2 Preprocessing a frame: graying and subsequently smoothing the image
3.3 Two plots of the binary difference-frames of the gesture 'move'
3.4 Two plots of change-dependent difference-frames of the gesture 'grab'
3.5 Two plots of skin color frames of the gesture 'cut'
3.6 Two manifolds of the gesture 'cut'
3.7 Two manifolds of the gesture 'click'
3.8 Two additional manifolds of the gesture 'cut'
3.9 The two manifolds of Figure 3.8 flipped vertically
3.10 Two low-dimensional mappings of the same video of the gesture 'click', created by t-SNE
4.1 Classification percentages using raw frames as input for Isomap, with four approaches: raw Isomap (red, square), binary difference-frames (blue, circle), change-dependent difference-frames (green, x) and skin color frames (black, triangle). In the left plot, the k-number of neighbors of the classification method ranges from 3 to 15; in the right plot, the k-number of neighbors used by Isomap ranges from 10 to 25.
4.2 Classification percentages using Fourier descriptors as input for Isomap, with four approaches: raw Isomap (red, square), binary difference-frames (blue, circle), change-dependent difference-frames (green, x) and skin color frames (black, triangle). In the left plot, the k-number of neighbors of the classification method ranges from 3 to 15; in the right plot, the k-number of neighbors used by Isomap ranges from 10 to 25.
4.3 Classification percentages using Procrustes analysis as input for Isomap, with four approaches: raw Isomap (red, square), binary difference-frames (blue, circle), change-dependent difference-frames (green, x) and skin color frames (black, triangle). In the left plot, the k-number of neighbors of the classification method ranges from 3 to 15; in the right plot, the k-number of neighbors used by Isomap ranges from 10 to 25.
4.4 Classification percentages of t-SNE while ranging the k-number of neighbors of the classification method, with the input varying over raw frames (left plot), Fourier descriptors (right plot) and Procrustes analysis (bottom plot), applied to t-SNE with four approaches: raw t-SNE (red, square), binary difference-frames (blue, circle), change-dependent difference-frames (green, x) and skin color frames (black, triangle).
Chapter 1
Introduction
The best way to predict the future is to invent it.
Alan Kay
This chapter elucidates the advantages of intelligent human-computer inter-
action, recognizing hand gestures and related work. It is argued why it is
necessary for human-computer interaction to improve, and how recognizing
hand gestures can support its development. These matters are discussed in
Sections 1.1 through 1.3. A brief introduction to the proposed Gesture-
Manifold technique is subsequently presented in Section 1.4, whereas
Section 1.5 provides the problem statement and accompanying research
questions. Lastly, Section 1.6 provides an outline of this thesis.
1.1 The challenge of human-computer interaction
Thus far, human-computer interaction has not fundamentally changed for
nearly two decades. The WIMP (windows, icons, menus, pointers) paradigm,
together with the mouse and keyboard, has determined nearly the entire
way people use computers up till now. Users know exactly which actions
and commands are possible and which results they will yield. Although
human hands are capable of the most intricate tasks, they are used solely
for positioning and clicking the mouse or pressing keys. Compared
to communication between humans, this is a rather unnatural and limiting
way of interaction. Additionally, it forces the user to repeat the same
movement continuously, causing many people to develop a Repetitive Strain
Injury (RSI).
As computers become increasingly important in everyday life, it is highly
desirable that humans be able to communicate with computers the way they
communicate with other humans [18]. Improving human-computer interaction
allows the user to communicate more naturally and work more eﬃciently
with the computer. One of the most relevant concepts of human-computer
interaction is ‘direct manipulation’ [21]. This implies that users communi-
cate directly with their objects of interest, instead of interacting through an
intermediary system. Although there have been several achievements in the
‘direct manipulation’ area of intelligent human-computer interaction, mainly
with respect to speech recognition and touch screens, most people still
interact with computers via keyboards and pointing devices. Consequently,
an increasing number of researchers in various areas of
computer science are developing technologies to add perceptual capabilities
to the human-computer interface. This promising interface is presented as
the Perceptual User Interface (PUI) [14], which deals with extending human-
computer interaction to use all modalities of human perception. When com-
pleted, this perceptual interface is likely to be the next major paradigm in
human-computer interaction. The most promising approach is real-time
hand gesture recognition through vision-based interfaces [14].
1.2 Hand gestures for human-computer interac-
tion
When humans communicate with each other, several non-verbal channels
are utilized to a large extent. These channels include facial expressions,
body language and hand gestures. They aid people in putting an extra em-
phasis on their emotions, feelings or viewpoints in an eﬃcient way, which
subsequently increases the chance of comprehension from the receiving end.
Hand gestures are universally used and are a crucial part of everyday
conversation, such as chatting, giving directions or having discussions. The
human hand is able to assume an incredible number of clearly discernible
configurations, which is the main reason why sign language was developed.
This potential of the human hands is thus far not exploited in combination
with computers, although it is apparent that being able to recognize hand
gestures would significantly improve human-computer interaction. Addi-
tionally, a gesture recognition system could aid deaf people who use
American Sign Language (ASL). A well-functioning system could help them
converse with non-signing people without the need for an interpreter,
which increases their independence. Furthermore, the system could aid peo-
ple solely relying on sign language to communicate remotely with other
people.
1.3 Previous work
The complexity of recognizing hand gestures from videos is considerable.
An exceedingly large amount of data must be analyzed and processed, and
great computational power is required. Therefore, most past attempts at
recognizing hand gestures have used devices, such
as instrumented gloves, to incorporate gestures into the interface [14]. For
example, the VPL Dataglove designed by Zimmerman [23] was the most
successful glove before 1990. This glove used two optical fibre sensors along
the back of each finger, so that flexing a finger would bend the fibres,
after which the light they transmitted could be measured. A processor
received this analog signal and was capable of computing the joint angles,
based on calibrations for each user. Special software was included such that
users could invent their own conﬁguration of joints and map it to their choice
of commands.
However, using gloves for gesture recognition has too many dis-
advantages. For instance, plugging in the necessary equipment and putting
gloves on and off takes time, and the accuracy of
a glove may change with every hand, as human hands come in
many different shapes and sizes. Another important disadvantage is that a
glove severely limits the user’s range of motion, which is simply unnatural.
Finally, glove-based gestural interfaces often force the user to carry cables
which connect the device to a computer, which obstructs the ease and nat-
uralness with which the user normally interacts using hand gestures [13].
Therefore, researchers started developing vision-based systems to iden-
tify gestures and hand poses without the restrictions of gloves, using video
cameras and computer vision techniques to interpret the dynamic/static
data. Note that hand poses are quite different from actual gestures [8]. A
hand pose is a static configuration, such as a fist in a certain posi-
tion or an extended finger. A gesture is a truly dynamic movement, with
a starting point and an ending point and a clearly discernible difference between
them, such as waving goodbye or applauding. Very complex gestures include
finger movement, wrist movement and changes in the hand's position and
orientation. These kinds of gestures are heavily employed in ASL.
Thus, several techniques strove to identify hand postures, whereas
other methods attempted to recognize the dynamic gestures. Recognizing
gestures using contour signatures of the hand in combination with Robust
Principal Component Analysis (RPCA) is very successful [14]. In [9] and [19]
gestures are assumed to be ‘doubly stochastic’ processes, which means they
are Markov processes whose internal states are not directly observable. Con-
sequently, in [9] Hidden Markov Models (HMM) were applied and it was
possible to recognize up to 14 diﬀerent gestures after showing only one or
two examples of each gesture. Another approach in [11] relies on an active
stereo sensor, using a structured light approach to obtain 3D information.
As recognizing gestures evidently is a pattern recognition problem, Neu-
ral Networks (NN) were successfully applied in [17] as well. With these
techniques, recognition rates for distinct hand gestures range from roughly
60% to 85% [3].
However, the majority of these techniques have one focus in common:
the recognition of static frames. Though they are successfully
able to recognize hand and/or ﬁnger positions in videos, they solely analyze
and process the static frames. The dynamics of hand gestures were largely
disregarded and the focus remained on image analysis [13]. However, ges-
tures are dynamic movements and the motion of hands may convey
even more meaning than their posture. Using static frames severely restricts
the user's background, as other objects present in the frames can reduce
the accuracy of identifying the hands. Another disadvantage is that different
lighting conditions may affect recognition results negatively as well. Ad-
ditionally, several gestures may contain the same hand posture at a certain
timestep, causing these techniques to correctly identify the hand posture
but recognize the wrong gesture. The distance of the hand in the frames is
rather important for analyzing static frames as well. If the hand is too far
away in the frame, recognition becomes more complex. Motion, on the other
hand, is to a certain extent invariant to distance, as the motion of a gesture
remains the same however far away it occurs.
Thus, more focus is necessary on the pure motion of the gestures, which
has thus far not been exploited to its full potential. Recently, a similar approach to
this study was presented in [3], where Locally Linear Embedding was applied to
recognize the dynamics of hand gestures with up to 93.2% accuracy, although
their gesture set consisted only of gestures with finger extensions. Thus, the
novelty of this study lies in recognizing hand gestures based purely on their dy-
namics, by proposing a new technique called the Gesture-Manifold
method, which is briefly explained in the following section.
1.4 The Gesture-Manifold method
This study proposes a new technique, called the Gesture-Manifold method
(GM-method), to recognize hand gestures in videos. The GM-method con-
tains three main steps, which are displayed in Figure 1.1.
Figure 1.1: The three steps of the GM-method
In preprocessing, the goal is to reduce background noise and obtain the
relevant regions of interest. Therefore, four diﬀerent approaches have been
applied for comparison. These approaches are: raw input, binary difference-
frames, change-dependent difference-frames and skin color frames, of which
explanations are given in detail in Chapter 3. Similarly, two diﬀerent non-
linear dimensionality reduction techniques, t-Distributed Stochastic Neigh-
bor Embedding (t-SNE) and Isometric Feature Mapping (Isomap), have
been implemented for manifold learning. These techniques are capable of
creating manifolds of videos, which represent the trajectories of frames in the
image space. Hence, these manifolds are used to represent gestures. Explanations
on these non-linear dimensionality reduction techniques are provided
in Chapter 2. Additionally, two diﬀerent dataset-matching methods, Pro-
crustes Analysis and Fourier Descriptors, are applied for manifold matching
purposes. These methods are capable of eliminating the scaling, transla-
tional and rotational components of datasets, thus increasing the eﬃciency
of manifold alignment. Background theories of these methods are provided
in Chapter 2 as well. Finally, the GM-method uses a basic k-nearest neigh-
bor classiﬁcation method in the last phase.
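The classification phase can be sketched as a plain majority-vote k-nearest-neighbor classifier. This is a generic illustration, not code from the thesis: the Euclidean metric and the toy data in the test below are assumptions, and in the GM-method the features would be the aligned low-dimensional manifolds.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=3):
    """Return the majority label among the k training points closest to x."""
    dists = np.linalg.norm(train_X - x, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```

In the GM-method, `train_X` would hold the (flattened) aligned manifolds of the training videos and `train_y` their gesture labels.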
1.5 Problem statement and research questions
Using the GM-method, this study strives to recognize hand gestures in
videos by focusing on the motion of the gestures. In preprocessing, four dif-
ferent approaches are applied for comparison, and for manifold learning, two
different non-linear dimensionality reduction techniques are implemented.
Additionally, two diﬀerent dataset-matching methods are applied for im-
proved manifold alignment. Consequently, this leads to the following prob-
lem statement and accompanying research questions:
To what extent is it possible to recognize hand gestures eﬀectively using the
GM-method?
• Which preprocessing approach (raw input, binary difference-frames,
change-dependent difference-frames or skin color frames) is most effec-
tive in eliminating background noise and obtaining regions of interest,
thus improving the construction of clearly discernible manifolds?
• Which non-linear dimensionality reduction technique, t-SNE or Isomap,
is more effective in creating quality manifolds of separate videos?
• Which dataset matching method, Procrustes Analysis or Fourier De-
scriptors, is more effective in aligning manifolds for improved recogni-
tion rates?
1.6 Outline of this thesis
The remainder of this thesis is structured as follows.
Chapter 2 summarizes the theoretical background of the techniques that
were applied throughout this thesis. Special emphasis is put on Isomap
and t-SNE, to aid comprehension of the subsequent chapters.
Chapter 3 explains the general approach of the GM-method. A con-
cise explanation of the dataset is provided, in addition to figures of
certain hand gestures and their manifolds. The final section provides
the evaluation criteria for the GM-method.
Chapter 4 presents the experiments performed during this thesis, along with
statistical information regarding the results. The last section provides a
discussion of the applied methods and techniques.
Chapter 5 oﬀers further recommendations and concludes this thesis.
Chapter 2
The Gesture-Manifold
method
This chapter provides more detailed information on the background theory of
methods applied in the three main steps of the GM-method. Section 2.1
explains the preprocessing stage, whereas Section 2.2 provides details
on the non-linear dimensionality reduction techniques Isomap and t-SNE, in
addition to the dataset matching methods Procrustes Analysis and Fourier
descriptors. Finally, Section 2.3 provides a short explanation of the k-
nearest neighbor classifier applied in the classification stage.
2.1 Preprocessing
Clearly, it is not possible to feed whole videos directly into Isomap,
as memory limitations would not allow processing such incredibly high-
dimensional data. Firstly, it was necessary to read in the frames of each
video and subsequently apply the appropriate preprocessing procedures.
As color in the video is not highly relevant, since the focus is primarily on
motion, graying each frame of the video was a sensible choice. Graying
these images would reduce the high-dimensional data signiﬁcantly, as the
gray version of a colored image is only one third of the data. Subsequently,
the grayed frames were normalized and smoothed, as smoothing the frames
reduces the variance between slight diﬀerences of similar images [1].
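A minimal sketch of this graying, normalization and smoothing step, assuming NumPy and SciPy; the 3x3 averaging filter is an assumption, as the thesis does not name its exact smoothing kernel.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def preprocess_frame(rgb_frame):
    """rgb_frame: (H, W, 3) uint8 array -> smoothed grayscale image in [0, 1]."""
    gray = rgb_frame.mean(axis=2)        # graying: one third of the data
    gray = gray / 255.0                  # normalize to [0, 1]
    return uniform_filter(gray, size=3)  # smoothing reduces small variations
```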
Four diﬀerent approaches in the preprocessing stage have been invented
during the development of the GM-method. Details on these approaches
are provided below.
1. Raw input
This ﬁrst approach is the most basic, as it solely involves graying and
smoothing the frames of the videos, and no additional preprocessing is performed.
2. Binary difference-frames
This approach focuses on the motion of the hand in the frames, by con-
structing binary diﬀerence-frames. After graying and smoothing the orig-
inal frames, these binary diﬀerence-frames are created by computing dif-
ferences between subsequent frames. Using certain thresholds, pixels with
suﬃcient change between two subsequent frames will obtain a value of 0
(black) whereas pixels with insuﬃcient change obtain a value of 1 (white).
Consequently, binary diﬀerence-frames, having pixels with values of either
0 or 1, were constructed for each video.
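The construction above can be sketched as follows; the threshold value is an assumption, while the 0-for-change / 1-for-no-change convention follows the text.

```python
import numpy as np

def binary_difference_frames(frames, threshold=0.1):
    """frames: list of normalized grayscale arrays from one video.
    Pixels with sufficient change between consecutive frames become 0
    (black); pixels with insufficient change become 1 (white)."""
    diffs = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        changed = np.abs(curr - prev) > threshold
        diffs.append(np.where(changed, 0.0, 1.0))
    return diffs
```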
3. Change-dependent difference-frames
This approach slightly enhances the previous binary diﬀerence-frames ap-
proach. It involves the same preprocessing procedures with the exception
that instead of giving pixels a value of either 0 or 1, it determines their value
by evaluating their rate of change. The larger the difference for a pixel,
the lower (darker) the value it obtains. In other words, a pixel that changes
much between two subsequent frames is a relevant pixel, and it is therefore
rendered more prominently in the dark range.
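A sketch of this variant, assuming a simple linear mapping from rate of change to gray value (large changes become dark, small changes light); the `scale` parameter is an illustrative assumption.

```python
import numpy as np

def change_dependent_frames(frames, scale=1.0):
    """Like binary difference-frames, but graded: a pixel's value reflects
    its rate of change between consecutive frames. Large changes map to
    values near 0 (dark), small changes to values near 1 (light)."""
    diffs = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        change = np.clip(np.abs(curr - prev) * scale, 0.0, 1.0)
        diffs.append(1.0 - change)
    return diffs
```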
4. Skin color frames
Human skin has a distinctive color, which is often exploited when
attempting to identify body parts in images. Therefore, this approach
uses skin color to obtain purely the hand features in the frames. Instead
of graying the frames, the red channel of the RGB color space was
used to obtain only the pixels with the relevant skin color. A value between
0 and 1 was assigned to each pixel, similar to the previous approach. Applying
this procedure to all frames, new skin color frames were constructed for each
video.
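A sketch of the red-channel extraction; the skin range `[lo, hi]` is an illustrative assumption, since the thesis does not give its exact bounds.

```python
import numpy as np

def skin_color_frames(rgb_frames, lo=95, hi=255):
    """Keep only pixels whose red channel falls inside an assumed skin
    range; inside the range, the red intensity is mapped to (0, 1],
    all other pixels become 0."""
    out = []
    for frame in rgb_frames:
        red = frame[:, :, 0].astype(float)
        mask = (red >= lo) & (red <= hi)
        out.append(np.where(mask, red / 255.0, 0.0))
    return out
```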
These approaches are further explained in detail in Chapter 3, including
illustrations of the resulting frames.
2.2 Manifold learning
Computers are becoming increasingly important in our daily lives,
supported by a near-exponential yearly increase in computation speed
and memory capacity. These enhancements open up new av-
enues of research, especially in image and video analysis, enabling scientists
to deal with large high-dimensional data sets that were previously
impossible to analyze within a lifetime. Therefore, they are frequently con-
fronted with the problem of dimensionality reduction: to find meaningful
low-dimensional structures hidden in the high-dimensional data. Princi-
pal Component Analysis (PCA) and Multidimensional Scaling (MDS) are
examples of classical techniques for dimensionality reduction. These tech-
niques are easily implemented and guaranteed to discern the true structure
of data lying on or near a linear subspace of the high-dimensional input
space. MDS obtains an embedding which preserves the inter-point distances,
whereas PCA discovers the low-dimensional embedding of the data points
which preserves their variance as measured in the high-dimensional input
space. However, these linear techniques focus on keeping the low-dimensional
representations of dissimilar data points far apart, whereas for many high-
dimensional datasets it is more important that the low-dimensional
representations of similar data points stay close together, which is generally
impossible with a linear mapping [10].
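PCA as characterized above, a variance-preserving linear projection, can be sketched in a few lines of NumPy. This is a generic illustration, not code from the thesis.

```python
import numpy as np

def pca(X, d=2):
    """Project the (n, m) data X onto its d directions of maximal variance."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = Xc.T @ Xc / (len(X) - 1)             # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)
    top = vecs[:, np.argsort(vals)[::-1][:d]]  # top-d eigenvectors
    return Xc @ top
```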
Thus, these approaches are not capable of discovering the essential non-
linear structures that occur in data of complex natural observations [20],
such as human handwriting or, in this thesis, videos of hand gestures. Con-
sequently, several non-linear dimensionality reduction techniques were de-
veloped to handle the non-linear degrees of freedom that underlie
high-dimensional datasets. Locally Linear Embedding (LLE) [16], Isometric
feature mapping (Isomap) [20], and Stochastic Neighbor Embedding (SNE)
[4] are well-known examples of these non-linear dimensionality reduction
techniques. According to [1], Isomap is superior to LLE in preserving more
global relationships between data points. [10] provides an alternative to SNE,
called t-Distributed Stochastic Neighbor Embedding (t-SNE), which is able to
outperform existing state-of-the-art techniques for data visualization and
dimensionality reduction. Consequently, this study applies Isomap
and t-SNE to discover and visualize the non-linear nature of videos of hand
gestures. Subsections 2.2.1 and 2.2.2 provide the theoretical
background of Isomap and t-SNE, respectively.
These non-linear dimensionality reduction techniques involve processes
that are invariant to scale, translation and rotation. Consequently, the
constructed manifolds may be essentially similar yet appear dissimilar when
visualized. Therefore, two dataset matching methods, Procrustes Analysis
visualized. Therefore, two dataset matching methods, Procrustes Analysis
and Fourier descriptors, are implemented in the manifold learning phase
to improve manifold alignment. Subsections 2.2.3 and 2.2.4 respectively
explain the theoretical background of these methods.
2.2.1 Isometric Feature Mapping
In image processing, dimensionality reduction techniques strive to represent
each image as a point in the low-dimensional space. For videos, this means
the set of frames is represented as a set of points, which together define
the image space of the video. Isometric feature mapping (Isomap) considers
a video sequence as a collection of unordered images which deﬁne an image
space, and a trajectory through that image space is deﬁned by an ordering
of those images [15], which is typically called a manifold. Thus, for every
ordering of the set of images, Isomap is able to create a diﬀerent manifold.
This concept is quite relevant in this study, which Chapter 3 will clarify in
detail.
Isomap was developed by J. B. Tenenbaum, V. de Silva and J. C. Langford
at Stanford in 2000. They published their new method and its results in [20],
and the following explanation of Isomap therefore references several
functions and figures from their article. The full Isomap algorithm
consists of three steps: construct a neighborhood graph, compute the short-
est paths, and use multidimensional scaling to visualize the low-dimensional
mapping. The details of these three steps are now explained separately.
Constructing a neighborhood graph
Firstly, Isomap creates a weighted graph G of the neighborhood relations,
based on the distances dX (i, j) between pairs of data points i, j in the in-
put space X. These distances can either be determined by computing the
distances of each point to its k nearest neighbors, or the distances of each
point to all other points within a fixed radius ε. Consequently, the graph G
has edges of weight dX(i, j) between neighboring points.
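The k-nearest-neighbor variant of this first step can be sketched as follows; a sparse representation would scale better, but a dense matrix keeps the idea visible. This is a generic sketch, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def neighborhood_graph(X, k=5):
    """X: (n, m) data matrix. Returns an (n, n) matrix holding d_X(i, j)
    on k-nearest-neighbor edges and np.inf where points are not neighbors."""
    d = cdist(X, X)                        # all pairwise Euclidean distances
    G = np.full_like(d, np.inf)
    for i in range(len(X)):
        nbrs = np.argsort(d[i])[1:k + 1]   # k nearest, skipping the point itself
        G[i, nbrs] = d[i, nbrs]
        G[nbrs, i] = d[nbrs, i]            # symmetrize the graph
    np.fill_diagonal(G, 0.0)
    return G
```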
Compute shortest paths
In this step, Isomap estimates the geodesic distances dM(i, j) between all
pairs of points on the manifold M by computing the shortest-path distances
dG(i, j) in the graph G. Generally, Dijkstra's algorithm [2] is applied as the
shortest-path algorithm.
Multidimensional scaling
After the shortest paths are computed, the last step applies MDS
to the matrix of graph distances DG = {dG(i, j)}. MDS constructs an
embedding of the data in a d-dimensional Euclidean space Y that optimally
maintains the manifold's intrinsic geometry. Coordinate vectors yi of the
points in Y are determined to minimize the cost function
E = \|\tau(D_G) - \tau(D_Y)\|_{L^2},   (2.1)
where D_Y signifies the matrix of Euclidean distances d_Y(i, j) = \|y_i - y_j\|
and \|A\|_{L^2} denotes the L^2 matrix norm \sqrt{\sum_{i,j} A_{ij}^2}. The
\tau operator ensures efficient optimization by converting distances to inner
products, which uniquely characterize the geometry of the data. To achieve the global
minimum of Eq. 2.1 it is necessary to set the coordinates yi to the top d
eigenvectors of the matrix τ (DG ). As the dimensionality of Y increases, the
decrease in error will show the true dimensionality of the data.
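The three steps above can be sketched in a minimal NumPy/SciPy implementation. This is an illustrative toy version, not the authors' code; the helper name `isomap`, the toy dataset and all parameter values are assumptions for the example only.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, n_neighbors=5, d=2):
    """Minimal Isomap: neighborhood graph, geodesic distances, classical MDS."""
    D = squareform(pdist(X))                      # pairwise Euclidean distances
    n = len(X)
    # 1. k-nearest-neighbor graph: keep only edges to the k nearest points
    G = np.full((n, n), np.inf)
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, idx[i]] = D[i, idx[i]]
    G = np.minimum(G, G.T)                        # symmetrize the graph
    # 2. shortest paths in G approximate geodesic distances on the manifold
    DG = shortest_path(G, method='D')             # Dijkstra
    # 3. classical MDS on the geodesic distance matrix
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * H @ (DG ** 2) @ H                  # tau: distances -> inner products
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:d]               # top-d eigenvectors
    return V[:, order] * np.sqrt(np.maximum(w[order], 0))

# toy example: points on a noisy curve embedded in 3-D
rng = np.random.default_rng(0)
t = np.linspace(0, 3, 60)
X = np.c_[np.cos(t), np.sin(t), t] + rng.normal(scale=0.01, size=(60, 3))
Y = isomap(X, n_neighbors=6, d=2)
print(Y.shape)  # (60, 2)
```

Note that this sketch assumes the neighborhood graph is connected; a disconnected graph would leave infinite entries in the geodesic distance matrix.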
Two examples are shown below, to give a general idea on how Isomap rep-
resents high-dimensional data of images as points in the low-dimensional
space. Figure 2.1 presents Isomap applied to a set of synthetic face images
having three degrees of freedom. Figure 2.2 shows the result of applying
Isomap to a set of noisy real images of a human hand, which vary in wrist
rotation and finger extension.
Figure 2.1: “Isomap correctly detects the dimensionality and separates
out the true underlying factors” [20].
In these ﬁgures, each data point represents one image. To show how the im-
age space is mapped according to the angle/axes, depending on the dataset,
several original images are plotted in the ﬁgure itself next to the data point
by which it is represented. With the aid of these additional images, it is
quite obvious that Isomap captures the data’s perceptually relevant struc-
ture.
As the number of datapoints increases, the graph distances d_G(i, j) provide
progressively more accurate estimates of the intrinsic geodesic distances
d_M(i, j). Several properties of the manifold, such as branch separation and
radius of curvature, in addition to the density of the points, determine how
Figure 2.2: “The original isomap algorithm gives a qualitative organization
of images of gestures into axes of wrist rotation and ﬁnger extension” [15].
quickly d_G(i, j) converges to d_M(i, j). This convergence proof guarantees
that Isomap asymptotically recovers the true dimensionality and intrinsic
geometry of a large class of non-linear manifolds, even when these manifolds
are highly folded or twisted in the high-dimensional space. For non-Euclidean
manifolds, Isomap is still able to provide a globally optimal Euclidean
representation in the low-dimensional space.
Though there have been prior attempts to extend PCA and MDS to analyze
non-linear datasets, Isomap was the first method to overcome their major
limitations. Local linear techniques [16] were unable to represent high-
dimensional datasets with a single coordinate system, such as those shown
in Figures 2.1 and 2.2. Other techniques that are based on greedy
optimization procedures lack the advantages Isomap inherits from PCA and
MDS, which are: a non-iterative polynomial-time procedure with a guarantee
of global optimality, asymptotic convergence to the true structure of
Euclidean manifolds, and the ability to deal with any dimensionality rather
than a fixed one.
2.2.2 t-Distributed Stochastic Neighbor Embedding
For visualizing high-dimensional data, several techniques have been devel-
oped in the last few decades. For example, Chernoﬀ-faces [12] provides
iconographic displays, relating data to facial features in order to improve
data digestion, whereas other methods attempt to represent data dimensions
as vertices in graphs [10]. However, the majority of these techniques merely
provide tools to visualize the data at a lower-dimensional level and lack
any analytical capability. Thus, these techniques may be useful on a small
class of datasets, but are largely not applicable to the large class of
real-world datasets that contain thousands of high-dimensional datapoints.
Therefore, several dimensionality reduction techniques have been developed, as
described in the introduction of this chapter. These techniques are highly
successful in reducing the dimensionality while preserving the local struc-
ture of the data, but often lack the capability to visualize their result in
a comprehensible manner. Consequently, a technique which could capture
the local structure of high-dimensional data successfully in addition to an
intelligent visualizing capability was yet to be developed. [10] claims to have
developed such a technique, building on the original Stochastic Neighbor
Embedding (SNE) [4]. In [10], the new technique t-Distributed Stochastic
Neighbor Embedding (t-SNE) is tested against seven other state-of-the-art
non-linear dimensionality reduction techniques, including Isomap, where t-
SNE clearly outperforms each of them. This technique will now brieﬂy be
explained, starting with the original technique SNE, followed by the exten-
sion to t-SNE and ending with conclusions. The equations that are presented
in the remainder of this Subsection are largely based on [10].
Stochastic Neighbor Embedding
The algorithm starts by computing the asymmetric conditional probability
p_{j|i} to model the similarity between datapoints x_i and x_j. This
probability represents the likelihood that point x_i would select point x_j as
its neighbor, under the condition that neighbors are picked in proportion
to their probability density under a Gaussian centered at xi . Thus, for
datapoints far apart pj|i will be small, whereas it will be large for nearby
datapoints. The probability pj|i is mathematically computed by
p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)},   (2.2)
where \sigma_i is the standard deviation of the Gaussian centered at x_i and
k is the effective number of neighbors, generally called the 'perplexity'.
The value of \sigma_i can either be set by hand or found through a binary
search for the value that makes the entropy of the distribution over the
neighbors equal to \log k. As
the density of data varies, an optimal value of σi is unlikely to exist, causing
the binary search to be the best way to obtain the value of σi . For the
low-dimensional datapoints yi and yj which represent the high-dimensional
datapoints xi and xj , a similar conditional probability, qj|i , is computed.
The equation to compute q_{j|i} is similar to Eq. 2.2, except that \sigma_i
is fixed at a value of \frac{1}{\sqrt{2}}. Thus, q_{j|i} is mathematically
given by

q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}.   (2.3)
Clearly, a perfect low-dimensional representation would guarantee that pj|i
and qj|i have the same value for all datapoints. Consequently, SNE strives
to minimize the divergence between these values through the use of a cost
function. The Kullback-Leibler divergence is a measure generally used in
such a case. Therefore, the resulting cost function C is given by
C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}},   (2.4)
where P_i stands for the conditional probability distribution over all other
datapoints given datapoint x_i, whereas Q_i represents the conditional
probability distribution over all other map points given map point y_i.
This cost function ensures that nearby datapoints
stay nearby and widely separated data points stay far apart, thus preserving
the local structure of the data.
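Eq. 2.2 translates almost directly into code. The sketch below uses a fixed bandwidth for all points; the per-point binary search for \sigma_i is omitted, and the helper name `conditional_p` is my own.

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """Conditional probabilities p_{j|i} (Eq. 2.2) with one fixed Gaussian
    bandwidth sigma for all points."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    P = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)               # a point never picks itself
    return P / P.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = conditional_p(X)
print(P[0])   # point 0 is far more likely to pick point 1 than point 2
```

Each row of `P` sums to one, as required of a conditional distribution.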
To minimize the cost function of Eq. 2.4, a gradient descent method is
utilized, given by
\frac{\delta C}{\delta y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j).   (2.5)
This equation shows that yi will either be pulled towards or pushed away
from yj , depending essentially on how often j is perceived to be a neighbor.
The gradient descent involves two additional procedures. The first is adding
random Gaussian noise to the map points after each iteration. Decreasing
this amount of noise over time helps the optimization find better local
optima. SNE commonly obtains maps with a better global organization when
the variance of the noise changes very slowly at the critical point
where the global structure of the map starts to form. The second procedure
involves adding a relatively large momentum to the gradient. Thus, at each
iteration of the gradient search, the changes in the coordinates of the map
points are determined by adding the current gradient to an exponentially
decaying sum of earlier gradients. This procedure helps speed up the
optimization and escaping poor local minima. However, these two proce-
dures bring along certain risks. For example, how to determine the amount
of noise and the rate at which it decreases is quite complicated. In addition,
these two values aﬀect the amount of momentum and the step size involved
in the gradient descent and vice versa. Consequently, it is not unusual to
run the optimization several times to discover the proper values of these
parameters.
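The update rule described above can be sketched as follows. This is illustrative only: the learning rate, momentum and noise scale are arbitrary example values, not the settings used in [4], and the helper name is my own.

```python
import numpy as np

def sne_gradient_step(Y, P, prev_update, lr=10.0, momentum=0.8,
                      noise_std=0.05, rng=None):
    """One SNE gradient-descent step (Eq. 2.5), with the momentum term and
    Gaussian jitter described above. P holds the high-dimensional conditional
    probabilities p_{j|i}; Y holds the current map points."""
    rng = rng or np.random.default_rng(0)
    # low-dimensional conditional probabilities q_{j|i} (Eq. 2.3)
    D2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=2)
    Q = np.exp(-D2)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum(axis=1, keepdims=True)
    # gradient of Eq. 2.5: symmetrized attraction/repulsion coefficients
    coeff = (P - Q) + (P - Q).T
    grad = 2.0 * np.sum(coeff[:, :, None] * (Y[:, None] - Y[None, :]), axis=1)
    update = momentum * prev_update - lr * grad    # exponentially decaying sum
    Y_new = Y + update + rng.normal(scale=noise_std, size=Y.shape)
    return Y_new, update

rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 2))
P = np.full((5, 5), 1.0 / 4); np.fill_diagonal(P, 0.0)   # uniform toy P
Y, upd = sne_gradient_step(Y, P, prev_update=np.zeros_like(Y))
print(Y.shape)  # (5, 2)
```

In a full run, `noise_std` would be annealed toward zero over the iterations, as the text describes.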
t-Distributed Stochastic Neighbor Embedding
This algorithm differs from SNE in several ways. Firstly, t-SNE uses a
symmetrized version of the cost function. Secondly, where SNE uses a Gaussian
distribution to compute similarities between points in the low-dimensional
space, t-SNE employs a Student-t distribution. These variations will now
be explained in turn.
Symmetry
SNE computes the conditional probabilities p_{j|i} and q_{j|i} in an
asymmetric manner. Computing these in a symmetric way implies that
p_{ij} = p_{ji} and q_{ij} = q_{ji}. This can be achieved by minimizing a
single Kullback-Leibler divergence between the joint probabilities p_{ij}
and q_{ij}, rather than the sum of divergences between the conditional
probabilities. Subsequently, the equations involved in this process are
p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)},   (2.6)

q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)},   (2.7)
where p_{ij} = p_{ji} and q_{ij} = q_{ji} for all points i and j. The cost
function C for this symmetric SNE is then given by
C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}.   (2.8)
The main advantage of this symmetrized version of SNE is the simpler form
of the gradient, which decreases the overall computation time. This
gradient is given by
\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j).   (2.9)
Student-t distribution
In various datasets, visualizing the data at a low-dimensional level brings
along a certain 'crowding problem' [10], which occurs not only when applying
SNE but also when using other techniques for multidimensional scaling.
The crowding problem refers to the fact that the area of the two-dimensional
map that can accommodate the moderately distant datapoints is not nearly
large enough to contain all the nearby datapoints. Thus, to map the small
distances truthfully, most of the many points at a moderate distance from
datapoint i must be positioned too far away in
the map. As a consequence, the connections from datapoint i to each of
these moderately distant datapoints will obtain a small attraction. Though
these attraction values are rather small, their sheer number causes the
points to be squeezed together in the centre of the map, leaving no space
for the gaps that usually form between the natural
clusters. In [5] a solution concerning a slight repulsion was presented. This
repulsion involved producing a uniform background having a small mixing
proportion \rho. Thus, q_{ij} could never fall below \frac{\rho}{n(n-1)},
regardless of how far
away two datapoints are. This method, called UNI-SNE, generally outperforms
SNE, but brings along a tedious optimization process for its cost function.
Directly optimizing the cost function of UNI-SNE is impossible, as two
datapoints that are far apart will obtain their q_{ij} almost completely
from the uniform background. Thus, if separate parts of one cluster become
divided at the start of the optimization, there will not be enough attractive
force to pull them back together.
In t-SNE, a quite simple solution to the crowding problem is presented.
Symmetric SNE compares the joint probabilities of datapoints instead of the
distances between them. In the high-dimensional space, these probabilities
are computed using a Gaussian distribution. In the low-dimensional map,
however, they are computed using a probability distribution with much heavier
tails than a Gaussian distribution. As a consequence, the unwanted attractive
forces between dissimilar datapoints are removed, and moderately distant
datapoints can be truthfully mapped in the low-dimensional space. The
Student-t distribution with one degree of freedom is the heavy-tailed
distribution employed in t-SNE, which adjusts the equation for computing
q_{ij} to
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}.   (2.10)
The single degree of freedom ensures that the representation of joint
probabilities in the low-dimensional map is almost invariant to changes in
the scale of the map for map points that are widely separated. An additional
advantage of using the Student-t distribution is that estimating the density
of a datapoint involves much less computation time, as this distribution does
not entail an exponential like the Gaussian distribution. The final gradient,
using the Kullback-Leibler divergence between P, from Eq. 2.6, and the
Student-t based joint probability distribution Q, from Eq. 2.10, is given by
\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}.   (2.11)
Using this gradient, t-SNE ensures that dissimilar datapoints are modeled
by large pairwise distances and similar datapoints by small pairwise
distances. Additionally, optimizing the cost function of t-SNE is much
faster and simpler than optimizing the cost functions of SNE and UNI-SNE.
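Eqs. 2.10 and 2.11 translate almost directly into code. The sketch below uses hypothetical helper names and a uniform P purely as a smoke test:

```python
import numpy as np

def tsne_q(Y):
    """Joint probabilities q_ij under the Student-t distribution with one
    degree of freedom (Eq. 2.10); also returns the unnormalized kernel."""
    D2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=2)
    num = 1.0 / (1.0 + D2)                  # heavy-tailed kernel
    np.fill_diagonal(num, 0.0)
    return num / num.sum(), num

def tsne_gradient(P, Y):
    """Gradient of the t-SNE cost function (Eq. 2.11)."""
    Q, num = tsne_q(Y)
    PQ = (P - Q) * num                      # (p_ij - q_ij)(1 + ||y_i - y_j||^2)^-1
    return 4.0 * np.sum(PQ[:, :, None] * (Y[:, None] - Y[None, :]), axis=1)

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))
P = np.full((6, 6), 1.0 / 30); np.fill_diagonal(P, 0.0)   # uniform joint P
Q, _ = tsne_q(Y)
print(round(Q.sum(), 6))  # 1.0
```

Note that, unlike the Gaussian of Eq. 2.7, the kernel here involves no exponential, which is the computational advantage mentioned above.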
Figure 2.3 shows an illustration from [10] of four diﬀerent techniques
clustering and visualizing high-dimensional data of handwritten digits. The
ﬁgure demonstrates how t-SNE clearly outperforms the other methods.
Figure 2.3: Plots of four techniques, t-SNE, Sammon Mapping, Isomap
and LLE, which cluster and visualize a set of 6,000 handwritten digits [10].
However, even though t-SNE appears to outperform every state-of-the-art
technique, it has three main weaknesses. The first flaw is the non-convexity
of the cost function. This means that values must be chosen for several
optimization parameters; the produced mappings depend on these parameters
and might differ between runs.
The second weakness is that t-SNE is designed especially for data
visualization, and it is not yet certain whether applying the technique to
reduce datasets to d > 3 dimensions, i.e. for purposes other than
visualization, will provide excellent results as well.
The final imperfection of t-SNE is the curse of intrinsic dimensionality,
from which other manifold learners such as LLE and Isomap suffer as well.
As the reduction of dimensionality is mainly based on local properties of the
data, results will be less successful on datasets with a high intrinsic
dimensionality.
However, despite these ﬂaws, t-SNE is still an excellent state-of-the-art tech-
nique capable of retaining the local structure of the data while visualizing
the relevant global structure of the data.
2.2.3 Procrustes Analysis
Procrustes analysis is generally used for analyzing the distribution of
a set of shapes. In addition, it is often applied to remove the translation,
scaling and rotation components from datasets. Similar datasets that are
translated or have different scaling components can still be matched through
the use of this method.
The translational component is removed by translating the dataset such
that the mean of all the datapoints is centered at the origin. Similarly,
by scaling the dataset such that the sum of the squared distances from
the datapoints to the origin is 1, the scaling component is removed. To
remove the rotational component, one of the two datasets is selected as a
reference to which the other dataset is required to conform. Consider
the two datasets (x_i, y_i) and (w_i, z_i), where (w_i, z_i) is required
to conform to (x_i, y_i). Rotating by the angle \theta gives
(u_i, v_i) = (w_i \cos\theta - z_i \sin\theta, \; w_i \sin\theta + z_i \cos\theta).
Subsequently, the Procrustes distance is given, as in [22], by

d = \sqrt{(u_1 - x_1)^2 + (v_1 - y_1)^2 + \cdots}.   (2.12)
Figure 2.4 provides an example of two almost similar datasets with diﬀerent
rotation, scaling and translation components. The right plot shows the
original datasets in addition to the result of applying the Procrustes analysis
such that the second dataset is rotated to match the first dataset. The result
is excellent, as the second dataset almost exactly matches the first.
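SciPy ships an implementation of this procedure. The sketch below applies `scipy.spatial.procrustes` to a dataset and a rotated, scaled and translated copy of it; after alignment the remaining disparity is essentially zero. The dataset itself is synthetic and chosen only for illustration.

```python
import numpy as np
from scipy.spatial import procrustes

# two copies of the same shape, the second rotated, scaled and translated
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = 3.0 * A @ R.T + np.array([5.0, -2.0])

# procrustes standardizes both inputs (translation, scale) and finds the
# optimal rotation; disparity is the residual sum of squared differences
mtx1, mtx2, disparity = procrustes(A, B)
print(round(disparity, 12))  # 0.0
```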
Figure 2.4: The left plot shows two datasets, one depicted by red squares
and one depicted by blue circles. The right plot shows an additional dataset,
depicted by black x’s, representing the blue dataset after applying Procrustes
Analysis.
2.2.4 Elliptic Fourier Descriptors
Elliptic Fourier descriptors were introduced by Kuhl and Giardina in [7]
and are generally applied to describe the shape of objects found in images.
Their shape description is independent of the relative size and position of the
object in the image, since the descriptors are invariant to scale, translation
and rotation. Generally, elliptic Fourier descriptors are used to describe a
closed curve, but they can be applied to open-ended curves, such as the
manifolds of videos, as well. Mathematically, a curve (x(t), y(t)) parameterized
by 0 ≤ t ≤ 2π is expressed as a weighted sum of the Fourier basis functions
[6]:

\begin{pmatrix} x(t) \\ y(t) \end{pmatrix} = \begin{pmatrix} a_0 \\ c_0 \end{pmatrix} + \sum_{k=1}^{\infty} \begin{pmatrix} a_k & b_k \\ c_k & d_k \end{pmatrix} \begin{pmatrix} \cos kt \\ \sin kt \end{pmatrix}   (2.13)
The coefficients in closed form are given by

a_0 = \frac{1}{2\pi} \int_0^{2\pi} x(t)\,dt, \qquad c_0 = \frac{1}{2\pi} \int_0^{2\pi} y(t)\,dt,
a_k = \frac{1}{\pi} \int_0^{2\pi} x(t) \cos kt\,dt, \qquad b_k = \frac{1}{\pi} \int_0^{2\pi} x(t) \sin kt\,dt,   (2.14)
c_k = \frac{1}{\pi} \int_0^{2\pi} y(t) \cos kt\,dt, \qquad d_k = \frac{1}{\pi} \int_0^{2\pi} y(t) \sin kt\,dt.
Consequently, the curve (x(t), y(t)) is described by a_0, c_0, a_1, b_1,
c_1, d_1, \ldots. In other words, the curve is described in terms of its
angles and slopes, which removes the scaling and translational components.
By subsequently taking the absolute values of the descriptors, it becomes
irrelevant whether slopes go up or down, which essentially removes the
rotational/reflectional component of datasets.
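The coefficients of Eq. 2.14 can be approximated numerically for any sampled curve. The sketch below (hypothetical helper name `elliptic_fourier_coeffs`) recovers the semi-axes of an ellipse as its first-order coefficients:

```python
import numpy as np

def elliptic_fourier_coeffs(x, y, t, K=3):
    """Approximate the integrals of Eq. 2.14 by rectangle sums, for a curve
    (x(t), y(t)) sampled at evenly spaced parameter values t."""
    dt = t[1] - t[0]
    a0 = np.sum(x) * dt / (2 * np.pi)
    c0 = np.sum(y) * dt / (2 * np.pi)
    coeffs = [(np.sum(x * np.cos(k * t)) * dt / np.pi,   # a_k
               np.sum(x * np.sin(k * t)) * dt / np.pi,   # b_k
               np.sum(y * np.cos(k * t)) * dt / np.pi,   # c_k
               np.sum(y * np.sin(k * t)) * dt / np.pi)   # d_k
              for k in range(1, K + 1)]
    return a0, c0, coeffs

# sanity check on an ellipse with semi-axes 2 and 1: a1 ~ 2, d1 ~ 1, rest ~ 0
t = np.linspace(0.0, 2 * np.pi, 2000, endpoint=False)
a0, c0, coeffs = elliptic_fourier_coeffs(2 * np.cos(t), np.sin(t), t)
a1, b1, c1, d1 = coeffs[0]
print(round(a1, 3), round(d1, 3))  # 2.0 1.0
```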
2.3 Classiﬁcation
In the final classification step, a k-nearest neighbor method is applied. This
technique determines the k nearest neighbors of the test object and
classifies the object according to the majority vote of these neighbors.
For manifolds, this means that a distance matrix between the test manifold
and the manifolds in the database is created, after which the k nearest
neighbors are determined. The test manifold is then classified as the gesture
that holds the majority vote among these neighbors.
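A minimal sketch of this classification step, with flattened 2 x 180 manifolds as feature vectors; the toy database and labels below are invented for illustration.

```python
import numpy as np

def knn_classify(test_vec, database, labels, k=5):
    """Classify a flattened manifold by majority vote over its k nearest
    neighbors in the database (rows are flattened 2 x 180 manifolds)."""
    dists = np.linalg.norm(database - test_vec, axis=1)  # one row of the distance matrix
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)              # majority vote

# toy example with two well-separated gesture clusters
rng = np.random.default_rng(0)
db = np.vstack([rng.normal(0, 0.1, (10, 360)), rng.normal(5, 0.1, (10, 360))])
labels = ['click'] * 10 + ['cut'] * 10
print(knn_classify(np.full(360, 5.0), db, labels, k=5))  # cut
```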
Chapter 3
Methodology
This chapter focuses on the experimental setup of the GM-method. Chapter
1 clarified that many different approaches and techniques have been applied
for comparison, and the implementations of these methods will be explained
in this chapter. Subsection 3.1 provides details on the creation and
development of the dataset. Explanations of the two main steps of the GM-
method, preprocessing and manifold learning, are provided respectively in
Subsections 3.2 and 3.3. Details on the classification step of the GM-method
were provided in Chapter 2 and require no further explanation. Finally,
Subsection 3.4 presents the evaluation criteria of the GM-method.
3.1 Creation of the dataset
Databases of videos of hand gestures are unfortunately not publicly
available. Several videos of people using American Sign Language (ASL)
exist online, but these are not sufficient to create an entire dataset.
Therefore, a new dataset was created using a webcam combined with a white
wall as background. Additional videos comprising a more detailed background
were recorded as well for further experiments, which is explained in
Subsection 4.2. Keeping in mind that the goal of this study is that people
can use a ﬁnal version of this program to input commands to their comput-
ers through hand gestures, a standard webcam with a resolution of 320 x
240 recording at a speed of 30 frames per second was used. A set of ﬁve
diﬀerent hand gestures was created, based on diﬀerences in wrist rotation,
movement and ﬁnger extensions. Illustrations of each of these hand gestures
are depicted in Figure 3.1. Clearly, any computer command may be asso-
ciated with each of these gestures, thus their names are suggested in this
study merely for easier comprehension.
Five different persons were each asked to perform the hand gestures
presented in Figure 3.1 ten times, to ensure the GM-method is largely
invariant to different shapes of hands. These test persons were shown one example
Figure 3.1: Two frames of the gestures in descending order; ‘click’, ‘cut’,
‘grab’, ‘paste’ and ‘move’
of each hand gesture beforehand, and subsequently asked to imitate this
example as closely as possible in front of the webcam. Thus in total, each
person performed 50 hand gestures. Afterwards, for each gesture, the five
attempts out of ten that appeared most similar to the shown example were
selected. Altogether, the number of selected videos was 5 persons x 5
attempts x 5 gestures = 125 videos. Note that the video of each separate
gesture was cut out of the main video containing the ten attempts. Therefore,
the videos contained, as closely as possible, only the start of the gesture
until its end. However, cutting sequences out of a video is a delicate
procedure, which resulted in videos containing only the gesture itself, but
not being aligned in time. For instance, one video of the gesture 'click'
could have the finger moving at frame 10, whereas another video had the
finger moving at frame 20. For classification purposes,
this concept will be further discussed in Subsection 4.3.
3.2 Preprocessing
To eliminate noise in the frames of the hand-gesture videos, it is desirable
to extract only the hand from the frames. This can be achieved by computing
the differences between frames to locate the relevant pixels, which
essentially represent the motion in the video. Clearly, this method is based
on the assumption that only the hand is moving in the videos. Another method
involves extracting only the color of the skin from the frames to eliminate
the background. As Chapter 2 explained, four different
approaches were implemented in the preprocessing stage. The ﬁrst approach
is explained in Subsection 3.2.1, whereas the second approach regarding the
computation of diﬀerences will be clariﬁed in Subsection 3.2.2. Details on
the change-dependent diﬀerence-frames are provided in Subsection 3.2.3,
whereas the approach concerning skin color is elucidated in Subsection 3.2.4.
3.2.1 Raw input
As explained in Chapter 2, the raw input approach involved graying, nor-
malizing and smoothing the frames of each video, which resulted in a matrix
of 320 x 240 for each frame. Afterwards, the matrices of the frames were
converted into a vector, through positioning the rows of the matrix behind
each other. Thus, converting a matrix of 320 x 240 produces a vector of 1
x 76800. For example, the longest video in the dataset contained 90 frames.
Consequently, this video was processed into a matrix of 90 x 76800.
Figure 3.2 provides an illustration of the results of graying and smoothing a
frame of a video from the dataset.
Figure 3.2: Preprocessing a frame; graying and subsequently smoothing
the image
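In code, this flattening is a single reshape; random pixel values stand in for real grayed and smoothed frames in this sketch.

```python
import numpy as np

# a video of n frames, each a 320 x 240 grayscale matrix, becomes an n x 76800 matrix
n_frames = 90
video = np.random.rand(n_frames, 320, 240)   # stand-in for grayed, smoothed frames
X = video.reshape(n_frames, -1)              # rows of each matrix placed side by side
print(X.shape)  # (90, 76800)
```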
3.2.2 Binary diﬀerence-frames
Pixels that have different values in subsequent frames suggest motion; other
pixels indicate only background or noise and can be eliminated. Thus, for
every two subsequent frames, a 'difference-frame' was created, using only
the pixels that changed. A threshold was necessary to determine when the
change in a pixel was large enough for that pixel to be considered relevant.
In addition, an extra threshold was implemented to determine whether there
were enough relevant pixels that changed sufficiently according to the first
threshold. Thus, the second threshold determined whether a difference-frame
was important enough to use. Clearly, with 30 frames every second, several
frames appear very similar and might not contain any motion, rendering them
quite irrelevant.
These thresholds were both determined through observation when ex-
perimenting with several videos. The ﬁrst threshold to determine if the
diﬀerence between pixels was suﬃcient was set to a value of 0.10. The sec-
ond threshold which decides whether a frame was relevant depending on the
amount of pixels that changed was set to a value of 300. However, further
research showed that several videos either lacked sufficient change or
changed excessively. This resulted in videos having either no
difference-frames at all, or too many difference-frames with too many
changing pixels, thus still retaining background noise. Therefore, a search
algorithm was implemented which determined the ideal thresholds for every
video separately.
This algorithm ensured a minimum of 10 frames, to at least represent the
gesture correctly. A maximum of 25 frames was set as well, to guarantee
an acceptable reduction of background noise. The pixels that changed suf-
ﬁciently according to the ﬁrst threshold were set to a value of 0 (thus, a
black pixel), whereas pixels with insuﬃcient change were set to a value of
1 (a white pixel). Thus, the diﬀerence-frames that were created for each
video were in fact binary images, consisting only of values of either 0 or 1.
Figure 3.3 provides an example of plots of these diﬀerence-frames for the
gesture ‘move’. These binary diﬀerence-frames were subsequently used as
input in Isomap/t-SNE, instead of the regular grayed and smoothed frames.
Figure 3.3: Two plots of the binary ‘diﬀerence-frames’ of the gesture ‘move’
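A minimal sketch of the two-threshold procedure described above; the per-video search for ideal thresholds is omitted, and the synthetic frames are for illustration only.

```python
import numpy as np

def binary_difference_frames(frames, pixel_thresh=0.10, count_thresh=300):
    """Binary difference-frames: pixels that change by more than pixel_thresh
    between consecutive frames become black (0); frames with fewer than
    count_thresh changed pixels are discarded as irrelevant."""
    out = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        changed = np.abs(curr - prev) > pixel_thresh
        if changed.sum() >= count_thresh:
            out.append(np.where(changed, 0.0, 1.0))  # black = motion, white = rest
    return out

frames = [np.zeros((240, 320)) for _ in range(3)]
frames[1][100:120, 100:120] = 1.0                # a 400-pixel "moving hand"
diffs = binary_difference_frames(frames)
print(len(diffs))  # 2: both transitions change 400 pixels
```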
3.2.3 Change-dependent diﬀerence-frames
Research revealed that several of the binary difference-frames still contained
many irrelevant black pixels, which only barely passed the requirement of the
first threshold. Thus, to enhance the difference-frame approach, it was necessary
to replace the binary frames with regular non-binary images. Rather than
giving pixels either a value of 0 or 1 depending on whether they passed the
threshold, their values would depend on their rate of change. Consequently,
irrelevant pixels would obtain a lesser gray-value while more relevant pixels
would acquire a higher gray-value. Thus, images were converted from binary
images into normal gray images, having pixels depending on the amount they
essentially changed in subsequent frames. Figure 3.4 presents two plots of
these difference-frames for the gesture 'grab', to show the difference between
binary difference-frames and change-dependent difference-frames. The plots
clearly show the differences between the gray-values of pixels.
Figure 3.4: Two plots of change-dependent diﬀerence-frames of the gesture
‘grab’
3.2.4 Extracting skin color
This approach involves extracting the skin color from the frames in order to
reduce the background noise. As the background is a white wall, the RGB
channels could be used eﬃciently to extract only features of the hand/arm.
Figure 3.5: Two plots of skin color frames of the gesture ‘cut’
The red channel of the RGB channels contains nearly all hand pixels and is
sufficient to extract skin color. Similar to the difference-frames, a threshold was
determined to allow pixels to gain relevance or not, based on their level of
redness. Figure 3.5 provides an example with two illustrations of frames of
the gesture ‘cut’, preprocessed with this method.
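A sketch of this idea; the exact thresholding rule used in the thesis is not specified beyond the red channel, so the redness ratio and the threshold value below are illustrative assumptions of my own.

```python
import numpy as np

def skin_mask(frame_rgb, thresh=0.45):
    """Mark a pixel as skin when its red share of total intensity exceeds
    thresh. A plain red-channel cut would also pass the bright white wall,
    so this sketch uses red relative to overall brightness instead."""
    r = frame_rgb[..., 0].astype(float)
    total = frame_rgb.sum(axis=-1).astype(float) + 1e-8
    return (r / total) > thresh

# skin-like pixel (200, 120, 90) vs. white wall pixel (250, 250, 250)
frame = np.array([[[200, 120, 90], [250, 250, 250]]], dtype=np.uint8)
print(skin_mask(frame))  # [[ True False]]
```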
3.3 Manifold learning
The most relevant feature and novelty of this method is that it identifies
hand gestures based solely on the motion of the gesture. In other words,
where other techniques classify certain relevant frames of the video, this
approach classifies the entire trajectory of the frames in the image space.
Dimensionality reduction techniques like Isomap, LLE and t-SNE appear to be
quite suitable for such an approach, as these methods are capable of
producing a d-dimensional manifold of a video. These constructed manifolds
represent the trajectory of an ordering of images in the image space; in
other words, they represent the ordering of the frames of a video.
After preprocessing, the videos were prepared to serve as input for a non-
linear dimensionality reduction technique. Normally, in Isomap and t-SNE,
it is common to use a matrix containing all the videos of all gestures as
the input matrix. This way, all the frames of all the videos could form
the image space, and by knowing for each two-dimensional point in the
mapping which frame of which gesture it represented, it would be possible
to generate trajectories through that image space. When a new video required
classification, each frame of that video could be classified in the image
space, resulting in an identification of the gesture of the new video.
However, with this general procedure, static images would be classified,
whereas the focus of this thesis is classifying purely the motion
of a gesture. Therefore, instead of using all the videos as one input for a
non-linear dimensionality reduction technique, every video was separately
used as input. Thus, for every video, a separate manifold was constructed,
assuming manifolds of the same gesture would appear similar. Subsections
3.3.1 and 3.3.2 provide explanations on the implementation of respectively
Isomap and t-SNE. As Chapter 2 explained and illustrations in Subsection
3.3.1 will demonstrate, additional dataset matching methods were required
to improve manifold alignment. These methods include Procrustes Analysis
and Fourier descriptors, which will be explained respectively in Subsections
3.3.3 and 3.3.4.
3.3.1 Isomap
Isomap requires a matrix with rows as datapoints and columns as dimen-
sions. Thus, rows would be the frames of the video, whereas the number
of dimensions would be 76800. Additionally, Isomap requires two diﬀerent
parameters; the dimension d the input matrix should be reduced to, and
the k-number of neighbors it should use. In [1] top results were achieved
using a dimension of 2, which is basically the default dimension as well. For
the k-number of neighbors, results generally vary depending on the dataset.
Thus, the dimension was set to 2, and manifolds were created for k-number
of neighbors ranging from 10 to 25.
However, several complications surfaced when processing videos of different
lengths. Saving all the different-length manifolds of the same gesture in one
matrix is complex, and comparing manifolds of different lengths would be
problematic as well. Therefore, in [1] interpolating the low-dimensional
mappings is presented as a solution for manifolds of different lengths. Twice
the number of frames of the longest video was used as the standard number of
frames for each video. Thus, every manifold that was created using Isomap
was directly interpolated to that standard value, which in this study was
180. As a consequence,
Isomap returned the low-dimensional mappings in the form of a matrix of
2 x 180, for each video. Figure 3.6 presents plots of two manifolds of the
gesture ‘cut’, whereas Figure 3.7 shows plots of two manifolds of the gesture
‘move’. The manifold itself is only two-dimensional but the ﬁgures contain
an additional axis. The cause is the reintroduction of time, which is rep-
resented by the x-axis. Reintroducing time produces a clearer view of the
trajectory of the frames in time.
Figure 3.6: Two manifolds of the gesture ‘cut’
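The interpolation to a fixed number of 180 points can be sketched as a per-dimension linear resampling; the helper name is my own, and the thesis may have used a different interpolation scheme.

```python
import numpy as np

def interpolate_manifold(Y, n_points=180):
    """Resample a manifold with any number of frames (rows of Y) to a fixed
    2 x n_points representation by linear interpolation along the frame index."""
    old_t = np.linspace(0.0, 1.0, Y.shape[0])
    new_t = np.linspace(0.0, 1.0, n_points)
    return np.vstack([np.interp(new_t, old_t, Y[:, d]) for d in range(Y.shape[1])])

Y = np.random.rand(47, 2)      # Isomap output for a hypothetical 47-frame video
M = interpolate_manifold(Y)
print(M.shape)  # (2, 180)
```

The first and last columns of the result coincide with the first and last frames of the original manifold, so the trajectory's endpoints are preserved.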
Clearly, these plots demonstrate that manifolds of the same gesture appear
similar, whereas they differ considerably from manifolds of the other
gesture. However, Figure 3.8 provides a plot of two
manifolds of the same gesture ‘cut’ as in Figure 3.6.
Figure 3.7: Two manifolds of the gesture ‘click’
Figure 3.8: Two additional manifolds of the gesture ‘cut’
The manifolds of Figure 3.8 seem comparable to each other, but they do not
appear similar to the manifolds of the same gesture in Figure 3.6. However,
upon closer observation it is quite noticeable that they essentially do
appear similar: they are simply flipped vertically. Figure 3.9 shows the
plots of Figure 3.8 flipped vertically, which demonstrates that the flipped
manifolds actually do appear similar to the other manifolds of the gesture
'cut'.
These rotations are caused by the multidimensional scaling (MDS) step in Isomap’s algorithm. MDS produces a manifold that is correct in terms of the distances between datapoints. However, as the method is based purely on these distances, it is insensitive to rotation, translation and reflection. Matching these rotated manifolds with unrotated ones proved quite complicated, as the values of the datapoints diverge considerably.
3.3.2 t-SNE
The previous subsections explained the preprocessing of the videos and the subsequent application of Isomap. In order to compare two non-linear dimensionality reduction techniques, the t-SNE technique was incorporated in this study as well.
Figure 3.9: The two manifolds of Figure 3.8 ﬂipped vertically
This method requires four input parameters, the first of which is the dataset itself, with rows as datapoints and columns as dimensions. The second and third parameters specify, respectively, the number of final dimensions the dataset should be reduced to, and the number of dimensions to which the Principal Component Analysis (PCA) in the first part of t-SNE should reduce the dataset. The final number of dimensions was set to 2, the same value selected for Isomap. For the initial number of dimensions for PCA, the default value of 30 was used. The fourth parameter is the perplexity, which essentially corresponds to the k-number of neighbours. Experiments showed that varying the perplexity had no influence on the results, so it was set to the default value of 30. As with Isomap, the resulting mappings were interpolated to obtain a 2 x 180 matrix for each video.
Figure 3.10: Two low-dimensional mappings of the same video of the
gesture ‘click’, created by t-SNE
Examples of resulting plots for the gesture ‘click’ are provided in Figure 3.10. These plots show two very dissimilar manifolds, although they are in fact the result of applying t-SNE twice to one and the same video. Thus, t-SNE returns two completely different mappings for exactly the same video. The cause is the non-convexity of its cost function, which is explained in Chapter 2 as a weakness of t-SNE. Due to the optimization process, the error is often different in every run, resulting in different mappings every time. Clearly, this influences the classification results negatively. Low-dimensional mappings of the same gesture were generally dissimilar, whereas Isomap produced very similar manifolds. Chapter 4 presents the experimental results using the t-SNE technique.
3.3.3 Procrustes analysis
Subsection 3.3.1 shows plots of rotated manifolds caused by multidimensional scaling. Although the manifolds are very similar when visualized correctly, rotational components greatly complicate the classification of gestures. Fortunately, several techniques exist to resolve the differing rotations, translations and scalings of similar datasets, such as the Procrustes analysis.
The Procrustes analysis requires two input matrices. The first matrix is the dataset that stays fixed, whereas the second matrix represents the dataset that is rotated, scaled and translated to match the first. The output consists of the transformed second dataset, in addition to a dissimilarity value between 0 and 1 that expresses how dissimilar the two input datasets are. For example, if the returned dissimilarity value is 1, there is no similarity at all and applying the Procrustes analysis is futile.
As the first input stays fixed, the first matrix acts as a reference point to which all other matrices are rotated, scaled and translated. Thus, for each gesture, one of the 25 videos needed to serve as the reference dataset, to which all other videos were matched using the Procrustes analysis. The dissimilarity value proved rather useful in this process. A search algorithm was implemented to discover which video served best as a reference point for the other videos. This search made each video the reference point at least once, while computing the dissimilarity values between the reference dataset and all other videos. The video with the minimum sum of dissimilarity values, i.e. the manifold that appeared most similar to all other manifolds, was consequently most suitable to serve as the reference matrix. Such a reference matrix was determined for each gesture, after which all other manifolds were transformed using the implementation of the Procrustes analysis.
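The alignment and the reference search can be sketched as follows (a NumPy sketch of classical Procrustes superimposition; the thesis used a Matlab implementation, and the function names here are illustrative):

```python
import numpy as np

def procrustes_align(ref, target):
    """Translate, scale and rotate/reflect `target` (N x 2) to best match
    `ref`; returns the aligned copy and a dissimilarity in [0, 1]."""
    a = ref - ref.mean(axis=0)          # center both point sets
    b = target - target.mean(axis=0)
    a = a / np.linalg.norm(a)           # normalize to unit Frobenius norm
    b = b / np.linalg.norm(b)
    u, s, vt = np.linalg.svd(b.T @ a)   # optimal orthogonal transform via SVD
    rotation = u @ vt
    scale = s.sum()
    return scale * b @ rotation, 1.0 - scale ** 2

def best_reference(manifolds):
    """Index of the manifold with the smallest summed dissimilarity to all
    others, as in the search described above."""
    sums = [sum(procrustes_align(m, other)[1]
                for j, other in enumerate(manifolds) if j != i)
            for i, m in enumerate(manifolds)]
    return int(np.argmin(sums))
```

A rotated, translated and scaled copy of a manifold aligns back onto the original with a dissimilarity of essentially zero, which is exactly the property exploited by the reference search.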
3.3.4 Elliptic Fourier Descriptors
Elliptic Fourier descriptors are generally used to describe the closed contours of object shapes in images, but they can be applied to the open-ended manifolds in this study as well. They represent a manifold in terms of its angles and slopes using the coefficients presented in Subsection 2.3. As input parameters, the algorithm solely requires the manifold itself and the number of harmonics used to create the shape spectrum. Experiments showed that the number of harmonics does not affect the results when it is higher than 10, so to minimize memory costs the standard value of 10 was selected. The output is therefore a 4 x 10 matrix of Fourier shape descriptors. These descriptors are invariant to scale and translational components, and by subsequently taking the absolute values of the descriptors, the rotational component is eliminated as well. Thus, the issue of rotations/reflections in manifolds, as shown in Figure 3.9, is resolved.
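As a sketch of how these coefficients are obtained, the Kuhl and Giardina formulas [7] can be implemented as follows (a minimal NumPy translation; the function name is illustrative, it assumes consecutive curve points are distinct, and the curve is traversed as for a closed contour):

```python
import numpy as np

def elliptic_fourier_descriptors(contour, harmonics=10):
    """Elliptic Fourier coefficients (Kuhl & Giardina) of an (N, 2) curve.
    Returns a (4, harmonics) matrix with rows [a_n, b_n, c_n, d_n]."""
    d = np.diff(contour, axis=0)             # segment vectors
    dt = np.hypot(d[:, 0], d[:, 1])          # segment lengths (assumed > 0)
    t = np.concatenate(([0.0], np.cumsum(dt)))
    T = t[-1]                                # total arc length
    coeffs = np.zeros((4, harmonics))
    for n in range(1, harmonics + 1):
        c0 = T / (2.0 * (n * np.pi) ** 2)
        phi = 2.0 * n * np.pi * t / T
        dcos, dsin = np.diff(np.cos(phi)), np.diff(np.sin(phi))
        coeffs[0, n - 1] = c0 * np.sum(d[:, 0] / dt * dcos)  # a_n
        coeffs[1, n - 1] = c0 * np.sum(d[:, 0] / dt * dsin)  # b_n
        coeffs[2, n - 1] = c0 * np.sum(d[:, 1] / dt * dcos)  # c_n
        coeffs[3, n - 1] = c0 * np.sum(d[:, 1] / dt * dsin)  # d_n
    return coeffs
```

Taking `np.abs` of the returned matrix then removes the rotational component, as described above. For a unit circle, the first harmonic dominates and the higher harmonics vanish, which is a convenient sanity check for an implementation.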
3.4 Evaluation criteria
For evaluation purposes, it should be determined which classification percentage indicates successful recognition. Compared with other methods in the literature, the minimal recognition rate of distinct hand gestures is around 60-85% [3]. Using Locally Linear Embedding, [3] successfully recognized the dynamics of hand gestures up to 93.2%. However, their gesture set consisted only of gestures with finger extensions, whereas the gesture set of this study contains gestures based on differences in wrist rotation, movement and finger extensions. Therefore, the criterion for successful recognition in this thesis is a classification percentage of at least 60%, and preferably above 80%. A classification percentage above 90% indicates excellent recognition.
Chapter 4
Experimental results
This chapter reports the results of the main experiments performed in this thesis. The experiments were executed in the mathematical programming language Matlab R2007b. The dataset was created as explained in Chapter 2, purely for use in this study, although it might be exploited in other studies as well. Section 4.1 provides the classification percentages achieved with Isomap and t-SNE, Section 4.2 presents several confusion matrices, and Section 4.3 presents the discussion of this thesis.
4.1 Classiﬁcation results
To obtain reliable classification results, a 5-fold cross-validation procedure was used in the experiments. The 125 videos were divided in five different ways into a training and a test set, applying a ratio of 1/3 for the test set and 2/3 for the training set. As there were 25 videos of each gesture, the training set for each gesture consisted of 17 videos and the test set of 8 videos. In total, each training set consisted of 85 videos and each test set of 40 videos. To summarize, 5 separate divisions of 85 training videos and 40 test videos were constructed for the experiments.
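The construction of the five divisions per gesture can be sketched as follows (a Python sketch; the use of repeated random splits, rather than a fixed partition, is an assumption about how the divisions were drawn):

```python
import random

def make_divisions(n_videos=25, n_divisions=5, test_size=8, seed=0):
    """For one gesture: five train/test divisions of its 25 videos,
    each with 17 training and 8 test videos."""
    rng = random.Random(seed)
    divisions = []
    for _ in range(n_divisions):
        idx = list(range(n_videos))
        rng.shuffle(idx)  # random order, then split off the test videos
        divisions.append((sorted(idx[test_size:]), sorted(idx[:test_size])))
    return divisions
```

Repeating this for all five gestures yields the 85-video training sets and 40-video test sets described above.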
Several experiments were conducted, as the GM-method comprises four preprocessing approaches, two manifold learning techniques and two manifold matching methods. Raw frames, binary difference-frames, change-dependent difference-frames and skin color frames are the four main preprocessing approaches. These four inputs are used by both Isomap and t-SNE, in combination with either raw input frames, Fourier descriptors or Procrustes analysis. The k-numbers of neighbors used by Isomap and by the classification method are ranged for comparison.
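The classification method itself is the k-nearest neighbour rule on the matched manifold representations; a minimal sketch (the helper name and the use of Euclidean distance on flattened feature vectors are assumptions):

```python
import numpy as np

def knn_classify(train_feats, train_labels, test_feat, k=3):
    """Majority vote among the k training examples nearest to test_feat
    (Euclidean distance on flattened feature vectors)."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)    # most frequent label wins
```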
Figure 4.1 presents two graphs of the average classification performance of Isomap, based on the 5-fold cross-validation, for these four approaches using raw frames. The left plot shows the results when ranging the k-number of neighbors the classification method uses, whereas the right plot shows the results when ranging the k-number of neighbors Isomap employs. For each k in both plots, the highest percentage obtained while ranging the other k-number of neighbors is selected.
Figure 4.1: Classification percentages using raw frames as input for Isomap with four approaches; raw Isomap (red, square), binary difference-frames (blue, circle), change-dependent difference-frames (green, x) and skin color frames (black, triangle). The left plot has the k-number of neighbors of the classification method ranging from 3 to 15, whereas the second plot has the k-number of neighbors Isomap uses ranging from 10 to 25.
Figure 4.2: Classification percentages using Fourier descriptors as input for Isomap with four approaches; raw Isomap (red, square), binary difference-frames (blue, circle), change-dependent difference-frames (green, x) and skin color frames (black, triangle). The left plot has the k-number of neighbors of the classification method ranging from 3 to 15, whereas the second plot has the k-number of neighbors Isomap uses ranging from 10 to 25.
Figure 4.3: Classification percentages using Procrustes analysis as input for Isomap with four approaches; raw Isomap (red, square), binary difference-frames (blue, circle), change-dependent difference-frames (green, x) and skin color frames (black, triangle). The left plot has the k-number of neighbors of the classification method ranging from 3 to 15, whereas the second plot has the k-number of neighbors Isomap uses ranging from 10 to 25.
Figure 4.2 presents similar plots, with the results now based on Fourier descriptors as input instead of raw frames. Similarly, Figure 4.3 displays the results for the approaches that apply the Procrustes analysis. These graphs all represent results from Isomap; the results from t-SNE are presented in Figure 4.4. Since the perplexity of t-SNE does not affect the results, only the k-number of neighbors of the classification method was ranged. The left plot of Figure 4.4 shows the results for raw frames, the middle plot the results using Fourier descriptors, and the bottom plot the results using the Procrustes analysis.
Overall, these graphs show that the k-number of neighbors of the classification method was best set between 3 and 5, indicating possibly smaller clusters of gestures. For the k-number of neighbors Isomap uses, the highest recognition rates were achieved with high values between 21 and 25, which suggests that many frames of the video are of high importance. By combining the results of the previous graphs, two final tables were constructed, presented in Table 4.1 and Table 4.2. These tables display, respectively, the overall average results of applying Isomap and t-SNE with the four preprocessing approaches, in combination with raw frames, Fourier descriptors or Procrustes analysis.
Figure 4.4: Classification percentages of t-SNE, while ranging the k-number of neighbors of the classification method, for three inputs: raw frames (left plot), Fourier descriptors (right plot) and Procrustes analysis (bottom plot). Applied to t-SNE with four approaches; raw t-SNE (red, square), binary difference-frames (blue, circle), change-dependent difference-frames (green, x) and skin color frames (black, triangle).
                              Raw          Binary       Change-      Skin-color
                              frames       difference-  dependent    frames
                                           frames       frames
Isomap                        53.6% ± 3.7  49.2% ± 3.7  44.2% ± 3.7  59.4% ± 4.3
Isomap + Fourier descriptors  61.6% ± 8.4  75.4% ± 2.5  83.8% ± 2.9  79.8% ± 5.6
Isomap + Procrustes analysis  64.6% ± 6.2  70.8% ± 5.2  67.0% ± 4.5  60.4% ± 4.6
Table 4.1: Highest classiﬁcation results of Isomap combined with four
preprocessing approaches and two manifold matching methods
                              Raw          Binary       Change-      Skin-color
                              frames       difference-  dependent    frames
                                           frames       frames
t-SNE                         22.8% ± 2.9  23.2% ± 4.1  22.2% ± 4.5  27.6% ± 2.5
t-SNE + Fourier descriptors   25.2% ± 8.3  34.6% ± 7.5  53.0% ± 4.1  41.8% ± 1.6
t-SNE + Procrustes analysis   26.4% ± 4.2  26.8% ± 7.6  31.2% ± 8.7  27.2% ± 6.3
Table 4.2: Highest classiﬁcation results of t-SNE combined with four pre-
processing approaches and two manifold matching methods
4.2 Incorrectly classiﬁed gestures
Confusion tables represent classification results per gesture, allowing a better understanding of wrongly classified objects. The low performance of t-SNE makes constructing confusion tables for that method futile. For Isomap, however, it seems useful to produce average confusion tables in order to determine which gestures are hard to identify and which are easily classified. Average confusion tables were constructed for the two best performing preprocessing approaches, change-dependent difference-frames and skin color frames, both combined with Fourier descriptors.
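Such a table is simply a tally of predictions per true gesture; a minimal sketch (with illustrative names):

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows: true gesture; columns: predicted gesture."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)))
    for t, p in zip(true_labels, predicted_labels):
        cm[index[t], index[p]] += 1
    return cm
```

Averaging such tables over the cross-validation runs yields fractional counts like those in Tables 4.3 and 4.4.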
Click Cut Grab Paste Move
Click 7.2 0.8 0.0 0.0 0.0
Cut 0.5 7.5 0.0 0.0 0.0
Grab 0.6 0.7 5.6 1.0 0.1
Paste 2.8 0.7 0.1 4.4 0.0
Move 0.0 0.2 0.3 0.5 7.0
Table 4.3: Average confusion table for Isomap combined with change-
dependent diﬀerence-frames
These tables were created by averaging over the 5-fold cross-validation runs of the three best performing k-nearest-neighbor settings, for both Isomap and the classification method. The confusion table for change-dependent difference-frames is displayed in Table 4.3, whereas the confusion table for skin color frames is presented in Table 4.4. Note that the test set consisted of 8 videos for each gesture, so the maximum classification value for each gesture in these tables is 8.
Click Cut Grab Paste Move
Click 7.5 0.0 0.0 0.5 0.0
Cut 0.5 6.8 0.1 0.6 0.0
Grab 0.3 1.0 6.2 0.1 0.4
Paste 2.5 2.0 0.0 3.5 0.0
Move 0.0 1.0 0.9 0.2 5.9
Table 4.4: Average confusion table for Isomap combined with skin color
frames
Both confusion tables show similar results. The gestures ‘click’, ‘cut’, ‘grab’ and ‘move’ are classified quite well, whereas the gesture ‘paste’ obtains the lowest value in both approaches. In addition, in both confusion tables this gesture is most often misclassified as a ‘click’ gesture. Looking at the start and end frames of these gestures, as displayed in Figure 3.1, the cause of the error is quite evident. Both gestures start with a fist posture in the middle of the frame and end with a fist with one finger on its left side extended upwards. Although the approaches detect to some extent the difference between the wrist rotation and the simple finger extension, in addition to the arm being at different angles, the gestures simply appear too similar for an optimal classification result. Therefore, new experiments were conducted omitting the gesture ‘paste’, to see how positively this would affect the classification results.
                              Change-dependent  Skin-color
                              frames            frames
Isomap + Fourier descriptors  91.6% ± 3.9       92.2% ± 3.4
Table 4.5: Highest classiﬁcation results of Isomap with Fourier descriptors
using 4 gestures, combined with change-dependent diﬀerence-frames and
skin color frames
Only the best performing approaches, change-dependent difference-frames and skin color frames, were used, combined with Isomap and Fourier descriptors. Table 4.5 presents the results of these experiments.
In order to evaluate how well the change-dependent difference-frames approach performs on frames with more difficult backgrounds, a very small additional dataset was constructed, consisting of 4 videos of the gesture ‘cut’ filmed from the typical point of view of a user sitting behind his computer. The background contained several multi-colored objects, including a window, implying varying lighting conditions. The k-number of neighbors of the classification method was set to values between 3 and 5, whereas the k-number of neighbors Isomap uses was set to values between 21 and 25. The average of the classification process is shown in the confusion table presented in Table 4.6, which shows that 85% of the videos were classified correctly.
Click Cut Grab Paste Move
Cut 0.2 3.4 0.0 0.3 0.0
Table 4.6: Confusion table of videos containing a diﬃcult background,
using Isomap combined with Fourier descriptors and change-dependent
diﬀerence-frames
4.3 Discussion
Focusing purely on motion in order to recognize hand gestures offers several advantages over analysing static frames, considering the various approaches in this study. However, several limitations have been discovered as well. These advantages and general restrictions are now explained, drawing together the approaches discussed in this study.
In static frames, background objects influence the image analysis negatively, as they may reduce the accuracy of identifying the hand. Therefore, additional algorithms are required to identify the hand prior to analyzing the hand posture. Different lighting conditions, which cause the hand to appear darker or lighter, may negatively affect recognition in static frames as well. Using difference-frames, there is no need for additional algorithms to identify the hand, since the focus is only on motion. For the same reason, static background objects have no influence whatsoever when using difference-frames. Section 4.2 demonstrated that applying the difference-frames approach to videos with a more detailed background resulted in the same recognition rate.
The distance of the hand from the camera has also hindered recognition in static frames: recognizing the posture of a hand far away in the frame is rather complicated. Using motion, however, recognition is to a certain extent invariant to distance, as the motion remains the same regardless of how far away the hand is situated in the video.
Thus, state-of-the-art techniques are so far hindered by the background restrictions explained above. The GM-method with the difference-frames approach, focusing purely on motion, essentially resolves these limitations. Any other movements in the videos may decrease the performance, though, as every difference between frames is registered. However, even human beings have difficulty recognizing several moving features at the same time. Furthermore, the selected thresholds in the approach help determine whether the change between frames suffices, which may filter out a small part of the other possible movements.
Using the color of the skin guarantees that the features of the hand are extracted from the frames of the video. However, if the background contains objects with the same levels in the RGB channels as human skin, these objects will be taken into account as well, which would clearly affect the recognition performance negatively. For users with different skin colors, an adaptation of the selected thresholds for the RGB channels is required as well. In addition, frames that are irrelevant due to the absence of movement are taken into account too, although they only slightly influence the overall manifold. This limitation is solved by the difference-frames approach, which ensures that only relevant frames are considered.
The difference between the results of Isomap and t-SNE shows that it is necessary to use a non-linear dimensionality reduction technique with a convex cost function. The non-convexity of the cost function of t-SNE can cause a different result/manifold in each separate run, even if the technique is applied to exactly the same video. Evidently, this decreases the recognition performance significantly. Thus, the strategy employed in this study is restricted to convex non-linear dimensionality reduction techniques.
When analyzing static frames, it is common to feed all frames of all videos at once into non-linear dimensionality reduction techniques like Isomap and t-SNE. However, this requires enormous computational and memory resources, which limits the use of that approach. The focus on motion in this study avoids these restrictions, since the techniques are applied to each video separately, which requires far less memory and computational power.
Chapter 5
Conclusions and future
research
This chapter offers several conclusions drawn from the results of this study presented in Chapter 4. These conclusions are presented in Section 5.1, whereas Section 5.2 discusses the shortcomings of this study and offers recommendations for further research.
5.1 Conclusions
This thesis has attempted the automatic recognition of hand gestures in videos by proposing a new technique, called the Gesture Manifold method (GM-method). This technique focuses purely on motion and aims to recognize gestures in videos without analyzing static frames. Analyzing the motion of gestures was made possible by two non-linear dimensionality reduction techniques for manifold learning: Isometric feature mapping (Isomap) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Four different approaches have been implemented in the preprocessing stage in order to successfully extract relevant features before the construction of manifolds. These approaches consist of: raw frames, binary difference-frames, change-dependent difference-frames and skin color frames. Two methods for matching manifolds, Fourier descriptors and Procrustes analysis, have been applied as well in combination with these approaches. For classification, the well-known k-nearest neighbour technique was implemented. A dataset was created using a standard webcam and five different persons. Five different gestures were designed, differing in movement, wrist rotation and finger extension.
A 5-fold cross-validation experiment was performed on the dataset, obtaining a classification percentage for each combination of non-linear dimensionality reduction technique, preprocessing approach and manifold matching method. The specific research questions will now be answered in order, followed by the problem statement and further conclusions.
The first approach, using raw frames as input without applying a dataset matching technique, required substantial extensions, as its classification percentage left much room for improvement. The binary difference-frames enhanced this first approach slightly, though the recognition rates were not sufficient to pass the evaluation criteria. However, it was possible to recognize the set of five gestures rather well with change-dependent difference-frames or skin color frames, when combined with the correct manifold learning techniques. The change-dependent difference-frames approach achieved slightly better results when recognizing 5 gestures, whereas the skin color frames approach achieved a higher recognition rate when recognizing 4 gestures. However, these differences were not significant, so it can be concluded that change-dependent difference-frames and skin color frames are both most effective in eliminating background noise and obtaining regions of interest, hence supporting the construction of clearly discernible manifolds.
In the manifold learning stage, the t-SNE method was unable to create quality manifolds that represent gestures correctly, due to the non-convexity of its cost function, as explained in Section 4.3. It can be concluded that although t-SNE excels at visualizing high-dimensional data in a low-dimensional space and is able to outperform most state-of-the-art dimensionality reduction techniques, it is not applicable when the focus is on matching manifolds of separate videos. The Isomap technique, however, has a convex cost function and is very suitable for producing clearly discernible manifolds of separate videos. It can be concluded that Isomap is the non-linear dimensionality reduction technique most effective for creating quality manifolds of separate videos.
Considering the classification percentages of the two dataset matching methods employed in the manifold learning phase, the results clearly show that the approaches using Fourier descriptors significantly outperform those using the Procrustes analysis. Thus, Fourier descriptors are much more effective in aligning manifolds for improved recognition rates.
The confusion tables revealed that the ‘paste’ gesture was misclassified most often in both best performing combinations, and was generally wrongly identified as a ‘click’ gesture. Considering that both gestures have similar starting and ending frames, it seems logical that these two gestures are occasionally confused with each other, although the algorithm is still able to classify a reasonable percentage. New experiments performed without the ‘paste’ gesture enabled the same two combinations of approaches to obtain excellent classification percentages. Afterwards, additional experiments on videos with more detailed backgrounds showed that the difference-frames approach is invariant to lighting conditions and backgrounds with multiple colored objects.
Considering the evaluation criteria, the preferred classification percentage was certainly achieved when recognizing 5 gestures, whereas excellent recognition rates were realised when classifying a set of 4 gestures. Thus, it can be concluded that with the GM-method, combining the optimal methods in each stage as specified in the previous conclusions, hand gestures in videos can be recognized very well.
5.2 Future research
The GM-method is able to identify the selected four/five gestures quite well, but additional testing is required to evaluate how well the approach performs on a larger set of gestures. For example, American Sign Language (ASL) contains a large set of gestures which could serve as a grand test set. Further research in this direction could eventually help ASL users to communicate remotely with each other.
The videos of the dataset currently contain solely the start and ending of the gesture. To achieve real-time recognition, additional algorithms are required to determine when gestures start and finish. However, this feat seems quite achievable when using the difference-frames approach.
Although the videos only contain the start and ending of the gesture, the gestures are not aligned in time, meaning there are differences in the speed of the movements. For better classification results, a technique such as dynamic time warping, which is able to align such sequences, could be applied. Other classification methods, such as Support Vector Machines or Neural Networks, could be applied as well in order to improve the recognition rate.
The skin color frames approach currently has trouble identifying gestures when background objects have the same color as human hands. Possible improvements for this approach include hand detection using contour signatures or similar methods. Combining the skin color frames approach with difference-frames might solve the complication as well, since difference-frames are invariant to non-moving background objects. However, for environments with moving objects other than the hand performing the gesture, additional research is required to determine which moving object is the hand. When it is possible to reliably recognize the hand under these circumstances, this motion-based approach can finally replace the keyboard and mouse in the promising Perceptual User Interface.
Bibliography
[1] J. Blackburn and E. Ribeiro. Human motion recognition using isomap
and dynamic time warping. In Workshop on Human Motion, pages
285–298, 2007.
[2] E. W. Dijkstra. A note on two problems in connexion with graphs.
Numerische Mathematik, 1:269–271, December 1959.
[3] S. Ge, Y. Yang, and T. Lee. Hand gesture recognition and tracking
based on distributed locally linear embedding. Image and Vision Com-
puting, pages 1607–1620, 2008.
[4] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances
in Neural Information Processing Systems 15, pages 833–840, 2003.
[5] J. A. Cook, I. Sutskever, A. Mnih, and G. E. Hinton. Visualizing similarity data with a mixture of maps. In 11th International Conference on Artificial Intelligence and Statistics (2), pages 67–74, 2007.
[6] Y. Jeong and R. J. Radke. Reslicing axially-sampled 3d shapes using
elliptic abstract fourier descriptors. Medical Image Analysis, pages 197–
206, 2007.
[7] F. Kuhl and C. Giardina. Elliptic Fourier features of a closed contour. Computer Graphics and Image Processing, 18:236–258, 1982.
[8] J. J. LaViola Jr. A survey of hand posture and gesture recognition
techniques and technology. Technical report, Department of Computer
Science, Brown University, 1999.
[9] C. Lee and Y. Xu. Online, interactive learning of gestures for hu-
man/robot interfaces. In IEEE International Conference on Robotics
and Automation, pages 2982–2987, 1996.
[10] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[11] S. Malassiotis, F. Tsalakanidou, N. Mavridis, V. Giagourta, N. Gram-
malidis, and M. G. Strintzis. A face and gesture recognition system
based on an active stereo sensor. In International Conference on Image
Processing 3, pages 955–958, 2001.
[12] C. J. Morris and D. S. Ebert. An experimental analysis of the eﬀec-
tiveness of features in chernoﬀ faces. In 28th Applied Imagery Pattern
Recognition Workshop, pages 12–17, 2000.
[13] V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation
of hand gestures for human-computer interaction: A review. IEEE
Transactions on Pattern Analysis and Machine Intelligence, pages 677–
695, 1997.
[14] P. Peixoto and J. Carreira. A natural hand gesture human computer in-
terface using contour signatures. Technical report, Institute of Systems
and Robotics, University of Coimbra, Portugal, 2005.
[15] R. Pless. Image spaces and video trajectories: Using isomap to explore
video sequences. In Ninth IEEE International Conference on Computer
Vision (ICCV03), pages 1433–1441, 2003.
[16] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by
locally linear embedding. Science, pages 2323–2326, 2000.
[17] A. Sandberg. Gesture recognition using neural networks. Master’s
thesis, Stockholm University, 1997.
[18] N. Sebe, M. S. Lew, and T. S. Huang, editors. Computer Vision
in Human-Computer Interaction, Lecture Notes in Computer Science,
2004.
[19] T. Starner and A. Pentland. Visual recognition of American Sign Language using hidden Markov models. In International Workshop on Automatic Face and Gesture Recognition, pages 189–194, 1995.
[20] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric
framework for nonlinear dimensionality reduction. Science, pages 2319–
2323, 2000.
[21] R. Watson. A survey of gesture recognition techniques. Technical re-
port, Department of Computer Science, Trinity College, Dublin, Ire-
land, 1993.
[22] Wikipedia. Procrustes analysis, http://en.wikipedia.org/wiki/Procrustes_analysis, 2007.
[23] T. G. Zimmerman and J. Lanier. A hand gesture interface device. ACM
SIGCHI/GI, pages 189–192, 1987.