
COMPUTATIONAL MODELING AND UTILIZATION OF ATTENTION,
SURPRISE AND ATTENTION GATING
by
Terrell Nathan Mundhenk
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2009
Copyright 2009 Terrell Nathan Mundhenk
Epigraph
“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living;
It's a way of looking at life through the wrong end of a telescope. Which is what I do, and
that enables you to laugh at life's realities.”
Dr Seuss
Dedication
For my parents Terry and Ann
Acknowledgements
This really is the trickiest part to write because I want to thank so many people for
so many things. First off I would like to thank my sister Amy (Zon) who thought enough
of me to whip out a copy of the Schizophrenia paper I wrote with Michael Arbib when
she visited that research neurologist at the Cleveland Clinic. I thought that was funny, but
it made me feel as if I was doing something interesting. Then there are my closest friends
Paul Gunton, Brant Heflin, Mike Olson and Tim Olson who exhibited confidence in my
ability to actually complete this silly thing. I would also like to thank Kate Svyatets for
sticking with me through my thesis and mood swings. Life is meaningless without friends
and loved ones, so I am in your debt.
I would also like to acknowledge the excellent scientists I worked closely with
over the years, without whom my research would not have been possible. Firstly this
includes Michael Arbib who has essentially been a co-thesis advisor to me. I can hardly
communicate the enormous number of things which I learned from him over the years.
His honesty and integrity were of great value, and I knew that when he said something, he
really meant it. Next, I would like to thank Kirstie Bellman and Chris Landauer at the
Aerospace Corporation. They gave me more of an idea of what real engineering is about
than just about anyone. I also want to give a great big thanks to Wolfgang Einhäuser my
co-author on several publications. I don’t think I’ve ever worked with anyone so totally
on the ball. I also need to mention Ernest Hall who was my mentor during my
undergraduate years and a source of encouragement during my graduate years. I don’t
think I would be in a research field if it wasn’t for him.
I also must extend my deepest gratitude to the many faculty members who have
provided excellent feedback and conversation over the years of my graduate education. I
cannot mention every teacher and mentor who touched my life over the past few years,
because there were so many. However I would like to extend special thanks to: Irving
Biederman, Christof Koch, Christoph von der Malsburg, Bartlett Mel and Stefan Schaal.
I would also like to acknowledge many of the students and post-doctoral
researchers I have collaborated with, or who were in general very helpful in assisting me with
my research through direct assistance or discussion. They are: Jeff Begley, James
‘Jimmy’ Bonaiuto, Mihail Bota, Vincent J. Chen, Aaron D'Souza, Nitin Dhavale, Lior
Elazary, Jacob Everist, Doug Garrett, Larry Kite, Hendrik Makaliwe, Salvador Marmol,
Thomas Moulard, Pradeep Natarajan, Vidhya Navalpakkam, Jan Peters, Rob Peters, Eric
Pichon, Jack Rininger, Christian Siagian, and Rand Voorhies. If I forgot to mention
anyone, I’m really quite sorry.
Lastly, but far from leastly, I would like to thank my Thesis Advisor Laurent Itti
for all his help, input and encouragement he has provided over the years. After my first
year at USC, I was getting kind of bored being out of the research game. At the time, I
was taking robotics from Stefan Schaal. I talked to him about who was doing interesting
research in computer vision and he suggested I talk to a promising new faculty member. I
took him up on his advice which turned out to be an excellent decision. iLab was new and
only had a few students back then; now it is so vibrant and full of life with so many
projects. I will surely miss Laurent and iLab and I am certain I will look back on these
days with great positive satisfaction.
Table of Contents
Epigraph ii
Dedication iii
Acknowledgements iv
List of Tables x
List of Figures xi
Abbreviations xv
Abstract xviii
Preface xix
About this thesis xix
Graduate works not included in this thesis xx
Other works of interest not included in this thesis xxi
Don’t read the whole thesis xxii
Chapter 1: A Brief Introduction to Vision and Attention 1
1.1 What Does our Brain Want to Look For? 5
1.2 How Does our Brain Search for What it Wants? 9
1.2.1 What’s a Feature? 9
1.2.2 How do we Integrate These Features? 12
1.2.3 Beyond the Basic Saliency Model 17
1.3 The Current State of Attention and Other Models 18
1.3.1 Top-Down Models 19
1.3.2 Other Contemporary Models of Saliency 20
1.3.3 The Surprise Model 22
Chapter 2: Distributed Biologically Based Real Time Tracking with Saliency
Using Vision Feature Analysis Toolkit (VFAT) 23
2.1.1 Vision, Tracking and Prior Information 23
2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors 26
2.1.4 The iRoom and Meta-prior Information 30
2.2 Saliency, Feature Classification and the Complex Tracker 31
2.2.1 Complex Feature Tracker Components 34
2.3 The Simple Feature Based Tracker 59
2.4 Linking the Simple and Complex Tracker 61
2.5 Results 64
2.6 Discussion 66
2.6.1 Noticing 66
2.6.2 Mixed Experts 66
2.6.3 System Limitations and Future Work 67
Chapter 3: Contour Integration and Visual Saliency 69
3.1 Computation 75
3.2 The model 77
3.2.1 Features 77
3.2.2 The Process 83
3.2.3 Kernel 87
3.2.4 Pseudo-Convolution 91
3.3 Experiments 97
3.3.1 Local element enhancement 98
3.3.2 Non-local Element Enhancement 103
3.3.3 Sensitivity to Non-contour Elements 112
3.3.4 Real World Image Testing 118
3.4 Discussion 122
3.4.1 Extending Dopamine to Temporal Contours via TD (dimensions) 125
3.4.2 Explaining Visual Neural Synchronization with Fast Plasticity 126
3.4.3 Contours + Junctions, Opening a New Dimension on Visual Cortex 127
3.4.4 Model Limitations 128
3.5 Conclusion 129
Chapter 4: Using an Automatic Computation of an Image’s Surprise to
Predict Performance of Observers on a Natural Image Detection Task 130
4.1.1 Overview of Attention and Target Detection 131
4.1.2 Surprise and Attention Capture 134
4.2 Methods 136
4.2.1 Surprise in Brief 136
4.2.2 Using Surprise to Extract Image Statistics from Sequences 139
4.3 Results 144
4.4 A Neural Network Model to Predict RSVP Performance 152
4.4.1 Data Collection 153
4.4.2 Surprise Analysis 154
4.4.3 Training Using a Neural Network 154
4.4.4 Validation and Results of the Neural Network Performance 156
4.5 Discussion 164
4.5.1 Patterns of the Two-Stage Model 165
4.5.2 Information Necessity, Attention Gating and Biological Relevance 169
4.5.3 Generalization of Results 173
4.5.4 Comparison with Previous RSVP Model Prediction Work 173
4.5.5 Network Performance 174
4.5.6 Applications of the Surprise System 175
4.6 Conclusion 176
Chapter 5: Modeling of Attentional Gating using Statistical Surprise 177
5.1 From Surprise to Attentional Gating 180
5.2 Methods 183
5.2.1 Paradigm 183
5.2.2 Computational Methods 184
5.3 Results 188
5.3.1 Relation of Results to Previous Studies Which Showed Causal Links
between Surprise and Target Detection 193
5.4 Discussion 196
5.4.1 Variability of the Attention Gate Size Fits within the Paradigm 196
5.4.2 The Attention Gate may Account for Some Split Attention Effects 198
5.4.3 Unifying Episodic Attention Gate Models with Saliency Maps 199
Chapter 6: A Comparison of Surprise Methods and Models Using the Metric of
Attention Gate (MAG) 201
6.1 The MAG Method for Comparison of Different Models 201
6.1.1 Fishers Linear Discriminant and Fitness 203
6.1.2 Data Sets Used 206
6.2 Comparison of Opponent Color Spaces using MAG 207
6.2.1 iLab RGBY 210
6.2.2 CIE Lab 211
6.2.3 iLab H2SV2 214
6.2.4 MAG Comparison of Color Spaces 214
6.3 Addition of Junction Feature Channels 216
6.4 Comparison of Different Statistical Models 217
6.5 Checking the Problem with Beta 219
6.5.1 Asymptotic Behavior of β 219
6.5.2 What Happens if We Fix the β Hyperparameter to a Constant Value? 221
6.5 Method Performance Conclusion 226
References 228
Appendices 245
Appendix A: Contour Integration Model Parameters 245
Appendix B: Mathematical Details on Surprise 246
Appendix C: Kullback-Liebler Divergences of Selected Probability
Distributions 253
C.1 Conceptual Notes on the KL Distance 253
C.2 KL of the Gaussian Probability Distribution 255
C.3 KL of the Gamma Probability Distribution 255
C.4 KL of the Joint Gamma-Gaussian or Gamma-Gamma Distribution 258
Appendix D: Junction Channel Computation and Source 262
D.1 Junction Channel Source Code 264
Appendix E: RGBY and CIE Lab Color Conversion 267
E.1 RGBY Color Conversion 267
E.2 CIE Lab Color Conversion 268
Appendix F: HSV Color Conversion Source 273
F.1 RGB to HSV Transformation 274
F.1.1 HSV Transformation C / C++ Code 275
F.2 HSV to RGB Transformation: 278
F.2.1 RGB Transformation C/C++ Code 279
Appendix G: H2SV Color Conversion Source 281
G.1 HSV to H2SV Transformation 282
G.1.1 HSV to H2SV1 Variant 282
G.1.2 HSV to H2SV2 Variant 283
G.2 H2SV to HSV Simple Transformation 284
G.2.1 H2SV1 to HSV Simple 284
G.2.2 H2SV2 to HSV Simple 284
G.3 H2SV to HSV Robust Transformation 285
G.3.1 General Computations: 285
G.3.2 C / C++ Code for Robust Transformation 286
Appendix H: Selected Figure Graphing Commands for Mathematica 288
List of Tables
Table 2.1: Variance accounted for in ICA/PCA. 50
Table 3.1: Table of probabilities of results at random. 107
Table 3.2: Types of features found salient by CINNIC. 119
Table 4.1: M-W feature significance per type. 145
Table 6.1: MAG scores for color spaces. 213
Table 6.2: MAG scores for junction filters. 215
Table 6.3: MAG scores for statistical models. 217
Table 6.4: MAG scores for different values of beta. 223
List of Figures
Figure 1.1: Examples of retinotopic maps of the visual cortex. 2
Figure 1.2: What does the brain find visually interesting? 4
Figure 1.3: Why the brain looks for so many types of features. 6
Figure 1.4: The increasing complexity of the visual system. 7
Figure 1.5: Examples of basic feature detectors. 11
Figure 1.6: Generations of feature based attention models. 13
Figure 1.7: Orientation features and Gabor pyramid example with Ashes. 15
Figure 1.8: Butterfly regions and contour integration example. 16
Figure 1.9: Examples of top-down models of attention. 19
Figure 2.1: Bayesian priors and Meta Priors spectrum. 26
Figure 2.2: From features to ICA to clustering. 32
Figure 2.3: The VFAT architecture graph. 35
Figure 2.4: General saliency model graph. 36
Figure 2.5: Junction detection from INVT features with ICA. 38
Figure 2.6: Feature clustering example shown with node climbing. 40
Figure 2.7: Examples of feature clustering on different data points. 42
Figure 2.8: NPclassify compared with K-means. 43
Figure 2.9: Example of similarity by statistical overlap. 45
Figure 2.10: Example of feature output following ICA/PCA. 51
Figure 2.11: ICA inversion and color features. 53
Figure 2.12: Example of image feature clustering. 54
Figure 2.13: NPclassify compared quantitatively with K-means. 57
Figure 2.14: Features clustered during tracking. 58
Figure 2.15: The simple feature tracker. 59
Figure 2.16: The complex tracker handing off to simple trackers. 62
Figure 2.17: Screen shot of the VFAT based tracker. 63
Figure 3.1: The Braun Make Snake contour. 70
Figure 3.2: The basics of contour alignment and processing. 78
Figure 3.3: Neuron priming diagram. 80
Figure 3.4: Neuron group suppression in theory. 82
Figure 3.5: The basics of the CINNIC alignment and processing. 84
Figure 3.6: Hypercolumns and pseudo-convolution. 91
Figure 3.7: Breakdown of the CINNIC process. 95
Figure 3.8: CINNIC multiple scales and averaging. 96
Figure 3.9: 2AFC simulation for the Polat Sagi display. 99
Figure 3.10: Fit of CINNIC to observer AM. 101
Figure 3.11: Interaction of element size and enhancement. 103
Figure 3.12: CINNIC working on Make Snake contours. 105
Figure 3.13: Performance of CINNIC on Make Snake contours. 106
Figure 3.14: The subjective perception of contours and element separation. 108
Figure 3.15: Accounting for performance of CINNIC with kernel size. 110
Figure 3.16: CINNIC sensitivity to junctions. 113
Figure 3.17: Explaining sensitivity of junctions by CINNIC. 115
Figure 3.18: CINNIC sensitivity to salient locations and face features. 120
Figure 3.19: CINNIC and fast plasticity. 127
Figure 4.1: Overview of the surprise system. 138
Figure 4.2: The surprise map over sequence frames. 141
Figure 4.3: Peaks of surprise seem predictive. 144
Figure 4.4: Mean surprise and visual features. 148
Figure 4.5: Standard deviation of surprise and visual features. 150
Figure 4.6: Spatial location of max surprise and visual features. 151
Figure 4.7: The surprise prediction system. 155
Figure 4.8: How surprise prediction was analyzed. 158
Figure 4.9: Performance of surprise prediction. 162
Figure 4.10: Theoretical aspects of surprise prediction. 171
Figure 5.1: Surprise peaks at flankers for hard targets. 179
Figure 5.2: Attention gating and the contents of working memory. 180
Figure 5.3: From RSVP to attention gate computation. 182
Figure 5.4: Computation of the attention gate. 186
Figure 5.5: Computing the overlap ratio. 189
Figure 5.6: Surprise attention gate quantitative results. 191
Figure 5.7: Subjective results on Transportation Targets. 192
Figure 5.8: Subjective results on Animal Targets. 193
Figure 5.9: Explaining past results for Easy-to-Hard. 195
Figure 5.10: Attention gating and detecting multiple targets. 199
Figure 6.1: Which of the two models is better or worse? 202
Figure 6.2: Pretty fisher information graph 205
Figure 6.3: The MAG, an overview. 207
Figure 6.4: A general color space overview. 208
Figure 6.5: RGBY Color space example. 210
Figure 6.6: CIE Lab color space example. 211
Figure 6.7: H2SV2 color space example. 212
Figure 6.8: MAG and color space results. 213
Figure 6.9: MAG and junction filter results. 215
Figure 6.10: MAG and statistical model results. 217
Figure 6.11: The asymptotic behavior of beta. 220
Figure 6.12: MAG performance for different values of beta. 223
Figure B.1: Different views on the Gamma PDF. 247
Figure B.2: Surprise in Wows! 248
Figure B.3: The DoG Filter. 251
Figure C.1: From a PDF to the integrated KL region. 254
Figure C.2: The Joint gamma-gamma KL. 257
Figure D.1: The junction filter. 262
Figure E.1: CIE 1931 XYZ color space. 269
Figure E.2: Map of the CIE Lab gamut space. 270
Figure F.1: HSV color space. 273
Figure G.1: H2SV color space. 281
Abbreviations
AI Artificial Intelligence
AIP Anterior Interparietal Sulcus
AMD Advanced Micro Devices
BPNN Back Propagation Neural Network
CIE International Commission on Illumination
CINNIC Carefully Implemented Neural Network for
Integrating Contours
CRT Cathode Ray Tube (monitor)
DoG Difference of Gaussian
EPSP Excitatory Post Synaptic Potential
EQ Equation
ERF Error Function
ERFC Complementary Error Function
fMRI Functional (Nuclear) Magnetic Resonance Imaging
FS Fast Spiking
GABA Gamma Aminobutyric Acid
GB Gigabyte (1 billion bytes)
GCC GNU C++ Compiler
GIMP GNU Image Manipulation Program
GNU GNU's Not Unix [sic] (An open source, free
software consortium)
GPL GNU General Public License
HSV Hue/Saturation/Value
H2SV HSV Variant with two hue components
H2SV2 H2SV with Red/Green Blue/Yellow opponents
Hz Hertz (cycles per second)
ICA Independent Component Analysis
INVT iLab Neuromorphic Vision Toolkit
IPSP Inhibitory Post Synaptic Potential
IT Inferior Temporal Cortex
KL Kullback-Liebler Divergence (sometimes called the
KL distance)
Lab CIE Lab Color (Luminance with two
opponents, a Red/Green b Blue/Yellow)
MAG Metric of Attention Gate
MHz Megahertz (1,000,000 cycles per second)
ms Milliseconds (1/1000 of a second)
O Worst Case Asymptotic Complexity (called
the big “O” notation)
OpenCV Open Computer Vision (Intel Toolkit)
PCA Principal Component Analysis
PDF Probability Distribution Function
PFC Pre-Frontal Cortex
POMDP Partially Observable Markov Decision Process
RAM Random Access Memory
RGB Red, Green and Blue Color
RGBY Red/Green and Blue/Yellow Color
RMSE Root Mean Squared Error
RSVP Rapid Serial Vision Presentation
SMA Supplementary Motor Area
SQRT Square Root
T Terrell
TD Temporal Difference
V1 Primary Visual Cortex
V2 – V5 Regions of Extrastriate Cortex
VFAT Vision Feature Analysis Toolkit
WTA Winner Take All
Abstract
What draws in human attention and can we create computational models of it
which work the same way? Here we explore this question with several attentional models
and applications of them. They are each designed to address a missing fundamental
function of attention from the original saliency model designed by Itti and Koch. These
include temporal based attention and attention from non-classical feature interactions.
Additionally, attention is utilized in an applied setting for the purposes of video tracking.
Attention for non-classical feature interactions is handled by a model called CINNIC. It
faithfully implements a model of contour integration in visual cortex. It is able to
integrate illusory contours of unconnected elements such that the contours “pop out” as
they should, and it matches the performance of human observers.
Temporal attention is discussed in the context of an implementation of, and extensions to, a
model of surprise. We show that surprise predicts subject performance well on natural
image Rapid Serial Vision Presentation (RSVP) tasks and gives us a good idea of how an
attention gate works in the human visual cortex. The attention gate derived from surprise
also gives us a good idea of how visual information is passed to further processing in
later stages of the human brain. It is also discussed how to extend the model of surprise
using a Metric of Attention Gating (MAG) as a baseline for model performance. This
allows us to find different model components and parameters which better explain the
attentional blink in RSVP.
Preface
About this thesis
This thesis is about the computational modeling of visual attention and surprise.
The aspects that will be covered in this work include:
• Utilization of the computation of attention in engineering.
• Extensions to the computational model of attention and surprise.
• Explaining human visual attention and cognition from simulation using
computational models.
This work is integrative and based on the philosophy that computer vision is aided
by a better understanding of the human brain and its already developed exquisite
mechanisms for dealing with the visual world as we know it. At the same time,
development of biologically inspired computer vision techniques, when done correctly,
yields insight into the theoretical workings of the human brain. Thus, the integration of
engineering, neuroscience and cognitive science gives rise to useful synergy.
The second chapter covers the utilization of saliency as an engineering topic. This
is an example of applying what we have learned from the human brain towards an
engineering goal pursued with real world applications in mind. It is somewhat more
applied and as a result, many components are not biologically motivated. The reader
should keep in mind that project goals placed constraints on what can be done. In this
case, a real time system able to process images very quickly was needed. Additionally,
the project, as is typical for engineering endeavors, required “deliverables”.
Chapters three and six cover methods for extending or changing the way in which
surprise is computed. In the case of the former, a model of contour integration is created
and examined. This allowed the creation of an extension to the basic saliency model for
non-local interactions. Its primary contribution however turned out to be gainful
knowledge of the human visual mechanisms involved.
The fourth and fifth chapters deal with temporal dimensions of attention using
surprise. The goals are to test and extend the model to see if predictions can be made of
observer performance. Thus, it is suggested that a better fit model, which is improved in
its ability to predict human performance, is closer to the actual mechanisms which the
human brain uses. This also has reciprocal engineering applications since it can be used
to help determine what humans will attend to in a dynamic scene.
Graduate works not included in this thesis
I have tried to keep all work included in this document constrained to the topic of
visual attention and to work with salient results. As such, much of the work I have done
in pursuit of my doctorate is not included. These works include, but are not limited to (in
chronological order):
• The Beobot Project (Mundhenk, Ackerman, Chung, Dhavale, Hudson,
Hirata, Pichon, Shi, Tsui & Itti, 2003a)
• Schizophrenia and the Mirror Neuron System (Arbib & Mundhenk, 2005)
• Estimation of missing data in acoustic samples (Mundhenk, 2005)
• Surprise Reduction and Control (Mundhenk & Itti, 2006)
• Three Dimensional Saliency (Mundhenk & Itti, 2007)
Of particular interest is the work on Schizophrenia and the Mirror Neuron System, which
has been cited 45 times according to Google Scholar. Also of interest is the Beobot
project paper which was the most downloaded paper from iLab for three years straight,
and it is still in the top five downloads to this day.
Other works of interest not included in this thesis
Also not included is the large amount of educational materials created and posted
online. These include:
• http://www.cool-ai.com – AI homework, projects and lecture notes for use in
AI courses.
• Wikipedia and Wikiversity – contributions including:
o http://en.wikiversity.org/wiki/Learning_and_Neural_Networks - Created
self-guided teaching page on Neural Networks.
o http://en.wikipedia.org/wiki/Cicadidae - Contributed Wikipedia featured
picture of the day and written content.
o http://en.wikipedia.org/wiki/Gamma_distribution - contributed graphics
and corrections.
o http://en.wikipedia.org/wiki/Kullback-Leibler_divergence - contributed
graphics and corrections.
o http://en.wikipedia.org/wiki/Methods_of_computing_square_roots -
Added algorithms and analysis.
• http://www.cinnic.org/CINNIC-primer.htm – Contour Integration Primer.
• http://www.nerd-cam.com/how-to/ - Detailed Instructions on how to build your
own robotic camera.
Don’t read the whole thesis
This thesis uses the standard “stapled papers” framework. While each chapter has
been integrated into a coherent work, they each will stand on their own. As a result, the
reader is advised to get what they want and get out. That is, go ahead and read a chapter
which interests you, but don’t bother to read other parts. However, there tends to be more
information here than in the authors papers cited. As such, this thesis may be of use in
getting some of the model details not covered in the authors published materials due to
space constraints in peer reviewed journals.
Have fun
T. Nathan Mundhenk
Chapter 1: A Brief Introduction to Vision and Attention
You got to work today without running over any pedestrians. How did you do
that? To be sure this is a good thing. You can pick up items without even thinking about
it; you can thumb through a magazine until you get to a favorite advertisement; you can
tell a shoe from a phone and you can tell if that giant grizzly bear is in fact gunning for
your ice cream cone. You do all sorts of things like this every day and frequently they
seem utterly simple. To be certain, sometimes you cannot find your keys to save your
life, but even while searching for them, you don’t bang into the furniture in your
apartment, at least not too much. How did you do this? I ask because, like just about every
person on earth, I’m not totally sure.
OK, true, you’ll be glad to know I have some ideas. However, the pages that
follow will only scratch the surface of how human beings such as ourselves view the
world. To this day, much of human vision still remains a mystery. However, many things
about human vision are well established. For instance, we do in fact see from our eyes
and the information from them does travel to our brain. The brain itself is where what we
see is processed, and it turns out that its job is not merely to cool our blood as Aristotle
believed it to be.
However, there is a place between seeing and understanding which resides within
the human brain itself, and how it takes the items in the world and places them into your
mind is a complicated story. In this work, we will focus on an important part of this
process, the notion of selection and attention. The idea as it were is that not everything
presented to our eyes makes its way from the retina in the eyes to the seat of
consciousness. Instead, it seems that most of what we perceive is just a fraction of what
we could. The brain is picky, and it only selects some things to present to us, but many
other things simply fade from being.
Figure 1.1: Retinotopy has been demonstrated repeatedly over the years in the visual cortex. Thus, its
existence is well founded. An early example is given by Gordon Holmes who studied brain injuries in
soldiers after the First World War (Holmes, 1917, Holmes, 1945) and traced visual deficits to specific injury
sites in visual cortex. Then with primate (Macaque) cortex experiments using Deoxyglucose (Tootell,
Silverman, Switkes & De Valois, 1982) it was shown that a pattern activated a region of visual cortex with
the same shape. However, this method was limited due to the fact that the animal had to be sacrificed
immediately after viewing the pattern in order to reveal and photograph the pattern on the cortex. Later in
2003, with fMRI using sophisticated moving target displays, (Dougherty, Koch, Brewer, Fischer,
Modersitzki & Wandell, 2003) regions in the human brain were shown to correspond to locations in the
visual cortex in much the same way. However, fMRI allows observation in healthy human volunteers,
which is a distinct advantage since more advanced experiments such as those involving motion can be
conducted.
What then does the brain do to select the things it wants to see? One could
suppose that a magic elf sits in a black box in the brain with a black magic marker
looking at photos of the world sent to it by the eyes. The elf inspects each photo and
decides if it’s something it believes you should see. Otherwise it marks it with an ‘x’
which means that another magic elf should throw the image away. The idea of magic
elves as a brain process is intriguing; however, the evidence does not bear it out. Then
again, the brain is in some sense a black box. Thus, while we do not think that magic
elves are the basis for cognition, we still must make inferences about the brain’s basic
workings from a variety of frequently indirect evidence.
For instance, we can probe the brain of other primates. In figure 1.1 it is shown
that we know that the visual cortex receives information from the eyes in retinotopic
coordinates. We know this from experiments on primates where briefly flashed visual
patterns caused a similar pattern to form on the visual cortex (Inouye, 1909, Holmes,
1917, Holmes, 1945, Tootell et al., 1982). Does the same thing happen in the human
brain? The general consensus is yes, many pieces of visual information from the eye line
up on the back of the brain somewhat like a movie projecting onto a screen. Newer
studies with functional magnetic resonance imaging (fMRI) on humans reinforce this
idea (Horton & Hoyt, 1991, Hadjikhani, Liu, Dale, Cavanagh & Tootell, 1998,
Dougherty et al., 2003, Whitney, Goltz, Thomas, Gati, Menon & Goodale, 2003). Still,
the evidence is indirect. No one has seen the movie on the back of the brain, but
fortunately, the evidence is satisfying.
Retinotopy in the visual cortex is an example of something which is well founded
even if the evidence is sometimes indirect. However, do we have such a good notion
about how the brain selects what it wants to see from input coming from the eyes? It
turns out sort of, but not completely. However, this is not without good reason. What
captures one’s attention is quite complex (Shiffrin & Schneider, 1977, Treisman &
Gormican, 1988, Mack & Rock, 1998). So for instance, things which are colorful tend to
get through the brain much more easily than things which are dull. This is, for instance, why
stop signs are red and not gray. This is also why poisonous snakes or monarch butterflies
(which are also poisonous) have such vivid colors. Interestingly, it is not just the colors
which attract our attention; it is how the colors interact. For instance, something which is
blue attracts more attention when it is next to something yellow while something red
tends to get more attention when it is next to something green. So it’s not just the color of
something that makes it more salient, it’s how the colors interact as opponents.
Figure 1.2: What does the brain find visually interesting? There are many things (from left to right). Good
continuation of objects which form a larger percept is interesting. Conspicuous colors, particularly the
opponents red/green and blue/yellow stand out. Objects with unique features and special asymmetries
(Treisman & Souther, 1985) compared with surrounding objects can stand out. Also motion is a very
important cue.
Ok, seems pretty simple, but that was just one piece of a rather gigantic puzzle.
Just a sampling of what is visually interesting is shown in figure 1.2. It turns out that
edges, bright patches, things which are moving and things which are lined up like cars in
traffic and … well many things all can attract your attention and control what it is that
your brain deems interesting. Still, it gets even more complex: your brain itself can decide
to change the game and shift the priority on certain visual features. As an example, if you
are looking for a red checker, your brain could decide to turn up the juice on the red color
channel. That is, your brain can from the top-down change the importance of visual items
making some things which were less interesting more interesting and vice versa (Shiffrin
& Schneider, 1977, Wolfe, 1994a, Navalpakkam & Itti, 2007, Olivers & Meeter, 2008).
So just up front, we can see that the notion of visual attention and what gets
from the retina in the eyes to the seat of thought is quite complex. It involves a great deal
of things which interact in rather complex and puzzling ways. However, as mentioned we
do know many things, and we are discovering new properties every day. Hopefully this
work will help to illuminate some of the processes by which the visual world can pass
through the brain into the realm of thought.
1.1 What Does our Brain Want to Look For?
Imagine that the world was not in color. Further, imagine that all you could see
was the outlines of the stuff that makes up the world. You would still need to move
around without tipping over chairs and be able to eat and recognize food. What then
would draw your attention? You can still tell how to identify many things. After all, it is
the world of lines which makes up the Sunday comic strips. You might not be able to tell
some things apart which you could back in our colorful world, but for the most part you
could tell a table from a chair or an apple from a snake. In this case, what would your
brain look for?
Figure 1.3: Why does the brain look for so many different types of features? It depends on what it needs to
find. Some images are defined by lines, others by colors and some by the arrangement of objects. All of the
images shown are interpretable even though typical natural scene information of one type or another seems
lacking. Shown from Top Left: Atari’s Superman, Picasso’s La Femme Qui Pleure, Gary Larson’s Far Side;
Bottom Left: Liquid Television’s Stick Figure Theater, Van Gogh’s Starry Night Over the Rhone.
In basic terms, what your brain wants to look for is information. Figure 1.3 shows
several different scenes which one can interpret even though the information is presented
very differently with typical information components such as color, lines or texture
missing. As will be reviewed later, images are comprised of features, which are the
essential bits of information for an image. These can include all of the above as well as
more complex features such as junctions and motion. Not all features are necessary for
object identification. A typical example is that people were able to enjoy television before
it was in color.
Figure 1.4: (Left) Features gain increasing complexity and their responses become more and more task
dependent. Additionally, visual information is sent down multiple pathways for different kinds of
processing (Fagg & Arbib, 1998). Here the task of grasping a mug will prime features related to a mug top-down.
These features in turn will be processed in different ways depending on whether we are trying to
identify the mug (Ventral: What) or if we are trying to understand its affordances (Dorsal: How) (Ingle,
Schneider, Trevarthen & Held, 1967). How the brain splits visual information in this way and then
reintegrates it, is still not completely understood.1 (Right) The connection diagram of Felleman and Van
Essen (Felleman & Van Essen, 1991) of the primate visual cortex demonstrates that elegant models such as
the one by Fagg and Arbib still only scratch the surface of the full complexity of the workings of the brain.
In addition to the essential bits of an image which are important, what the brain
wants to see is also based on the task at hand. Figure 1.4 illustrates a model for the task
of grasping an object (Fagg & Arbib, 1998). Initially the object to be grasped must be
spotted. If a person has some idea of what they are looking for, then they can attempt to
try and focus their attention towards something that matches the expected features of the
object. For instance, if the object to be grasped is a red mug, then the initial search for it
should bias one to look for red and round things. Such a bias becomes even more
1 This is a reconceptualization of the original Fagg & Arbib figure which appears in:
[44] Fellous, J.-M., & Arbib, M.A. (2005). Who Needs Emotions? The Brain Meets the Robot. Oxford:
Oxford University Press.
important in a cluttered scene where many simple salient items may be a distraction.
Otherwise, finding a red mug in a plain white room would be more simple.
Once the object has been spotted, appropriate features must be further extracted
such as geometric landmarks (Biederman, 1987, Biederman & Cooper, 1991). So the
brain will need to find essential characteristics of the object for the task. In this case, we
want to grasp or pick up the object. If a portrait of Weird Al Yankovic is painted on the
side of the mug, it might grab our attention, but it is unimportant for the task of acquiring
the mug. Instead, we should ignore the portrait and just scan the geometry. The task
might be entirely different if we had another action we wanted to execute. For instance, if
someone asks us whose face is on the mug, we would want to scan for face like features
and perhaps ignore the geometric properties completely.
In the mug example, we can imagine that many other factors might come into
play. For instance the scene might change unexpectedly. As an example, our clumsy
relative might have knocked over the mug. This sudden change in the scene would come
as a surprise and should initiate a change in attention priorities. If the coffee is flowing
towards my notebook computer I should notice that as soon as possible. Then I should
perhaps cancel my grasping action and search for paper towels or maybe make a grasp
for my computer.
The brain also sometimes has very little choice in what it looks for. Some things
are highly salient such as a stop sign or an attractive person. It can be hard to override the
innate bottom-up search system at times. Thus, many things are attended to fairly quickly
and automatically. This is a rather important trait, a rock hurling towards you at great
speed demands your attention more than a cup of coffee. As such, we can see that what
the brain wants to see also depends on automatic bottom-up systems which can preempt
current task demands.
1.2 How Does our Brain Search for What it Wants?
1.2.1 What’s a Feature?
What the brain wants to see is based on what is useful for it to see. Early on, after
the invention of photography in the 19th century, many artists began to rethink what it
was that they were doing. Up until then, artists created the essence of photographs with a
paint brush, but since a machine could do the same thing faster and cheaper, direct
photographic style artistry seemed like it would become archaic. This helped to bring
about the Impressionist style of art. What is notable to our discussion is that artists began
to experiment with imagery where fundamental features of a painting could be altered,
but the scene could still be interpreted. As structure and form of art was changed and
experimented with, it became more obvious that the brain did not need a direct
photograph of a scene in order to understand it. Instead, it merely needed some form of
structure which resembled the original scene.
Partially as a result of this new way of looking at the world, early 20th century
cognitive scientists began to think about how objects and components of an image could
be linked together to create something which the brain could understand. Both
Structuralists such as William James (James, 1890) and in particular Gestalt
psychologists such as Max Wertheimer (Wertheimer, 1923) and Kurt Koffka (Koffka,
1935) began to think about how the brain can take in parts of a scene and assemble them
into something the brain understands. They believed that perception was a sum of the
parts, but at the time they lacked the scientific abilities to prove their ideas.
That the visual world was composed of parts which the brain assembles had been
proposed. However, what these parts looked like or what form they took was far from
certain. Several theories came forward over the years to refine what kind of parts the
brain uses to create the whole. A popular term for the elementary parts of an image was
features. Several scientists in the 1950’s such as Gibson (Gibson, 1950), Barlow (Barlow,
1961) and Attneave (Attneave, 1954) began to note that prior information about shapes,
line and textures could be collected and used to interpret abstracted scenes statistically. In
particular, Fred Attneave proposed that much of the visual world is redundant and
unnecessary for the task of recognition. A cat for instance could be represented as points
(or perhaps better as junctions) which are connected by the brain to form the perception
of a cat. Under this assumption, a large mass of visual information presented to the
retina, for instance all the parts of the image which are not junctions, is extraneous.
Partially as a result of such assertions, several theories were put forward claiming that
there should be a bottleneck in attention (Broadbent, 1958, Deutsch & Deutsch, 1963).
As such, the picture of the visual world was still hazy, but several theories were now
giving an idea of how the brain sees the world and what it wants to find. First, the brain
compiles images from parts to create a whole. Second, features of an image as simple as
points, lines, textures or junctions scattered about a scene may be sufficient in order for
the brain to understand an image, but there may be limits on how much the brain can
process at one time. However, several questions remained. First, what kind of features is
the brain looking for, and second, how does the brain look for and process these features,
keeping in mind that it has some limitations on capacity?
Figure 1.5: (Left) Early visual processing by the brain looks for simple features. For instance the retina
begins by sorting out color opponents such as red/green and blue/yellow (Kuffler, 1953, Meyer, 1976).
While the understanding of the center surround mechanism is somewhat recent, knowledge of the
arrangement of color opponents is very old and its theory can be traced at least as far back as the German
physiologist Ewald Hering in 1872 (Hering, 1872) but was first described physiologically in the goldfish
(Daw, 1967). We can simulate these mechanisms using the filters shown. Here we see DoG (Difference of
Gaussian) filters which give the On Center / Off Surround response (von Békésy, 1967, Henn & Grüsser,
1968) for colors (Luschow & Nothdurft, 1993). (Right) Later, the visual cortex utilizes hyper columns
(Hubel & Wiesel, 1974) to find lines in an image. We can use wavelets like the one on the right to give a
response to lines in an image (Leventhal, 1991). The type of wavelets used are typically called Gabor
wavelets in honor of the Hungarian engineer Gábor Dénes (Dennis Gabor). (Bottom) The bottom row
shows a cross section of the filters on the top.
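To make the caption concrete, the following is a minimal sketch of how such a center surround filter can be generated. It is written for illustration only and is not the INVT implementation; the function name and the sigma values one would pass in are arbitrary.

#include <cmath>
#include <vector>

// Build an on-center / off-surround DoG kernel. Sigma values are illustrative.
std::vector<std::vector<double>> makeDoGKernel(int radius,
                                               double sigmaCenter,
                                               double sigmaSurround)
{
  const double PI = 3.14159265358979323846;
  const int size = 2 * radius + 1;
  std::vector<std::vector<double>> k(size, std::vector<double>(size, 0.0));
  const double cNorm = 1.0 / (2.0 * PI * sigmaCenter   * sigmaCenter);
  const double sNorm = 1.0 / (2.0 * PI * sigmaSurround * sigmaSurround);
  for (int y = -radius; y <= radius; ++y)
    for (int x = -radius; x <= radius; ++x)
    {
      const double r2 = double(x * x + y * y);
      const double center   = cNorm * std::exp(-r2 / (2.0 * sigmaCenter   * sigmaCenter));
      const double surround = sNorm * std::exp(-r2 / (2.0 * sigmaSurround * sigmaSurround));
      // excitatory Gaussian center minus the broader inhibitory surround
      k[y + radius][x + radius] = center - surround;
    }
  return k;
}

Convolving a red/green opponent image (for example, the red channel minus the green channel) with this kernel gives a strong response wherever a red patch sits on a green surround, and vice versa.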
The answers to these questions began to congeal with the development of
improved psychometric instrumentation in the 1960s that could better time and control
the reaction of human subjects with a wide variety of stimuli. [For instance see
(Sperling, 1960, Raab, 1963, Sperling, 1965, Weisstein & Haber, 1965)]. This was
accompanied by improved psychophysical instrumentation capable of direct
measurement of neural activity in animals [For instance (Daw, 1968, Henn & Grüsser,
1968)]. By the 1970’s combined with the seminal work by David Hubel and Torsten
Wiesel (Hubel & Wiesel, 1977) we were starting to get a pretty good idea of what kind
of elementary features the brain is looking for. In figure 1.5 we see some of the features
which we knew the brain to be sensitive to by the mid 1970’s. The brain has literal
detectors for lines and color opponents such as red/green and blue/yellow. It should be
noted however, that this is still the beginning of the story. We knew that there was a set
of simple features which the visual cortex would pick up on, but there was no idea how
these features could be assembled into larger objects. Additionally, were there more
features or was this the full basis set?
1.2.2 How do we Integrate These Features?
By the 1970’s two important concepts were beginning to emerge. One was the
notion of focused attention. That is, if Attneave and his contemporaries are correct, the
brain might be wise to only spend time processing parts of a scene and not the whole
thing. Second, features such as lines and colors integrate and bind in the brain. For
instance, it had been known since the 1930’s that the brain can bind colors and words.
John Stroop (Stroop, 1935) showed that flashing a word such as “blue” but coloring it
red tended to trip up and slow down observers when they were asked to name the color. Would such a
mechanism also apply at the level of feature integration?
Figure 1.6: Three generations of models of feature based attention are shown in succession. Starting with
Treisman, Gelade & Gormican (Treisman & Gelade, 1980, Treisman & Gormican, 1988)2 it was
hypothesized that the way visual features such as lines and colors integrate in parallel controls the serial
components of attention. This model itself is a refinement of earlier theories of attention, for instance
Shiffrin and Schneider’s theory of automatic and controlled attention (Shiffrin & Schneider, 1977) and the
pre-attentive and focal attention model of Neisser (Neisser, 1967). Later Koch and Ullman (Koch &
Ullman, 1985) expanded this with the notion of having a saliency map which controls the spotlight of
attention with a winner-take-all network. Following this, it was made into a fully functional computational
model by Itti and Koch (Itti & Koch, 2001b).
Several theoretical constructs were advanced and led to increasing understanding
on the question of attention (Figure 1.6). It was discovered that attention seems to be
focal and that only parts of an image actually reach what many people would call
consciousness. In 1967, this hypothesis was put forward by Ulric Neisser (Neisser, 1967)
who suggested that there was a pre-attentive phase to visual processing when features
were gathered together in parallel, but that later the features combined and were inspected
serially by focal attention. This was further expanded by Richard Shiffrin and Walter
Schneider (Shiffrin & Schneider, 1977) who saw a second dimension to attention. They
suggested that some parts of attention are automatic and some parts are controlled. That
2 This drawing is from Treisman and Gormican 1988. It is based on the feature integration theory given in
Treisman and Gelade 1980. However, Treisman and Souther 1985 gives a very similar figure.
is, some features in an image grab our attention automatically and almost reflexively.
However, we are also consciously able to control some things which we attend to. This is
what is now thought of in broader terms as bottom-up and top-down attention.
In 1980, Anne Treisman and Gerry Gelade further refined these ideas into a
Feature Integration theory of attention (Treisman & Gelade, 1980). Their idea was that
the parallel computation of Neisser could be split into different features which could be
processed separately in the pre-attentive stage and then brought together. Thus, the brain
would compute its interest in colors, lines and intensities at the same time and that it is
the sum integration of different features which determines the locus of attention. That is,
attention is driven simultaneously be each type of feature, but the conjunction or
independent dominance of a feature can draw in attention. However, the question was left
open as to how the features could combine to create a master map of attention. A possible
answer was given by Christof Koch and Shimon Ullman (Koch & Ullman, 1985) who
gave the idea that the brain maintained a saliency map for the visual world and that a max
selector process (Didday, 1976, Amari & Arbib, 1977) would refine the saliency map
so that only a single location in the visual field would stick out. This allowed for many
things in the world to be salient at the same time, but suggested that the most salient item
of all is that one which the brain will attend to.
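As a rough illustration of the Koch and Ullman idea, the sketch below (an assumed data layout, not their model nor the later iLab code) sums a set of already normalized feature maps into a single saliency map and returns the single winning location. A biologically plausible max selector network would instead converge on the winner through iterative lateral inhibition, but the location it settles on is meant to be the same.

#include <cstddef>
#include <vector>

struct Point { std::size_t x, y; };

// Sum pre-normalized feature maps into one saliency map and take the winner.
Point winnerTakeAll(const std::vector<std::vector<std::vector<double>>>& featureMaps)
{
  const std::size_t h = featureMaps.front().size();
  const std::size_t w = featureMaps.front().front().size();
  std::vector<std::vector<double>> saliency(h, std::vector<double>(w, 0.0));

  // linear combination of the feature maps into one master saliency map
  for (const auto& map : featureMaps)
    for (std::size_t y = 0; y < h; ++y)
      for (std::size_t x = 0; x < w; ++x)
        saliency[y][x] += map[y][x];

  // winner-take-all: only the single strongest location is returned
  Point best = {0, 0};
  for (std::size_t y = 0; y < h; ++y)
    for (std::size_t x = 0; x < w; ++x)
      if (saliency[y][x] > saliency[best.y][best.x])
        best = {x, y};
  return best;
}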
The theories of attention put forward by Treisman et al as well as Koch and
Ullman gained further support over the next decade due to a variety of experimental
results [For examples see (Nothdurft, 1991b, Nothdurft, 1991a, Nothdurft, 1992,
Luschow & Nothdurft, 1993)]. In 1998 Laurent Itti, Christof Koch and Ernst Niebur
further refined the model of Koch and Ullman and created a comprehensive
computational model that allowed direct testing of it (Itti, Koch & Niebur, 1998). It also
included a comprehensive set of feature detectors as well as a Gaussian/Laplacian
pyramid to detect features at many different scales (Figure 1.7).
Figure 1.7: Gabor wavelet filters give a response to lines in an image. One way to do this is to create four
or more wavelet filters each with its own directional orientation (Itti et al., 1998). On the left this can be
seen as filters sensitive to lines at 0, 45, 90 and 135 degrees. On the right is an image which has been
convolved by the filters at 0 and 90 degrees and the lines that were extracted by the filters. Since lines have
different sizes we can convolve each image at a different scale to increase our chances of discovering lines
of different widths (Tanimoto & Pavlidis, 1975, Burt & Adelson, 1983, Greenspan, Belongie, Goodman,
Perona, Rakshit & Anderson, 1994)3.
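A minimal sketch of one such orientation filter is given here; the parameter values are illustrative rather than the toolkit defaults. Building this kernel at 0, 45, 90 and 135 degrees and convolving every level of a Gaussian pyramid with each one yields the multi-scale orientation maps described in the caption.

#include <cmath>
#include <vector>

// Gabor kernel tuned to one orientation; parameters are illustrative.
std::vector<std::vector<double>> makeGaborKernel(int radius, double thetaDeg,
                                                 double wavelength, double sigma)
{
  const double PI = 3.14159265358979323846;
  const double theta = thetaDeg * PI / 180.0;
  const int size = 2 * radius + 1;
  std::vector<std::vector<double>> k(size, std::vector<double>(size, 0.0));
  for (int y = -radius; y <= radius; ++y)
    for (int x = -radius; x <= radius; ++x)
    {
      // rotate coordinates into the filter's preferred orientation
      const double xp =  x * std::cos(theta) + y * std::sin(theta);
      const double yp = -x * std::sin(theta) + y * std::cos(theta);
      const double envelope = std::exp(-(xp * xp + yp * yp) / (2.0 * sigma * sigma));
      const double carrier  = std::cos(2.0 * PI * xp / wavelength);
      k[y + radius][x + radius] = envelope * carrier;
    }
  return k;
}
// e.g. makeGaborKernel(8,   0.0, 7.0, 3.0), makeGaborKernel(8,  45.0, 7.0, 3.0),
//      makeGaborKernel(8,  90.0, 7.0, 3.0), makeGaborKernel(8, 135.0, 7.0, 3.0)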
The essential gain was that the computer could be treated like a brain in a box. If the
model of Koch and Ullman was correct, then a comprehensive computational model
should have parity with the behavior of humans. Initial results showed that the saliency
3 The cat’s name is Ashes.
Figure 1.8: (Top Row) Features that the brain is looking for get increasingly complex. This happens
frequently when simpler features are combined to create new ones (Field, Hayes & Hess, 1993, Kovács &
Julesz, 1993, Polat & Sagi, 1994, Gilbert, Das, Ito, Kapadia & Westheimer, 1996, Li, 1998, Mundhenk &
Itti, 2005). For instance, line fragments which Gabor filters pick up on can then be connected in a
corresponding zone which completes contours. The butterfly pattern on the left will complete a contour
when line fragments lie in the green zone and are aligned. This can be seen on the right where three co-linearly
aligned fragments enhance each other to give a larger response. The graph is somewhat crude, but
the point is that the more elements that are aligned, the stronger the response. (Bottom Row) The elements
aligned into a circle on the left are much more salient than random elements (Kovács & Julesz, 1993,
Braun, 1999). They should produce an activation pattern like the one on the right (Mundhenk & Itti, 2003,
Mundhenk & Itti, 2005). This is discussed at length in chapter 3.
model behaved in a manner that was expected (Itti & Koch, 2001b). The computational
saliency model was able to detect many odd-man-out features, search asymmetries and
conditions for pop-out that would be expected of human observers. Additionally, the
model could be augmented to include top-down attentional effects (Itti, 2000) by
adjusting feature weights in a manner similar to the mechanism proposed 25 years
earlier for directed attention by Shiffrin and Schneider (Shiffrin & Schneider, 1977).
Thus, for instance, when looking for a red Coke can, it is almost a simple matter to
weight the red feature more during search.
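In sketch form, and continuing the toy winner-take-all example from the previous section, such top-down biasing amounts to nothing more than a per-feature gain applied before the maps are summed; the gain values below are invented for illustration.

#include <cstddef>
#include <vector>

// Scale each feature map by a top-down gain before it enters the saliency sum.
void applyTopDownGains(std::vector<std::vector<std::vector<double>>>& featureMaps,
                       const std::vector<double>& gains)
{
  for (std::size_t f = 0; f < featureMaps.size(); ++f)
    for (auto& row : featureMaps[f])
      for (double& v : row)
        v *= gains[f];  // e.g. gains = {2.0, 1.0, 1.0} to favor the red/green map
}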
1.2.3 Beyond the Basic Saliency Model
The original saliency model of Itti and Koch lacked three components. One was
the interaction of non-local features. Thus, as can be seen in figure 1.8, contours and line
segments which extend past the classic receptive fields of the basic feature detectors have
been found to be salient (Kovács & Julesz, 1993, Polat & Sagi, 1993b, Gilbert et al.,
1996, Braun, 1999, Geisler, Perry, Super & Gallogly, 2001). The second element missing
was temporal attention. This itself is comprised of three components which may or may
not be independent of each other. They are motion, change and masking. Thus, things
which are in motion tend to draw our attention. However, simple changes such as the
appearance or disappearance of an element in a video can draw our attention as well
(Mack & Rock, 1998). The third element of temporal attention, masking, has been
studied quite extensively (Breitmeyer & Öğmen, 2006). It is where something at one
instant in a sequence of images is blocked from perception by something spatially
proximal that comes before or after it. It includes both backwards and forwards masking,
the attentional blink (Raymond, Shapiro & Arnell, 1992) and both automatic and
controlled mechanisms (Sperling & Weichselgartner, 1995, Olivers & Meeter, 2008).
Further, the temporal components of attention are hypothesized to be comprised of more
than one processing stage (Chun & Potter, 1995). The third missing component, top-down attention,
has been partially implemented since the original model’s inception (Itti, 2000,
Navalpakkam & Itti, 2005). However, a complete model of top-down attention is
probably many years away since it requires construction of the “top” component which
may include consciousness itself.
A non-local extension to the saliency model was eventually provided by T. Nathan
Mundhenk (Mundhenk & Itti, 2003, Mundhenk & Itti, 2005) and was extensively tested.
This is covered in chapter 3. The extensions to temporal saliency are covered in
chapters 2, 4, 5 and 6. They include extensions by the addition of a motion channel in
chapter 2 (Mundhenk, Landauer, Bellman, Arbib & Itti, 2004b, Mundhenk,
Navalpakkam, Makaliwe, Vasudevan & Itti, 2004c, Mundhenk, Everist, Landauer, Itti &
Bellman, 2005a) and extension by the usage of Bayesian Surprise in chapters 4, 5 and 6
(Itti & Baldi, 2005, Einhäuser, Mundhenk, Baldi, Koch & Itti, 2007b, Mundhenk,
Einhäuser & Itti, 2009).
1.3 The Current State of Attention and Other Models
Many contemporary models of attention are designed to address one or more of
the shortcomings of the original saliency model discussed in the last section, while others
are attempts at general improvements or are different models altogether.
1.3.1 Top-Down Models
Modeling the factors of top-down versus bottom-up attention goes back a long way. As
can be seen in figure 1.9, an early model was provided by Shiffrin and Schneider, but that
model lacked a good notion of feature integration as well as an attentional map. Jeremy
Wolfe (Wolfe, 1994a) provided a good synthesis of the model of Shiffrin and Schneider
with the model of Koch and Ullman. Thus, the effects of top-down control were merged
with a feature integration attention model which also included an attention map.
However, this is an example of a static scene top-down model. That is, prior knowledge
is integrated as a top-down mechanism, but not necessarily online. Current extensions of
this model include the integration of task influence (Navalpakkam & Itti, 2005) as well as
an explanation of feature tuning (Navalpakkam & Itti, 2007).
Figure 1.9: (Left) An early example of an attention model with top-down guided search activation is the
attention model of Shiffrin and Schneider (Shiffrin & Schneider, 1977). Here automatic parallel processing
layers that compute attention can be controlled by a more serialized attention director. (Right) The model
by Wolfe (Wolfe, 1994a) is conceptually a synthesis of Shiffrin & Schneider with Koch and Ullman (Koch
& Ullman, 1985). That is, it has added feature integration and a saliency map.
Many other models which integrate top-down attention are concerned with online
handling of features as well as task demands. Sperling et al (Sperling, Reeves, Blaser, Lu
& Weichselgartner, 2001) have provided one such model with a gamma-shaped window
function of attention. Task is treated as a spatial cue to certain locations, allowing a
“Quantal” discrete attention window to be opened at that location for a certain amount of
time. It also includes bottom-up attention using the original term “automatic” attention.
However, as with the model of Wolfe, it has not been nearly as completely
implemented as the Itti and Koch model. One might consider it a partial implementation
in comparison.
A recent and important contribution to the modeling of top-down attention is
provided by Olivers and Meeter. This is known as the Boost and Bounce theory of
attention (Olivers & Meeter, 2008). In many ways it is an extension of Sperling et al, but
it has more explicit handling of features as well as an improved description of the
interaction of frontal cortical mechanisms with visual cortical processing. Again,
however, the implementation is very computationally limited.
1.3.2 Other Contemporary Models of Saliency
Currently there are a variety of other attention models in existence. Some are
variants of the model of Itti and Koch (Frintrop, 2006, Itti & Baldi, 2006, Gao,
Mahadevan & Vasconcelos, 2008) while others are more unique (Cave, 1999, Li, 2002,
Bruce & Tsotsos, 2006). The model by Simone Frintrop is known as VOCUS. Its goal is
to use models of saliency to improve computer vision search. It implements top-down
task improvements in a manner similar to Itti and Koch, but adds a top-down
excitation/inhibition mechanism. It also uses the CIE Lab (McLaren, 1976) color space
for color opponents and implements a form of 3D saliency for laser range finders.
Dashan Gao et al (Gao et al., 2008) have implemented an interesting variation on
Itti and Koch which is to change the treatment of center surround interactions. The center
surround response is termed “Discriminant” center surround because it forms a center
surround response based on the strength of a linear discriminant. The more crisp the
discrimination of the center of a location is when compared with its surround, the
stronger a response is given at that location. However, this is a mechanism very similar to
the way the model of Surprise (Itti & Baldi, 2005, Itti & Baldi, 2006) computes spatial
attention.
The model by Bruce and Tsotsos (Bruce & Tsotsos, 2006) is an information
maximization model. It works by taking in a series of images and forming a basis set of features. The basis set is then used to convolve an image. The response to each basis feature competes against the responses to the basis features from all other patches. Thus, if a basis feature gives a unique response at an image location, it is considered salient. The most notable difference of this model compared with Itti and Koch is the derivation of basis features from prior images, similar to Olshausen and Field (Olshausen & Field, 1996). However, the rectification using a neural network may compute competition in a way that is not substantially different from a WTA competition, though it is arguably more biologically plausible.
The model by Li differs more substantially. Li's model (Li, 2002) is strongly model-theoretic and somewhat neglects the task of image processing. However, it is claimed that it can provide saliency pre-attentively without the use of separate feature saliency maps. Thus, the model should compute a singular saliency by combining feature responses at the same time. This may be a more plausible method for computing saliency, but it is unclear if it functionally gains much over other models of saliency.
1.3.3 The Surprise Model
There are two notable trends with saliency models. One is the emergence of
information theoretic constructs and the other is the continued divergence between static
saliency models and dynamic models of attention. With the recent exception of Gao (Gao et al., 2008), attention models have been either static feature-based models or dynamic but primarily theoretical models (Sperling et al., 2001). The introduction of Surprise-based
attention (Itti & Baldi, 2005, Itti & Baldi, 2006) created for the first time a statistically
sound and dynamic model of attention. In chapter 4, we will introduce surprise based
attention and show that it does an excellent job of taking into account dynamic attentional
effects seen in rapid serial vision experiments. This is then shown to give a good
framework for a short term attention gate mechanism in chapter 5.
In short, the new framework has some similarities to Bruce and Tsotsos in that
prior images are used to create belief about new images. However, surprise computes
these beliefs online. This means that it does not need to be trained or have strong prior
information about feature prevalence. Instead the sequence provides the needed
information. The extensive testing and validation in chapters 4-6 also firmly demonstrate that it explains many temporal attention effects. Additionally, we postulate that we have gained further insight into the brain's attentional window.
Chapter 2: Distributed Biologically Based Real Time Tracking
with Saliency Using Vision Feature Analysis Toolkit (VFAT)4
In a prior project, we developed a multi-agent system for noticing and tracking
different visual targets in a room. This was known as the iRoom project. Several aspects
of this system included both individual noticing and acquisition of unknown targets as
well as sharing that information with other tracking agents (Mundhenk et al., 2003a,
Mundhenk, Dhavale, Marmol, Calleja, Navalpakkam, Bellman, Landauer, Arbib & Itti,
2003b). This chapter is primarily concerned with a combined tracker that uses the
saliency of targets to notice them. It then classifies them without strong prior knowledge (priors) of their visual features, and passes that information about the targets to a tracker,
which conversely requires prior information about features in order to track them. This
combination of trackers allows us to find unknown, but interesting objects in a scene and
classify them well enough to track them. Additionally, information gathered can be
placed into a signature about objects being tracked and shared with other camera agents.
The signature that can be passed is helpful for many reasons since it can bias other agents towards a shared target as well as help in creating task-dependent tracking.
2.1.1 Vision, Tracking and Prior Information
For most target acquisition and tracking purposes, prior information about the target's features is needed in order for the tracker to perform its task. For instance, a basic color tracker that tracks objects based on color needs to know a priori the color of the target that it wishes to track.
4 For more information see also: http://ilab.usc.edu/wiki/index.php/VFAT_Tech_Doc
If one is going to track a flying grapefruit, then one would set a tracker with a certain color of yellow and some threshold about which the color can vary. In general, many newer trackers use statistical information about an object's features, which allows one to define seemingly more natural boundaries for what
features one would expect to find on a target (Lowe, 1999, Mundhenk et al., 2004b,
Mundhenk et al., 2004c, Mundhenk et al., 2005a, Siagian & Itti, 2007). However, in
order to deploy such a tracker, one needs to find the features that describe the object before tracking it. This creates two interesting problems. The first problem is that the set
of training examples may be insufficient to describe the real world domain of an object.
That is, the trainer leaves out examples from training data, which may hold important
information about certain variants of an object. From our flying grapefruit tracking example, we might assume that, of the fruits that fly by, oranges never do. As a result, we would unknowingly let our tracker have some leeway and track grapefruit that
might even be orange in appearance. It might however turn out that we were wrong. At
some point, an orange flies by and our tracker tracks it the same as a flying grapefruit.
This can happen for several reasons. The first is that we had never observed an orange fly by and as such didn't realize that, indeed, they can fly by. Another reason is that the world
changed. When we set up the tracker, only grapefruits could fly by. However, the thing that makes them fly now acts on oranges, which may be an accidental change, for
instance if an orange tree begins to grow in our flying grapefruit orchard. However, it
might also be the case that someone has decided to start throwing oranges in front of our
tracker. As such, the domain of trackable objects can change either accidentally or intentionally. In such a case, our tracker may now erroneously track flying oranges as flying grapefruit.
As can be seen from our first example, our tracker might fail if someone tries to
fool it. Someone starts throwing oranges in front of our tracker, or perhaps they might
wrap our grapefruits in a red wrapper so that our tracker thinks they are apples. If we are
selling our flying grapefruits and our tracker is supposed to make sure each one makes it
to a shipping crate, it would fail if someone sneaks them by as another fruit. As such,
once a dishonest person learns what our tracker is looking for, it becomes much easier to
fool. This is seen in the real world in security applications, such as spam filtering, where many email security companies have to update information on what constitutes spam on a regular basis to deal with spammers who learn simple ways around the filters. It should be expected that the same problem would apply to any other security-related application, including a vision-based tracker. In the case of our flying grapefruit tracker, its function
may not be explicitly security related, but as a device related to accounting, it is prone to
tampering.
What is needed then for vision-based tracking is the ability to define its own priors. It has been proposed that gestalt rules of continuity and motion allow visual information to be learned without necessarily needing prior information about what features individual objects possess (von der Malsburg, 1981, Prodöhl, Würtz & von der Malsburg, 2003, Mundhenk et al., 2004b, Mundhenk & Itti, 2005). That is, the human
visual system does not necessarily know what it is looking for, but it knows how to learn
how to look. This itself constitutes a kind of prior information which one might consider
meta-prior information. That is, information about what structure or meta-model is
needed to gather prior information, such as Bayesian information, is itself a type of prior
information. Using meta-prior information, an artificial agent might learn on its own how
to form groups that can be used to create statistical relationships and build new prior
information about what it wishes to track. Thus, abstractly speaking, meta-priors are
concerned with learning about how to learn.
2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors
Figure 2.1: It is interesting to note how different AI solutions require different amounts of prior
information in order to function. Additionally, it seems that the more prior information a solution requires
the more certainty it has in its results, but the more biased it becomes towards those results. Thus, we can
place solutions along a spectrum based on the prior information required. Popular solutions such as Back
Propagation Neural Networks and Support Vector Machines seem to fall in the middle of the spectrum in
essence making them green machines and earning them the reputation of being the 2nd best solution for
every problem.
We propose that meta-priors are part of a spectrum of knowledge acquisition and
understanding. At one end of the spectrum are the rigid rules of logic and induction, from which decisions are drawn with great certainty, but for which unknown variables must be sufficiently sparse to make those decisions reasonable (figure 2.1). In the middle we place
more traditional statistical methods, which either require what we will define as strong
meta-priors in order to work or require Bayesian priors. We place the statistical machines
in the middle, since they allow for error and random elements as part of probabilities and
do not need to know everything about a target. Instead, they need to understand the
variance of information and draw decisions about what should be expected. Typically,
this is gifted to a statistical learner in the form of a kernel or graph. Alternatively, the
meta-prior does not make an inference about knowledge itself, but instead is used to
understand its construction. From this, we then state that meta-priors can lead to
Bayesian priors, which can then lead to logical inductive priors. From meta-priors we
have the greatest flexibility about our understanding of the world and in general terms,
the least amount of bias; whereas on the other end of the spectrum, logical inductive
priors have the least flexibility, but have the greatest certainty.
An ideal agent should be able to reason about its knowledge along this spectrum.
If a probability becomes very strong, then it can become a logical rule. However, if a
logical rule fails, then one should reason about the probability of it doing so.
Additionally, new things may occur which have yet unknown statistical properties. As
such, the meta-priors can be used to promote raw data into a statistical framework or to
re-reason about a statistical framework, which now seems invalid.
Using certain kinds of meta-prior information, many Bayesian systems are able to
find groupings which can serve as prior information to other programs which are unable
to do so themselves. However, most Bayesian models require a variety of strong meta-priors. For instance, the most common requirement is that the
number of object or feature classes must be specified. This can be seen in expectation
maximization, K-means and back-propagation neural networks, which need to have a set
size for how many classes exist in the space they inspect. The number of classes thus becomes a strong and rather inflexible meta-prior for these methods. Additionally, other
strong meta-priors may include space size, data distribution types and the choice of
kernel.
The interesting thing about meta-priors is that they can be flexible or rigid. For
instance, specifying you have several classes that are fit by a Gaussian distribution is
semi-flexible in that you have some leeway in the covariance of your data, but the
distribution of the data should be uni-modal and have a generally elliptical shape. An
example of more rigid meta-priors would be specifying a priori the number of classes
you believe you will have. So for instance, going back to our grapefruit example, if you
believe your data to be Gaussian, you suspect that flying grapefruit have a mean color
with some variance in that color. You can make a more rigid assumption that you will
only see three classes such as, flying grapefruit, oranges and apples. All of these are of
course part of the design process, but as mentioned they are prone to their own special
problems. Ideally, an intelligent agent that wishes to reason about the world should have
the ability to reason with flexible weak meta-priors but then use those to define Bayesian-like priors. Here we define weak meta-priors as having flexible parameters that can
automatically adjust to different situations. So for instance, we might set up a computer
vision system and describe for it the statistical features of grapefruit, oranges and apples.
However, the system should be able to define new classes from observation either by
noticing that a mass of objects (or points) seem to be able to form their own category
(Rosenblatt, 1962, Dempster, Laird & Rubin, 1977, Boser, Guyon & Vapnik, 1992, Jain,
Murty & Flynn, 1999, Müller, Mika, Rätsch, Tsuda & Schölkopf, 2001, Mundhenk et al.,
2004b, Mundhenk et al., 2005a) or through violation of expectation and surprise (Itti &
Baldi, 2005, Itti & Baldi, 2006). An example of the first is that if we cluster data points
that describe objects, and if a new object appears such as a kiwi, a new constellation of
points will emerge. An example of the second is that if we expect an apple to fly by, but
see an orange, it suggests something interesting is going on. It might be that new fruit
have entered our domain. In the first case, our learning is inductive, while in the second
case it is more deductive.
We thus define weak meta-priors to be situationally independent: the prior information they yield can vary depending on the situation and the data. Ideally, information within the data itself is what drives this flexibility. So for instance, when
selecting what is the most salient object in a scene, we might select a yellow ball.
However, a moving gray ball may be more salient if presented at the same time as the
yellow ball. Thus, the feature that selects what is most salient is not always a color, but can also be motion. So it is the interplay of these features which can promote the saliency of one object over the other (Treisman & Gelade, 1980). Yet another example is
that the number of classes is not defined a priori as a strong meta-prior, but instead,
variance between features causes them to coalesce into classes. So as an abstract
example, the number of planets in a solar system is not pre-determined. Instead, the
interplay of physical forces between matter will eventually build a certain number of
planets. Thus, the physical forces of nature are abstractly a weak meta-prior for what kind
of planets will emerge, and how many will be formed.
2.1.4 The iRoom and Meta-prior Information
Here we now review a vision system for following and tracking objects and
people in a room or other spaces that can process at the level of weak meta-priors,
Bayesian priors and even logical inductive priors. For this, we need artificial experts which can use weak meta-priors to process information into more precise statistical and Bayesian form. Additionally, once we know things with a
degree of certainty, it is optimal to create rules for how the system should behave. That
is, we input visual information looking for new information from weak meta-priors,
which can be used to augment a vision system that uses Bayesian information. Eventually
strong Bayesian information can be used to create logical rules. We will describe this
process in greater detail in the following pages but give a brief description here.
Using a biological model of visual saliency from the iLab Neuromorphic Vision
Toolkit (INVT) we find what is interesting in a visual scene. We then use it to extract
visual features from salient locations (Itti & Koch, 2001b) and group them into classes
using a non-parametric and highly flexible weak meta-prior classifier, NPclassify
(Mundhenk et al., 2004b, Mundhenk et al., 2005a). This creates initial information about
a scene: for instance how many classes of objects seem present in a scene, where they are
and what general features they contain. We then track objects using this statistically
priorless tracker but gain advantage by taking the information from this tracker and
handing it to a simple tracker, which uses statistical adaptation to track a target with
greater effectiveness. In essence, it takes in initial information and then computes its own
statistical information from a framework using weak meta-prior information. That
statistical information is then used as a statistical prior in another simpler and faster
tracker.
2.2 Saliency, Feature Classification and the Complex Tracker
There were several components used in the tracking system in iRoom. As
mentioned, these started by needing less meta-prior information and then gathering
information that allows the tracking of targets by more robust trackers that require more
information about the target. The first step is to notice the target. This is done using
visual saliency. Here very basic gestalt rules about the uniqueness of features in a scene
are used to promote objects as more or less salient (Treisman & Gelade, 1980, Koch &
Ullman, 1985, Itti & Koch, 2001b). This is done by competing image feature locations
against each other. A weak image feature that is not very unique will tend to be suppressed by other image features, while strong image features that are different will tend to pop out as they receive less inhibition. In general, the saliency model acts as a kind
of max selector over competing image features. The result from this stage is a saliency
map that tells us how salient each pixel in an image is. Once the saliency of locations in
an image can be computed, we can extract information about the features at those
locations. This is done using a Monte Carlo like selection that treats the saliency map as a
statistical map for these purposes. The more salient a location in an image is, the more
likely we are to select a feature from that location. In the current working version we
select about 600 feature locations from each frame of video. Each of the feature locations
contains information about the image such as color, texture and motion information.
These are combined together and used to
Figure 2.2: The complex feature tracker is a composite of several solutions. It first uses INVT visual
saliency to notice objects of interest in a scene. Independent Component Analysis and Principal Component Analysis (Jollife, 1986, Bell & Sejnowski, 1995, Hyvärinen, 1999) are used to reduce dimensions and
condition the information from features extracted at salient locations. These are fed to a non-parametric
clustering based classification algorithm called NPclassify, which identifies the feature classes in each
image. The feature classes are used as signatures that allow the complex tracker to compare objects across
frames and additionally share that information with other trackers such as the simple tracker discussed
later. The signatures are also invariant to many viewpoint effects. As such they can be shared with cameras
and agents with different points of view.
classify each of the 600 features into distinct classes. For this we use the non-parametric
classifier NPclassify mentioned above. This classifier classifies each feature location
without needing to know a priori the number of object feature classes or how many
samples should fall into each class. It forms classes by weighting each feature vector
from each feature location by its distance to every other point. It then can link each
feature location to another, which is the closest feature location that has a higher weight.
This causes points to link to more central points. Where a central point links to another cluster it is not a member of, we tend to find that the link is comparatively long.
We can use this to cut links, thus, creating many classes. In essence, feature vectors from
the image are grouped based on value proximity. As an example, two pixels that are close
to each other in an image and are both blue would have a greater tendency to be grouped
together than two pixels in an image that are far apart and are blue and yellow.
Once we have established what classes exist and which feature locations belong to
them, we can statistically analyze them to determine prior information that will be useful
to any tracker, which requires statistical prior information in order to track a target. Thus,
we create a signature for each class that describes the mean values for each feature type
as well as the standard deviation within that class. Additionally, since spatial locations
play a part in weighting feature vectors during clustering, feature vectors that are
classified in the same class tend to lie near each other. Thus, the signature can contain the
spatial location of the class as well. Figure 2.2 shows the flow from saliency to feature
classification and signature creation.
The signatures we derive from the feature properties of each class serve two purposes. The first is that they allow this complex tracker to build its own prior awareness. When it classifies the next frame of video, it can try to match each of the new objects it classifies to the objects from the last frame. Thus, it is not just a
classifier, but it can track objects on its own for short periods. Further, we can use
information about targets to bias the classification process between frames. So for
instance, we would expect that the second frame of video in a sequence should find
objects which are similar to the first frame. As such, each classified object in any given
frame, biases the search in the next frame, by weighting the classifier towards finding
objects of those types.
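To make the idea concrete, a per-class signature and its use for frame-to-frame matching might be sketched as follows. This is only an illustrative Python sketch with hypothetical field and function names; the actual VFAT signature structure and its probabilistic matching are defined in the toolkit source.

import numpy as np

class ClassSignature:
    # Sketch of a per-class signature: feature statistics plus spatial location.
    # Field names here are hypothetical, not the VFAT data structure.
    def __init__(self, features, locations):
        self.mean = features.mean(axis=0)       # mean value of each feature type in the class
        self.std = features.std(axis=0)         # spread of each feature within the class
        self.location = locations.mean(axis=0)  # spatial centroid of the class members
        self.mass = len(features)               # number of feature samples in the class

    def match_cost(self, other):
        # Crude distance for linking a class in frame t to one in frame t+1;
        # the real system uses a probabilistic (Gaussian) match instead.
        return (np.linalg.norm(self.mean - other.mean)
                + np.linalg.norm(self.location - other.location))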
While this seems very complex, signature creation is fairly quick: saliency computation is done in real time on eight 733 MHz Pentium III computers in a Beowulf cluster, and the rest of the code runs in under 60 ms on an Opteron 150 based computer. This means we can do weak meta-prior classification and extraction of signatures at more than 15 frames per second.
2.2.1 Complex Feature Tracker Components
2.2.1.1 Visual Saliency
The first stage of processing is finding which locations in an image are most
salient. This is done using the saliency program created by Itti and Koch (2001b), which works by looking for certain types of uniqueness in an image (Figure 2.3). This simulates the processing the human brain performs in visual cortex when looking for the locations in an image which are most salient. For instance, a red Coke can placed among green
foliage would be highly salient since it contrasts red against green. In essence, each pixel
in an image can be analyzed and assigned a saliency value. From this a saliency map can
be created. The saliency map simply tells us the saliency of each pixel in an image.
2.2.1.2 Monte Carlo Selection
The saliency map is taken and treated as a statistical map for the purpose of
Monte Carlo selection. The currently used method will extract a specified number of
features from an image. Highly salient locations in an image have a much higher
probability of being selected than regions of low saliency. Additionally, biases from other
modules may cause certain locations to be picked over consecutive frames from a video.
For instance, if properties of a feature vector indicate it is very useful, then it makes sense
to select from a proximal location in the next frame. Thus, the saliency map combines
with posterior analysis to select locations in an image which are of greatest interest.
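As an illustration, the selection step amounts to weighted sampling with the normalized saliency map as a probability distribution over pixel locations. The sketch below is a minimal Python/numpy version under that assumption; the 600-sample count matches the working version described above, and any biasing from other modules would simply reweight the map before sampling.

import numpy as np

def monte_carlo_select(saliency_map, n_samples=600, rng=None):
    # Draw feature locations with probability proportional to saliency.
    rng = rng or np.random.default_rng()
    p = saliency_map.astype(float).ravel()
    p /= p.sum()                                   # treat the map as a distribution
    idx = rng.choice(p.size, size=n_samples, p=p)  # salient pixels are picked more often
    rows, cols = np.unravel_index(idx, saliency_map.shape)
    return np.stack([rows, cols], axis=1)          # (n_samples, 2) pixel coordinates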
Figure 2.3: The complete VFAT tracker is a conglomeration of different modules that select features from an image, mix them into more complex features and then try to classify those features without strong meta-priors for what kind of features it should be looking for.
2.2.1.3 Mixing Modules
2.2.1.3.1 Junction and End Stop Extraction
Figure 2.4: Saliency is comprised of several channels which process an image at a variety of different
scales and then combine those results into a saliency map.
During the computation of visual saliency, orientation filtered maps are created.
These are the responses of the image to Gabor wavelet filters. These indicate edges in the
image. Since each filter is tuned to a single preferred orientation, a response from a filter
indicates an edge that is pointed in the direction of preference. The responses from the
filters are stored in individual feature maps. One can think of a feature map as simply an
image which is brightest where the filter produces its highest response. Since the feature
maps are computed as part of the saliency code, re-using them can be advantageous from
an efficiency standpoint. From this we create feature maps to find visual junctions and
end-stops in an image by mixing the orientation maps (Figure 2.4). We believe such new
complex feature maps can also tell us about the texture at image locations which can help
give us the gist of objects to be tracked.
The junction and end-stop maps are computed as follows. Note that this is a different computation than the one used in appendix D and chapter 5 in the attention gate model. At some common point i,j on the orientation maps P, the filter responses from the orientation filters are combined. Here the response in one orientation map $p_{ij}$ is subtracted from an orthogonal map's orientation filter output $p^{orth}_{ij}$ and divided by a normalizer n, which is the max value of the numerator. For instance, one orientation map that is selective for 0 degree angles is subtracted from another map selective for 90 degree angles. This yields the lineyness of a location in an image, because where orthogonal maps overlap in their response is at the junctions of lines.
(2.1) $a^k_{ij} = \dfrac{p^{orth}_{ij} - p_{ij}}{n}; \quad k \in \{1,2\}$
We then compute a term (2.2) which is the orthogonal filter responses summed. This is
nothing more than the sum of the responses in two orthogonal orientation maps.
Figure 2.5: The three images on the right are the results of the complex junction channel after ICA/PCA
processing from the original image on the left. As can be seen it does a reasonable job of finding both
junctions and end stops.
(2.2) $b^k_{ij} = \dfrac{p_{ij} + p^{orth}_{ij}}{n}; \quad k \in \{1,2\}$
The individual line maps are combined as:
(2.3) $\alpha_{ij} = \dfrac{a^1_{ij} + a^2_{ij}}{n}$
This gives the total lineyness for all orientations. We then do a similar thing for our total
response maps:
(2.4) $\beta_{ij} = \dfrac{b^1_{ij} - b^2_{ij}}{n}$
The final junction map γ is then computed by subtracting the lineyness term from the
total output of the orientation filters:
(2.5) $\gamma_{ij} = \alpha_{ij} - \beta_{ij}$
Since the junction map is computed by adding and subtracting orientation maps which
have already been computed during the saliency computation phase, we gain efficiency
we wouldn’t have had if we were forced to convolve a whole new map by a kernel filter.
Thus, this junction filter is fairly efficient since it does not require any further
convolution to compute. Figure 2.5 shows the output and it can be seen that it is effective
at finding junctions and end-stops.
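For illustration, equations (2.1)-(2.5) can be written as a few lines of array arithmetic. The sketch below assumes that the two values of k correspond to the two orthogonal pairs of orientation maps (0°/90° and 45°/135°) and takes the normalizer n to be the maximum absolute value of each term; both are assumptions made for the sketch rather than details taken from the implementation.

import numpy as np

def junction_map(p0, p45, p90, p135):
    # Sketch of eqs (2.1)-(2.5): junction/end-stop map from four orientation maps.
    # Each input is a Gabor response map of the same size.
    def norm(x):
        m = np.abs(x).max()
        return x / m if m > 0 else x
    a1, a2 = norm(p90 - p0), norm(p135 - p45)   # eq (2.1): per-pair lineyness
    b1, b2 = norm(p90 + p0), norm(p135 + p45)   # eq (2.2): per-pair total response
    alpha = norm(a1 + a2)                       # eq (2.3): total lineyness
    beta = norm(b1 - b2)                        # eq (2.4): combined response term
    return alpha - beta                         # eq (2.5): junction map gamma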
2.2.1.3.2 ICA/PCA
We decrease the dimensionality of each feature vector by using a combination of Independent Component Analysis (ICA) (Bell & Sejnowski, 1995) and Principal Component Analysis (PCA) (Jollife, 1986). This is done using FastICA (Hyvärinen,
1999) to create ICA un-mixing matrices offline. The procedure for training this is to
extract a large number of features from a large number of random images. We generally
use one to two hundred images and 300 points from each image using the Monte Carlo selection process just described. FastICA first determines the PCA reduction matrix
and then determines the matrix that maximizes the mutual information using ICA.
Unmixing matrices are computed for each type of feature across scales. So as an
example, the red-green opponent channel is computed at different scales, usually six.
PCA/ICA will produce a reduced set of two opponent maps from the six original scale
maps (This is described in detail later and can be seen in figure 2.7). Using ICA with
PCA helps to ensure that we not only reduce the dimension of our data set, but that the
information sets are fairly unique. From the current data, we reduce the total number of dimensions across all channels from 72 to 14, which is a substantial efficiency gain, especially given that some modules have complexity $O(d^2)$ for d feature channels (dimensions).
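As a rough sketch of this offline step, the per-channel reduction could be reproduced with a standard FastICA implementation, for example the one in scikit-learn, which performs the PCA whitening internally when asked for fewer components than inputs. This is not the FastICA/Matlab pipeline actually used; it only illustrates the shape of the computation.

import numpy as np
from sklearn.decomposition import FastICA

def train_unmixing(channel_samples, n_components=2):
    # channel_samples: (n_samples, 6) responses of one feature type across its six scales,
    # gathered by the saliency / Monte Carlo selection described earlier.
    # Returns a fitted transform that reduces the six scale maps to n_components.
    ica = FastICA(n_components=n_components, random_state=0)
    ica.fit(channel_samples)
    return ica

# Hypothetical usage: roughly 300 salient samples from each of ~145 images per channel.
# reduced = train_unmixing(red_green_samples).transform(red_green_samples)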
Figure 2.6: NPclassify works by (A) first taking in a set of points (feature vectors) (B) then each point is
assigned a density which is the inverse of the distance to all other points (C) Points are then linked by
connecting a point to the nearest point which has a higher density (D) Very long links (edges) are cut if
they are for instance statistically longer than most other links. This creates separate classes.
2.2.1.4 Classification Modules
2.2.1.4.1 Classification of Features with NPclassify5
Features are initially classified using a custom non-parametric clustering
algorithm called NPclassify6. The idea behind the design of NPclassify is to create a
5 This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/NPclassify2.C/.H
6 A description and additional information on top of what will be discussed can be found at:
http://www.nerd-cam.com/cluster-results/.
clustering mechanism which has soft parameters that are learned and are used to classify
features. We define here soft parameters as values which define the shape of a meta-prior.
This might be thought of as being analogous to a learning rate parameter or a Bayesian
hyperparameter. For instance, if we wanted to determine at which point to cut off a
dataset and decided on two standard deviations from the mean, two standard deviations
would be a soft parameter since the actual cut off distance depends on the dataset.
NPclassify (Figure 2.2, 2.6 and 2.7) (Mundhenk et al., 2004b, Mundhenk et al.,
2005a) works by using a kernel to find the density at every sample point. The currently
used kernel does this by computing the inverse of the sum of the Euclidean distances from each point to all other points. After density has been computed, the sample points are
linked together. This is done by linking each point to the closest point which has a higher
density. This creates a path of edges which ascends acyclically along the points to the
point in the data set which has the highest density of all. Classes are created by figuring
out which links need to be cut. For instance, if a link between two sample points is much
longer than most links, it suggests a leap from one statistical mode to another. This then
may be a good place to cut and create two separate classes. Additionally, classes should
be separated based upon the number of members the new class will have. After classes
have been created, they can be further separated by using interclass statistics.
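A simplified re-implementation of the linking-and-cutting idea is sketched below. The fixed cut threshold (a multiple of the mean link length) and the absence of class-size and interclass checks are simplifications for illustration; the actual NPclassify learns its cutoff parameters as soft parameters, as discussed in the training section.

import numpy as np

def npclassify_sketch(points, cut_factor=2.0):
    # points: (n, d) feature vectors. Returns an integer class label for each point.
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    density = 1.0 / (dists.sum(axis=1) + 1e-9)       # inverse summed distance to all other points
    parent = np.full(n, -1)
    link_len = np.zeros(n)
    for i in range(n):                               # link each point to the nearest denser point
        denser = np.where(density > density[i])[0]
        if denser.size:
            j = denser[np.argmin(dists[i, denser])]
            parent[i], link_len[i] = j, dists[i, j]
    linked = parent >= 0
    cut = link_len[linked].mean() * cut_factor if linked.any() else 0.0
    parent[link_len > cut] = -1                      # cut links much longer than a typical link
    def root(i):
        return i if parent[i] < 0 else root(parent[i])
    labels = np.array([root(i) for i in range(n)])
    return np.unique(labels, return_inverse=True)[1] # relabel roots as consecutive class ids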
The advantage to using NPclassify is that we are not required to have a prior
number of classes or any prior information about the spatial or sample sizes of each class.
Figure 2.7: On the left are samples of features points with the class boundaries NPclassify has discovered.
Some of the classes have large amounts of noise while others are cramped together rather than being
separated by distance. On the right are the links NPclassify drew in order to create the clusters. Red links
are ones which are too long and were clipped by the algorithm to create new classes.
Instead, the modal distribution of the dataset combined with learned notions of feature
connectedness determine whether a class should be created. So long as there is some
general statistical homogeneity between training and testing datasets we should expect
good performance for clustering-based classification. The training procedure and results are discussed later in the section on training.
Figure 2.8: The results using NPclassify are shown next to the same results for k-means on some sham data. The derived clusters are shown with the Gaussian eigenmatrix bars (derived using the eigenmatrix estimation in section 2.2.1.4.2). In general, NPclassify creates more reliable clusters, particularly in the presence of noise. Additionally, it does so without needing to know a priori how many classes one has.
As such, we do have a few meta-priors still present. The first is a basic kernel
parameter for density. In this case, the Euclidean distance factor makes few assumptions
about the distribution other than that related features should clump together. The second
meta-prior is learned as a hyperparameter for a good cutoff. This can be derived using
practically any gradient optimization technique. So it is notable that NPclassify is not without some type of prior, but the assumptions on the data are quite relaxed: it only assumes that related feature samples will be close to each other in feature space. An
example of NPclassify working on somewhat arbitrary data points can be seen in figure
2.8.
2.2.1.4.2 Gaussian Generalization and Approximation7
In order to store classes for future processing it is important to generalize them.
Gaussian ellipsoids are used since their memory usage for any class is $O(d^2)$ for d dimensions. Since d is fairly low for us, this is an acceptable complexity. Additionally, by using Gaussians we gain the power of Bayesian inference when trying to match feature classes to each other. However, the downside is that computing the eigenmatrix necessary for Gaussian fitting scales minimally as $d^3$ for dimensions and $s^2$ for the number of samples; that is, it is $O(d^3 + s^2)$. This is due to the fact that computing such elements using the pseudo-inverse method (or QR decomposition) involves matrix inversion and multiplication. In order to avoid such large complexity we have implemented an approximation technique that scales minimally as $d^2$ for dimensions and s for the number of samples, i.e. $O(sd^2)$. This means that a net savings happens if the number of samples is much larger than the number of dimensions. So for
7 This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/covEstimate.C/.H
instance, if there are more than 100 samples and only 10 dimensions, this will produce a
savings over traditional methods.
Figure 2.9: After NPclassify has grouped feature samples together they can be fit with Gaussian distributions. This helps to determine the probability that some new feature vector belongs to a given class or that two classes computed in consecutive frames using NPclassify are the same class. If the distributions overlap greatly, as in the left figure, then the two classes are probably the same class.
The approximation method works by using orthogonal rotations to center and
remove covariance from the data. By recording the process, we can then compute the probability of new data points by translating and transforming them in the same way to align with the data set. What we want to be able to do is tell the probability of data points belonging to some class, as well as tell whether two classes derived in consecutive frames are probably the same class (see figure 2.9).
The first step is to center the data about the origin. This is done by computing the
mean and then subtracting that number from each feature vector. Next we compute
approximate eigenvectors by trying to find the average vector from the origin to all
feature vector coordinates. So for the kth feature vector, we first compute the ratio between its distances l from the origin along dimensions j and i. This yields the ratio $r_{ijk}$. That is, after aligning the feature vector with the origin, we take the ratio of two features in the same vector (we will do this for all possible feature pairs in the vector).
(2.6) $r_{ijk} = \dfrac{l_{jk}}{l_{ik}}$
Next we find the Euclidean distance $u_{ijk}$ from the origin along dimensions j and i.
(2.7) $u_{ijk} = \sqrt{l_{ik}^2 + l_{jk}^2}$
By summing the ratio of $r_{ijk}$ to $u_{ijk}$ over all k feature vectors, we obtain a mean ratio that describes the approximated eigenvector along the dimensions i and j.
(2.8) $m_{ij} = \displaystyle\sum_{k} \frac{r_{ijk}}{u_{ijk}}$
A normalizer is computed as the sum of all the distances for all samples k.
(2.9) $n_{ij} = \displaystyle\sum_{k} u_{ijk}$
Next we determine the actual angle of the approximated eigenvector along the
dimensions i and j.
(2.10) $\theta_{ij} = \tan^{-1}\!\left(\dfrac{m_{ij}}{n_{ij}}\right)$
Once we have that, we can rotate the data set along that dimension and measure the length of the ellipsoid using a basic sum of squares operation. Thus, we compute $\rho_{ik}$ and $\rho_{jk}$, which are the data set rotated by $\theta_{ij}$. Here $\xi$ is the position of the kth feature vector along the i dimension and $\psi$ is its position along the j dimension. What we are doing here is rotating covariance out along each dimension so that we can measure the length of the eigenvalue. Thus, we iterate over all data points k, along all dimensions i and along dimensions $j \ge i+1$, summing up $\sigma$ as we go. We only sum j for $i+1$ since we only need one triangle of the eigenvector matrix, as it is symmetric along the diagonal.
(2.11) $i + 1 \le j$
(2.12) $\rho_{ik} = \xi\cdot\cos(\theta_{ij}) + \psi\cdot\sin(\theta_{ij})$
(2.13) $\rho_{jk} = -\xi\cdot\sin(\theta_{ij}) + \psi\cdot\cos(\theta_{ij})$
What we have done is figure out how much we need to rotate the set of feature vectors in
order to align the least squares slope with the axis. Once this is done, we can rotate the
data set and remove covariance. Since the mean is zero because we translated the data set
by the mean to the origin, variance for the sum of squares is computed simply as:
(2.14) $s_{iij} = \dfrac{\displaystyle\sum_{k} \rho_{ik}^2}{n}$
(2.15) $s_{jji} = \dfrac{\displaystyle\sum_{k} \rho_{jk}^2}{n}$
Each sum of squares is used to find the eigenvalue estimate by computing Euclidean distances. That is, by determining the travel distance of each eigenvector during rotation and combining that number with the computed sum of squares, we can determine an estimate of the eigenvalue by triangulation. The conditional here is used because $\sigma_{ii}$ is computed more than once with different values of $\theta_{ij}$. Thus, $\sigma_{ii}$ is the sum of all the products of $\theta_{ij}$ and $s_{iij}$.
(2.16) $\sigma_{ii} = \begin{cases} s_{iij} & \text{iff } \sigma_{ii} = 0 \\ \sigma_{ii} + \left(s_{iij}\cdot\cos(\theta_{ij})\right)^2 & \text{otherwise} \end{cases}$
(2.17) $\sigma_{jj} = \begin{cases} s_{jji} & \text{iff } \sigma_{jj} = 0 \\ \sigma_{jj} + \left(s_{jji}\cdot\cos(\theta_{ij})\right)^2 & \text{otherwise} \end{cases}$
The end result is a non-standard eigenmatrix which can be used to compute the
probability that a point lies in a Gaussian region. We do this by performing the same
procedure on any new feature vector. That is, we take any new feature vector and replay the computed translation and rotations to align it with the covariance-neutral eigenmatrix approximation. Probability for the feature vector is then computed independently along
each dimension thus eliminating further matrix multiplication during the probability
computation.
To summarize, by translating and rotating the feature set, we have removed
covariance so we can compute probabilities assuming dimensions do not interact. In
essence this removes the need for complex matrix operations. While the complexity is
high, it is one order lower than the standard matrix operations as was mentioned earlier.
Examples of fits created using this method can be seen in figure 2.7, where NPclassify has created classes and the eigenmatrix has been estimated for each of them.
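The overall idea, centering the data, rotating covariance out one dimension pair at a time while recording the transforms, and then scoring new points with independent one-dimensional Gaussians, can be sketched as below. Note that the rotation angle in this sketch comes directly from the sample covariance rather than from the mean-ratio estimate of equations (2.6)-(2.17), so it illustrates the structure of the method but not the exact approximation used.

import numpy as np

def fit_axis_aligned(points):
    # Center the data and apply pairwise rotations so dimensions no longer covary,
    # keeping per-dimension variances and recording the transforms for later replay.
    X = np.asarray(points, dtype=float)
    mean = X.mean(axis=0)
    Y = X - mean
    d = Y.shape[1]
    rotations = []
    for i in range(d):
        for j in range(i + 1, d):                  # one triangle only (cf. eq 2.11)
            cov = (Y[:, i] * Y[:, j]).mean()
            theta = 0.5 * np.arctan2(2.0 * cov, (Y[:, i]**2 - Y[:, j]**2).mean() + 1e-12)
            c, s = np.cos(theta), np.sin(theta)
            yi, yj = Y[:, i].copy(), Y[:, j].copy()
            Y[:, i], Y[:, j] = c * yi + s * yj, -s * yi + c * yj   # rotate covariance out (cf. eqs 2.12-2.13)
            rotations.append((i, j, c, s))
    return mean, rotations, Y.var(axis=0) + 1e-12

def log_prob(x, mean, rotations, var):
    # Replay the recorded translation and rotations on a new point, then score each
    # dimension with an independent 1-D Gaussian; no matrix inversion is needed.
    y = np.asarray(x, dtype=float) - mean
    for i, j, c, s in rotations:
        yi, yj = y[i], y[j]
        y[i], y[j] = c * yi + s * yj, -s * yi + c * yj
    return float(-0.5 * np.sum(y**2 / var + np.log(2.0 * np.pi * var)))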
2.2.1.4.3 Feature Contiguity, Biasing and Memory
Once features have been classified we want to use them to perform various tasks.
These include target tracking, target identification and feature biasing. Thus from a
measurement of features from time t, we would like to know if a collection of features at
time t+1 is the same, and as such either the same object or a member of the same object.
By using Bayesian methods we can link classes of features in one frame of a
video to classes in the next frame by tying a class to another which is its closest
probabilistic match. Additionally, we use the probability to bias how the non-parametric
classifier and saliency work over consecutive frames. For NPclassify we add a sink into
the density computation. That is, we create a single point whose location is the mean of a
class with the mass of the entire class. Think of this as dropping a small black hole in a
galaxy that represents the mass of the other class. By inserting this into the NPclassify
computation, we skew the density computation towards the prior statistics from the last iteration. This creates a Kalman-filter-like effect that smooths the computation of classes between frames. This is a reasonable action since the change in features from one frame
to the next should be somewhat negligible.
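Continuing the earlier sketches, one simple way to realize such a sink is to append a pseudo-point at each previous class mean before the density computation, weighted by repeating it in proportion to the class mass. This weighting scheme is an assumption made for illustration only and is not necessarily how VFAT implements the bias.

import numpy as np

def add_class_sinks(points, prev_signatures):
    # Append one pseudo-point per previous-frame class, repeated according to its mass,
    # so the density computation is pulled toward the last frame's class statistics.
    sinks = [np.tile(sig.mean, (sig.mass, 1)) for sig in prev_signatures]
    return np.vstack([points] + sinks) if sinks else points

# Hypothetical usage with the earlier sketches:
# labels = npclassify_sketch(add_class_sinks(new_points, last_frame_signatures))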
2.2.1.5 Complex Feature Tracker Methods and Results
2.2.1.5.1 Complexity and Speed
One of the primary goals of VFAT is that it should be able to run in real time.
This means that each module should run for no more than about 30 ms. Since we are
using a Beowulf cluster, we can chain together modules such that even if we have several
steps that take 30 ms each, by running them on different machines we can create a vision
pipeline whereby a module finishes a job and hands the results to another machine in a
Beowulf cluster that is running the next process step.
In time trials the modules run within real-time speeds. Using a Pentium 4 2.4 GHz mobile processor with 1 GB of RAM, each module of VFAT runs at or near 30 ms. The longest-running module is the NPclassify feature classifier: given only 300 features it runs in 23 ms, while for 600 features it can take as long as 45 ms. On a newer system it should be expected to run much faster.
2.2.1.5.2 Training for Classification
Table 2.1: Following PCA the amount of variance accounted for was computed for each type of feature
channel. Each channel started with six scales (dimensions). For many channels, 90% of variance is
accounted for after a reduction to two dimensions. For all others, no more than three dimensions are needed
to account for 90% of variance.
Two modules in VFAT need to be trained prior to usage. These include ICA/PCA
and NPclassify. Training for both has been designed to be as simple as possible in order
to maintain the ease of use goal of the iRoom project. Additionally and fortunately,
training of both modules is relatively quick with ICA/PCA taking less than a minute
using the FastICA algorithm under Matlab and NPclassify taking around two hours using
gradient descent training. Since we only ever need to train once, this is not a prohibitive amount of time.
2.2.1.5.3 Training ICA/PCA
Figure 2.10: The various conspicuity maps of the feature channels from the saliency model are shown here after ICA/PCA reduction.
Training was completed by using 145 randomly selected natural images from a
wide range of different image topics. Images were obtained as part of generic public
domain CD-ROM photo packages, which had the images sorted by topic. This enabled us
to ensure that the range of natural images used in training had a high enough variety to
prevent bias towards one type of scene or another. For each image, 300 features were
extracted using the Monte Carlo / Visual saliency method described earlier. In all this
gave us 43,500 features to train ICA/PCA on. The results are shown on table 2.1. For
most channels, a reduction to two channels from six still allowed for over 90% of
variance to be accounted for. However, directional channels that measure direction of
motion and orientation of lines in an image needed three dimensions to still account for
more than 90% of all variance. Assuming that the data is relatively linear and a good
candidate for PCA reduction, this suggests that we can effectively reduce the number of
dimensions to less than half while still retaining most of the information obtained from
feature vectors.
Visual inspection of ICA/PCA results seems to show the kind of output one would expect (Figures 2.10 and 2.11). For instance, when two channels are created from
six, they are partially a negation to each other. On the red/green channel, one of the
outputs seems to show a preference for red. However, the other channel does not
necessarily show an anti-preference for red. This may suggest that preferences for colors
may also depend on the scales of the images. That is, since what makes the six input
images to each channel different is the scale at which they are processed, scale is the
most likely other form of information processed by ICA/PCA. This might mean for
instance that the two channels of mutual information contain information about scaling.
We might guess that of the two outputs from the red/green channel, one might be a
measure of small red and the other of large green things. If this is the case it makes sense
since in nature, red objects tend to be small (berries, nasty animals, etc.) while green
things tend to be much more encompassing (trees, meadows, ponds).
Figure 2.11: From the original image we see the results of ICA/PCA on the red/green and blue/yellow
channels. As can be seen some parts of the outputs are negations of each other which makes sense since
ICA maximizes mutual information. However, close examination shows they are not negatives. It is
possible that scale information applies as a second input type and prevents obvious negation.
2.2.1.5.4 Training NPclassify
To hone the clustering method we use basic gradient descent with sequential quadratic programming using the method described by Powell (1978). This was done offline using the Matlab Optimization Toolbox. For this study, error was defined as the
number of classes found versus how many it was expected to find (see Figure 2.12).
Thus, we presented the clustering algorithm with 80 natural training images. Each image
Figure 2.12: In this image there are basically three objects. NPclassify has found two (colors represent the
class of the location). This is used as the error to train it. So for 80 images it should find x number of
objects. The closer it gets to this number, the better. Notice that the points are clustered in certain places.
This is due to the saliency/Monte Carlo method used for feature selection.
had a certain number of objects in it. For instance an image with a ball and a wheel in it
would be said to have two objects. The clustering algorithm would state how many
classes it thought it found. If it found three classes in an image with two objects then the
error was one. The error was computed as average error from the training images. The
training program was allowed to adjust any of several hard or soft parameters for
NPclassify during the optimization.
The training data was comprised of eight base objects of varying complexity such
as balls and a wheel on the simple side or a mini tripod and web cam on the more
complex side. Objects were placed on a plain white table in different configurations.
Images contained different numbers of objects as well. For instance some images
contained only one object at a time, while other contained all eight. A separate set of
validation images was also created. These consisted of a set of eight different objects with different lighting, created by altering the f-stop on the camera. Thus, the training
images were taken with an f-stop of 60 while the 83 validation images were taken with an
f-stop of 30. Additionally, the angle and distance of view point is not the same between
the training and validation sets. The validation images were not used until after optimal
parameters were obtained by the training images. Then the exact same parameters were
used for the validation phase.
Our first test was to examine if we could at the very least segment images such
that the program could tell which objects were different from each other. For this test
spatial interaction was taken into account. We did this by adding spatial coordinates as two more features alongside the 14 ICA/PCA-reduced features in each feature vector. The sum total of spatial features was weighted about the same as the sum total of non-spatial features. As such, the membership of an object in one segmented class or the other was based half on its location in space and half on its base feature vector composition. Reliability was measured by counting the number of times objects were classified as single objects, the number of times separate objects were merged as one object and the number of times a single object was split into two unique objects.
Additionally, there was a fourth category for when objects were split into more than three
objects. This was small and contained only four instances.
The results were generally promising in that based upon simple feature vectors
alone, the program was able to segment objects correctly with no splits or merges in 125
out of the 223 objects it attempted to segment. In 40 instances an object was split into two
objects. Additionally 54 objects were merged as one object. While on the surface these
numbers might seem discouraging there are several important factors to take into
account. The first is that the program was segmenting based solely on simple feature vectors with a spatial cue. As such it could frequently merge one shiny black object into another shiny black object. In 62% of the cases of a merger, it was obvious that the merged objects were very similar with respect to features.
2.2.1.5.5 NPclassify v. K-Means
NPclassify was also tested on its general ability to classify feature clusters. In this
case it was compared with K-means. However, since K-means requires the number of
classes to be specified a priori, this was provided to it. So in essence, the K-means
experiment had the advantage of knowing how many classes it would need to group,
while NPclassify did not.
The basic comparison test was similar to the test presented in the previous section.
In this case, several Gaussian-like clusters were created from arbitrary 2-dimensional features, with between 1 and 10 classes in each data set. 50 of the sets were clean, with no noise, such that all feature vectors belonged explicitly to a ground-truth class.
However, in 50 other sets, small amounts of random noise were added. The comparison
metric for K-means and NPclassify was how often classes were either split or merged
when they should have not been. The mean error for both conditions is shown below in
figure 2.13. It should be noted that while K-means may be sensitive to noise in data, it is used here since it is well known and can serve as a good baseline for any clustering algorithm.
Figure 2.13: NPclassify is compared with K-means on several data sets. The error in classification for different sets is the same if there is little noise in the data. However, after injecting some noise, NPclassify performs better.
The general conclusion is that compared with K-means, NPclassify is superior
particularly when there is noise in the data. This is not particularly surprising since, as a spanning-tree-style algorithm, NPclassify can ignore non-proximal data points much more easily. That is, K-means is forced to weigh in all data points and really has no innate
ability to determine that an outlying data point should be thrown away. However,
NPclassify will detect the jump in distance to an outlier or noise point from the central
density of the real class.
2.2.1.5.6 Contiguity
Figure 2.14: Tracking from frame 299 to frame 300, the shirt on the man is tracked along with the head without prior knowledge of what is to be tracked. It should be noted that while the dots are drawn in during simulation, the ellipses are drawn in by hand to help with illustration in gray-scale printing.
Contiguity has been tested but not fully analyzed (Figure 2.14). Tracking in video
uses parameters for NPclassify obtained in section 2.2.1.5.4. Thus, the understanding of how to track over consecutive frames is based on the computer's subjective understanding of good continuity for features. In general, classes of features can be tracked for 15 to 30 frames before the program loses track of the object. This is not an impressive result in and of itself. However, several factors should be noted. The first is that VFAT tracks each object without priors for what its features should be. Thus, the program is tracking an object without having been told either to track that object or what the object it is tracking should look like. The tracking is free form and in general without feature-based priors. The major limiter on the contiguity of tracking is that an object may lose saliency as a scene evolves. If an object becomes too low in saliency it will have far fewer features selected from it for processing, which destroys the track of the object given its current feature qualities. However, as will be noted in the next
section, this is not a problem since this tracker is used to hand off trackable objects to a
simple tracker which fixates much better on objects to be tracked.
2.3 The Simple Feature Based Tracker8
Figure 2.15: The Simple tracker works by taking in initial channel values such as ideal colors. These are
used to threshold an image and segment it into many candidate blobs. This is done by connecting pixels
along scan lines that are within the color threshold. The scan lines are then linked which completes a
contiguous object into a blob. The blobs can be weeded out if they are for instance too small or too far from
where the target last appeared. Remaining blobs can then be merged back and analyzed. Finding the center
of mass of the left over blobs gives us the target location. By finding the average color values in the blob,
we can define a new adapted color for the next image frame. Thus, the threshold color values can move
with the object.
Once a signature is extracted using the complex tracker described in the previous section, it can be fed to a faster and simpler tracking device. We use a multi-channel
8 For more information see also: http://ilab.usc.edu/wiki/index.php/Color_Tracker_How_To
tracker, which uses color thresholding to find candidate pixels and then links them
together. This allows it to not only color threshold an image, but to segregate blobs and
analyze them separately. So for instance, if it is tracking a yellow target and another yellow target appears, it can distinguish between the two. Additionally, the tracker also computes color adaptation as well as adaptation over any channel it is analyzing. We compute for instance a new average channel value $\bar{c}$ (2.18) as the sum of all pixel values $c_p$ in this channel over all N pixels p in tracked 'OK' blobs (as seen in figure 2.15), from the current frame t back to some past frame $t'$. In basic terms, this is just the average channel value for all the trackable pixels in several consecutive past frames. Additionally we compute $\sigma$, which is just the basic standard deviation over the same pixels.
(2.18) $\bar{c} = \dfrac{\displaystyle\sum_{i=t'}^{t}\sum_{p=0}^{N} c_{ip}}{\displaystyle\sum_{i=t'}^{t} N_i} \quad\text{and}\quad \sigma = \sqrt{\dfrac{\displaystyle\sum_{i=t'}^{t}\sum_{p=0}^{N}\left(c_{ip} - \bar{c}\right)^2}{\displaystyle\sum_{i=t'}^{t} N_i - 1}}$
Currently, we set a new pixel as a candidate for tracking if, for every channel with pixel value $c_p$:
(2.19) $\bar{c} - \alpha\cdot\sigma \le c_p \le \bar{c} + \alpha\cdot\sigma$
Thus, a pixel is thresholded and selected as a candidate if, on each channel, it falls within the boundary given by the mean value computed from eq. (2.18) plus or minus the product of the standard deviation and a constant $\alpha$. Forgetting is accomplished in the adaptation by simply windowing the sampling interval.
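In code form, equations (2.18) and (2.19) amount to keeping a sliding window of recently tracked blob pixels, computing a running mean and standard deviation per channel, and accepting new pixels that fall within α standard deviations of the mean on every channel. The sketch below assumes a fixed window length in frames; the window size and α values are illustrative parameters, not the values used by the tracker.

from collections import deque
import numpy as np

class AdaptiveChannelThreshold:
    # Sliding-window channel statistics for candidate pixel selection (cf. eqs 2.18-2.19).
    def __init__(self, window=10, alpha=2.0):
        self.frames = deque(maxlen=window)   # forgetting = windowing the sampling interval
        self.alpha = alpha

    def update(self, blob_pixels):
        # blob_pixels: (n_pixels, n_channels) values from this frame's tracked 'OK' blobs.
        self.frames.append(np.asarray(blob_pixels, dtype=float))

    def candidates(self, pixels):
        # Return a boolean mask of pixels within mean +/- alpha*std on every channel.
        data = np.vstack(self.frames)                        # all tracked pixels from t' to t
        c_bar, sigma = data.mean(axis=0), data.std(axis=0, ddof=1)
        lo, hi = c_bar - self.alpha * sigma, c_bar + self.alpha * sigma
        return np.all((pixels >= lo) & (pixels <= hi), axis=1)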
This method allows the tracker to follow a target even if its color changes due to changes in lighting. It should be noted that the simple tracker can track other features in addition to color, so long as one can create a channel for them. That is, an RGB image can be separated into three channels, each of which is a grayscale image; in this case, we create one for red, one for green and one for blue. We can also create images that are, for instance, the responses of edge orientation filters or motion filters. These can be added as extra channels in the simple tracker in the same manner. However, to preserve luminance invariance we use the H2SV color scheme described in appendix G. This is simply an augmentation of HSV color space that solves the singularity at red by converting hue into Cartesian coordinates.
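The essential idea behind that conversion can be sketched as follows. This is only an illustration of re-expressing the circular hue as two continuous components; the exact H2SV1 and H2SV2 transforms actually used are given in appendix G, and the function name hueToCartesian is a placeholder.

#include <cmath>

// Hue is an angle that wraps around at red (0/360 degrees), so reddish
// targets produce a bi-modal hue distribution under plain HSV. Re-expressing
// hue as two Cartesian components removes that discontinuity, so the running
// mean and standard deviation of eq. (2.18) stay unimodal for reddish colors.
// (Illustration only; see appendix G for the real H2SV1/H2SV2 conversions.)
inline void hueToCartesian(double hueDeg, double& h1, double& h2)
{
  const double pi  = 3.14159265358979323846;
  const double rad = hueDeg * pi / 180.0;
  h1 = 0.5 * (std::cos(rad) + 1.0); // first hue component, mapped to 0..1
  h2 = 0.5 * (std::sin(rad) + 1.0); // second hue component, mapped to 0..1
}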
In addition to the basic vision components of the simple tracker, its code design is also important. The tracker is object oriented, which makes it easy to create multiple independent instances of it. That is, we can easily run several simple trackers on the same computer, each tracking a different object from the same video feed. The computational work for each tracker is fairly low: four independent trackers can simultaneously process 30 frames per second on an AMD Athlon 2000 XP processor based machine. This makes it ideal for the task of tracking multiple targets at the same time.
2.4 Linking the Simple and Complex Tracker
In order for the simple tracker and the complex tracker to work together, they have to be able to share information about a target. As such, the complex tracker must be able to extract information about objects that is useful to the simple tracker (Figure 2.16). Additionally, linking the simple tracker with the complex tracker creates an interesting resource-allocation problem. Each simple tracker we instantiate tracks one target at a time, while the complex tracker has no such limit. Only a limited number of simple trackers can be created, so there must be some way to manage how they are allocated to a task based on information from the complex tracker.
Figure 2.16: The simple and complex trackers are linked by using the complex tracker to notice and
classify features. The complex tracker then places information about the feature classes into object feature
class signatures. The complex tracker uses these signatures to keep track of objects over several frames or
to bias the way in which it classifies objects. The signatures are also handed to simple trackers, which track
the objects with greater proficiency. Here we see two balls have been noticed and signatures have been
extracted and used to assign each ball to its own tracker. The smaller target boxes on the floor show that the
simple tracker was handed an object (the floor), which it does not like and is not tracking. Thus, the simple
tracker has its own discriminability as was mentioned in section 2.3 and figure 2.15.
We address the first problem by making sure both trackers work with similar feature sets. For example, when the complex tracker runs, it examines the H2SV color of all the classes it creates and computes the mean color values for each class. This mean color value, along with the standard deviation of the color, can then be handed to the simple tracker, which uses it as the statistical prior color information for the object it should track.
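In code terms, the hand-off can be pictured as packaging the per-channel mean and standard deviation of a complex-tracker class into a small signature and seeding a fresh simple tracker with it. The sketch below is illustrative only; the structure and function names (ChannelPrior, TargetSignature, handOff, seedPriors) are placeholders and not the actual VFAT interfaces.

#include <vector>
#include <cstddef>

// Statistical prior for one channel: the mean and standard deviation the
// complex tracker measured for a feature class (e.g. its H2SV color).
struct ChannelPrior { double mean; double sigma; };

// A target signature is simply one prior per channel.
struct TargetSignature { std::vector<ChannelPrior> channels; };

// Minimal stand-in for a simple tracker: it is seeded with the signature,
// which becomes the starting point for the adaptive statistics of eq. (2.18).
class SimpleTracker
{
public:
  void seedPriors(const TargetSignature& sig) { itsPriors = sig; }
  const TargetSignature& priors() const { return itsPriors; }
private:
  TargetSignature itsPriors;
};

// Hand-off: package the per-channel statistics of a complex-tracker class
// into a signature and use it to start a new simple tracker on that target.
SimpleTracker handOff(const std::vector<double>& classMeans,
                      const std::vector<double>& classSigmas)
{
  TargetSignature sig;
  for (std::size_t ch = 0; ch < classMeans.size(); ++ch)
    sig.channels.push_back(ChannelPrior{ classMeans[ch], classSigmas[ch] });

  SimpleTracker tracker;
  tracker.seedPriors(sig);
  return tracker;
}

Because each simple tracker carries its own seeded statistics, several of them can run side by side on the same frame, one per signature handed over by the complex tracker.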
Figure 2.17: This is a screen grab from a run of the combined tracker. The lower left two images show the
complex tracker noticing objects, classifying and tracking them. The signature is handed to the simple
tracker, which is doing the active tracking in the upper left window. The combined tracker notices the man
entering the room and tracks him without a priori knowledge of how he or the room looks. Once he walks
off the right side, the tracker registers a loss of track and stops tracking. The bars on the right side show the
adapted actively tracked colors from the simple tracker in H2SV color. The lower right image shows that
many blobs can fit the color thresholds in the simple tracker, but most are selected out for reasons such as
expected size, shape and position.
The second issue, resource allocation, is addressed less easily. However, there are simple rules for keeping it under control. First, don’t assign a simple tracker to an object that overlaps with a target another simple tracker is already tracking in the same camera image; that is, don’t waste resources by tracking the same target with two or more trackers. Additionally, since the trackers are adaptive, we may only discover later that two trackers were assigned to the same target. For instance, if one simple tracker is accidentally set to track the bottom of a ball and another the top, then after a few iterations of adaptation both trackers will envelop the whole ball. It is thus advantageous to check for overlap later as well; if we find this happening, we can dismiss one of the simple trackers as redundant. Additionally, our finite resources mean we do not assign every unique class from the complex tracker to a simple tracker. Instead, we try to quantify how interesting a target is. For instance, potential targets for the simple tracker may be more interesting if they are moving, have a reasonable mass or have been tracked by the complex tracker for a long enough period of time.
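These rules can be sketched as a small bookkeeping routine: refuse an assignment when the candidate overlaps a target that an existing simple tracker already covers, periodically flag trackers whose adapted regions have converged onto the same object as redundant, and rank the remaining candidates by a simple interest score. The code below is a hedged illustration; the Box structure, the function names and the scoring weights are placeholders, not the values or interfaces of the actual system.

#include <vector>
#include <cstddef>

// Axis-aligned bounding box of a tracked or candidate target.
struct Box { double x0, y0, x1, y1; };

// Do two target boxes intersect at all?
inline bool overlaps(const Box& a, const Box& b)
{
  return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Rule 1: do not assign a simple tracker to a candidate that overlaps a
// target some other simple tracker is already tracking.
bool mayAssign(const Box& candidate, const std::vector<Box>& trackedBoxes)
{
  for (const Box& t : trackedBoxes)
    if (overlaps(candidate, t)) return false;
  return true;
}

// Rule 2: after adaptation, trackers may converge on the same object; keep
// the first of any overlapping pair and mark the other as redundant.
std::vector<std::size_t> findRedundant(const std::vector<Box>& trackedBoxes)
{
  std::vector<std::size_t> redundant;
  for (std::size_t i = 0; i < trackedBoxes.size(); ++i)
    for (std::size_t j = i + 1; j < trackedBoxes.size(); ++j)
      if (overlaps(trackedBoxes[i], trackedBoxes[j]))
        redundant.push_back(j);
  return redundant;
}

// Rule 3: rank candidates by how "interesting" they are; the weights are
// arbitrary placeholders (movement, mass and track age all raise the score).
double interestScore(bool isMoving, double mass, unsigned int framesTracked)
{
  return (isMoving ? 1.0 : 0.0) + 0.01 * mass + 0.1 * framesTracked;
}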
2.5 Results
On the test videos used, the system described seems to work very well. A video of a man entering and leaving a room (Figure 2.17) was shown five times to the combined complex and simple tracker. In each run, the man was noticed within a few frames of entering the camera's view. This was done without prior knowledge of how the target should appear and without prior knowledge of the room's appearance. The features were extracted and a simple tracker was automatically assigned to track the man, which it did until he left the room, at which point the simple tracker registered a loss of track. Interestingly, the tracker extracted a uniform color over both the man's shirt and his skin. It was thus able, on several occasions, to track the man by both his shirt and his skin. That is, even though the shirt was burgundy and the skin reddish, the combined tracker was able to find a statistical distribution in H2SV color that encompassed the color of both as distinct from the color of objects in the rest of the room.
The tracker was also tested on a video in which a blue and a yellow ball both swing on a tether in front of the camera. In five out of five video runs, both balls were noticed and their features extracted. Each ball was tracked as a separate entity, the program assigning each its own simple tracker. Each ball was tracked until it left the frame, at which point its simple tracker registered a loss of track. The balls even bounce against each other, which demonstrates that the tracker can discriminate between objects even when they are touching or overlapping.
In both videos, objects were tracked without the program knowing the features of the objects a priori. Instead, saliency was used to notice different possible targets, and the complex tracker was used to classify those targets into classes. The resulting classes were then used to hand target properties to the simple trackers as automatically generated prior information about the targets to be tracked. Additionally, a simple tracker registers a loss of track when its target leaves the field of view. This allows us to notice not only when a new target enters our field of view, but also when it leaves.
The tracking was also aided by the use of H2SV color. Prior to using the H2SV color scheme, the purple shirt the man is wearing was split into two objects, since the color of many of its pixels bordered on and even crossed into the red part of the hue spectrum; standard HSV thus created a bi-modal distribution for hue. Using H2SV allowed us to track the purple shirt as well as objects that are reddish in hue, such as skin. H2SV color also works for tracking objects in the center of the spectrum, which we observed by tracking objects that are green, yellow and blue.
In addition to tracking with a static camera, the same experiment was run using a moving camera. This is much less trivial, since the common method of eigen-background subtraction cannot then be used to distinguish new things in a scene from the original scene. Again, the tracker was able to follow a human target without prior knowledge of features, even as the camera moved. This is a distinct advantage for our tracker, and it illustrates the benefit of using saliency to extract and bind features, since saliency can compensate for global motion.
2.6 Discussion
2.6.1 Noticing
The most notable and important aspect of the current work is that we are able to track objects or people without knowing a priori what they will look like, and we are able to do so quickly enough for real-time applications. Thus, we can notice, classify and track a target fairly quickly. This has useful applications in many areas, in particular security, because we track something based on how interesting it is and not on a complete prior understanding of its features. Potentially, we can then track any object or person even if they change their appearance. Additionally, since we extract a viewpoint-invariant signature that describes a target, this information can be used to share target information with other agents.
2.6.2 Mixed Experts
Additionally, we believe we are demonstrating a better paradigm for the construction of intelligent agents, one that uses a variety of experts to accomplish the task. The idea is to use a variety of solutions that work on flexible, weak meta-prior information, and then feed their output as information to a program that is more biased. This is founded on the idea that there is no perfect tool for all tasks and that computer vision comprises many tasks, such as identification, tracking and noticing. To accomplish a complex task of noticing and tracking objects or people, it may be best to utilize many different types of solutions and let them interact. Additionally, by mixing experts in this way, no one expert necessarily needs to be perfect at its job. If the experts have some ability to monitor one another, then when one expert makes a mistake, it can possibly be corrected by another. It should be noted that this tends to follow a biological approach, in which the human brain may be made up of interacting experts, each dependent on other expert regions in order to complete a variety of tasks.
Another important item to note in the mixed-experts paradigm is that, while such an approach may make more intuitive sense, new difficulties arise as the system becomes more abstractly complex. As an example, if one works only with support vector machines, then one has the advantage of a generally well-understood mathematical framework. It is easier to understand a solution's convergence, complexity and stability in a system that is relatively homogeneous. When one mixes experts, particularly if the experts act very differently, the likelihood of the system doing something unexpected or even catastrophic tends to increase. Thus, when one designs an intelligent agent with mixed experts, system complexity should be managed carefully.
2.6.3 System Limitations and Future Work
The system described has its own set of limitations. The work up to this point has concentrated on being able to notice and track objects in a scene quickly and in real time. However, its identification abilities are still somewhat limited. It does not contain a memory with which it can store and identify old targets over the long term. However, such an ability is in the works, and it should be aided by the ability of the tracking system to narrow the area of the image that needs to be inspected, which should increase the speed of visual recognition.
Chapter 3: Contour Integration and Visual Saliency
In the visual world there are many things which we can see, but certain features, sets of features and other image properties tend to draw our visual attention towards them more strongly. A very simple example is a stop sign, in which the combination of red color, the angular features of an octagon and the strong word “stop” creates something that, hopefully, we would not miss if we came upon it. Such propensity of some visual features to attract attention defines in part the phenomenon of visual saliency. Here we assert, as others have (James, 1890, Treisman & Gelade, 1980, Koch & Ullman, 1985, Itti & Koch, 2001b), that saliency is drawn from a variety of factors. At the lowest levels, color opponents, unique orientations and luminance contrasts create the effect of visual pop-out (Treisman & Gelade, 1980, Wolfe, O'Neill & Bennett, 1998). Importantly, these studies have highlighted the role of competitive interactions in determining saliency: hence, a single stop sign on a natural scene backdrop is usually highly salient, but the saliency of that same stop sign, and its ability to draw attention, is strongly reduced when many similar signs surround it. At the highest levels it has been proposed

vi
Table of Contents
Epigraph ii
Dedication iii
Acknowledgements iv
List of Tables x
List of Figures xi
Abbreviations xv
Abstract xviii
Preface xix
About this thesis xix
Graduate works not included in this thesis xx
Other works of interest not included in this thesis xxi
Don’t read the whole thesis xxii
Chapter 1: A Brief Introduction to Vision and Attention 1
1.1 What Does our Brain Want to Look For? 5
1.2 How Does our Brain Search for What it Wants? 9
1.2.1 What’s a Feature? 9
1.2.2 How do we Integrate These Features? 12
1.2.3 Beyond the Basic Saliency Model 17
1.3 The Current State of Attention and Other Models 18
1.3.1 Top-Down Models 19
1.3.2 Other Contemporary Models of Saliency 20
1.3.3 The Surprise Model 22
Chapter 2: Distributed Biologically Based Real Time Tracking with Saliency
Using Vision Feature Analysis Toolkit (VFAT) 23
2.1.1 Vision, Tracking and Prior Information 23
2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors 26
2.1.4 The iRoom and Meta-prior Information 30
2.2 Saliency, Feature Classification and the Complex Tracker 31
2.2.1 Complex Feature Tracker Components 34
2.3 The Simple Feature Based Tracker 59
2.4 Linking the Simple and Complex Tracker 61
2.5 Results 64
2.6 Discussion 66
2.6.1 Noticing 66
vii
2.6.2 Mixed Experts 66
2.6.3 System Limitations and Future Work 67
Chapter 3: Contour Integration and Visual Saliency 69
3.1 Computation 75
3.2 The model 77
3.2.1 Features 77
3.2.2 The Process 83
3.2.3 Kernel 87
3.2.4 Pseudo-Convolution 91
3.3 Experiments 97
3.3.1 Local element enhancement 98
3.3.2 Non-local Element Enhancement 103
3.3.3 Sensitivity to Non-contour Elements 112
3.3.4 Real World Image Testing 118
3.4 Discussion 122
3.4.1 Extending Dopamine to Temporal Contours via TD (dimensions) 125
3.4.2 Explaining Visual Neural Synchronization with Fast Plasticity 126
3.4.3 Contours + Junctions, Opening a New Dimension on Visual Cortex 127
3.4.4 Model Limitations 128
3.5 Conclusion 129
Chapter 4: Using an Automatic Computation of an Image’s Surprise to
Predicts Performance of Observers on a Natural Image Detection Task 130
4.1.1 Overview of Attention and Target Detection 131
4.1.2 Surprise and Attention Capture 134
4.2 Methods 136
4.2.1 Surprise in Brief 136
4.2.2 Using Surprise to Extract Image Statistics from Sequences 139
4.3 Results 144
4.4 A Neural Network Model to Predict RSVP Performance 152
4.4.1 Data Collection 153
4.4.2 Surprise Analysis 154
4.4.3 Training Using a Neural Network 154
4.4.4 Validation and Results of the Neural Network Performance 156
4.5 Discussion 164
4.5.1 Patterns of the Two-Stage Model 165
4.5.2 Information Necessity, Attention Gating and Biological Relevance 169
4.5.3 Generalization of Results 173
4.5.4 Comparison with Previous RSVP Model Prediction Work 173
4.5.5 Network Performance 174
4.5.6 Applications of the Surprise System 175
4.6 Conclusion 176
viii
Chapter 5: Modeling of Attentional Gating using Statistical Surprise 177
5.1 From Surprise to Attentional Gating 180
5.2 Methods 183
5.2.1 Paradigm 183
5.2.2 Computational Methods 184
5.3 Results 188
5.3.1 Relation of Results to Previous Studies Which Showed Causal Links
between Surprise and Target Detection 193
5.4 Discussion 196
5.4.1 Variability of the Attention Gate Size Fits within the Paradigm 196
5.4.2 The Attention Gate may Account for Some Split Attention Effects 198
5.4.3 Unifying Episodic Attention Gate Models with Saliency Maps 199
Chapter 6: A Comparison of Surprise Methods and Models Using the Metric of
Attention Gate (MAG) 201
6.1 The MAG Method for Comparison of Different Models 201
6.1.1 Fishers Linear Discriminant and Fitness 203
6.1.2 Data Sets Used 206
6.2 Comparison of Opponent Color Spaces using MAG 207
6.2.1 iLab RGBY 210
6.2.2 CIE Lab 211
6.2.3 iLab H2SV2 214
6.2.4 MAG Comparison of Color Spaces 214
6.3 Addition of Junction Feature Channels 216
6.4 Comparison of Different Statistical Models 217
6.5: Checking the Problem with Beta 219
6.5.1 Asymptotic Behavior of β 219
6.5.2 What Happens if We Fix the β Hyperparameter to a Constant Value? 221
6.5 Method Performance Conclusion 226
References 228
Appendices 245
Appendix A: Contour Integration Model Parameters 245
Appendix B: Mathematical Details on Surprise 246
Appendix C: Kullback-Liebler Divergences of Selected Probability
Distributions 253
C.1 Conceptual Notes on the KL Distance 253
C.2 KL of the Gaussian Probability Distribution 255
C.3 KL of the Gamma Probability Distribution 255
C.4 KL of the Joint Gamma-Gaussian or Gamma-Gamma Distribution 258
Appendix D: Junction Channel Computation and Source 262
D.1 Junction Channel Source Code 264
Appendix E: RGBY and CIE Lab Color Conversion 267
E.1 RGBY Color Conversion 267
ix
E.2 CIE Lab Color Conversion 268
Appendix F: HSV Color Conversion Source 273
F.1 RGB to HSV Transformation 274
F.1.1 HSV Transformation C / C++ Code 275
F.2 HSV to RGB Transformation: 278
F.2.1 RGB Transformation C/C++ Code 279
Appendix G: H2SV Color Conversion Source 281
G.1 HSV to H2SV Transformation 282
G.1.1 HSV to H2SV1 Variant 282
G.1.2 HSV to H2SV2 Variant 283
G.2 H2SV to HSV Simple Transformation 284
G.2.1 H2SV1 to HSV Simple 284
G.2.2 H2SV2 to HSV Simple 284
G.3 H2SV to HSV Robust Transformation 285
G.3.1 General Computations: 285
G.3.2 C / C++ Code for Robust Transformation 286
Appendix H: Selected Figure Graphing Commands for Mathematica 288
x
List of Tables
Table 2.1: Variance accounted for in ICA/PCA. 50
Table 3.1: Table of probabilities of results at random. 107
Table 3.2: Types of features found salient by CINNIC. 119
Table 4.1: M-W feature significance per type. 145
Table 6.1: MAG scores for color spaces. 213
Table 6.2: MAG scores for junction filters. 215
Table 6.3: MAG scores for statistical models. 217
Table 6.4: MAG scores for different values of beta. 223
xi
List of Figures
Figure 1.1: Examples of retinotopic maps of the visual cortex. 2
Figure 1.2: What does the brain find visually interesting? 4
Figure 1.3: Why the brain looks for so many types of features. 6
Figure 1.4: The increasing complexity of the visual system. 7
Figure 1.5: Examples of basic feature detectors. 11
Figure 1.6: Generations of feature based attention models. 13
Figure 1.7: Orientation features and Gabor pyramid example with Ashes. 15
Figure 1.8: Butterfly regions and contour integration example. 16
Figure 1.9: Examples of top-down models of attention. 19
Figure 2.1: Bayesian priors and Meta Priors spectrum. 26
Figure 2.2: From features to ICA to clustering. 32
Figure 2.3: The VFAT architecture graph. 35
Figure 2.4: General saliency model graph. 36
Figure 2.5: Junction detection from INVT features with ICA. 38
Figure 2.6: Feature clustering example shown with node climbing. 40
Figure 2.7: Examples of feature clustering on different data points. 42
Figure 2.8: NPclassify compared with K-means. 43
Figure 2.9: Example of similarity by statistical overlap. 45
Figure 2.10: Example of feature output following ICA/PCA. 51
Figure 2.11: ICA inversion and color features. 53
Figure 2.12: Example of image feature clustering. 54
xii
Figure 2.13: NPclassify compared quantitatively with K-means. 57
Figure 2.14: Features clustered during tracking. 58
Figure 2.15: The simple feature tracker. 59
Figure 2.16: The complex tracker handing off to simple trackers. 62
Figure 2.17: Screen shot of the VFAT based tracker. 63
Figure 3.1: The Braun Make Snake contour. 70
Figure 3.2: The basics of contour alignment and processing. 78
Figure 3.3: Neuron priming diagram. 80
Figure 3.4: Neuron group suppression in theory. 82
Figure 3.5: The basics of the CINNIC alignment and processing. 84
Figure 3.6: Hypercolumns and pseudo-convolution. 91
Figure 3.7: Breakdown of the CINNIC process. 95
Figure 3.8: CINNIC multiple scales and averaging. 96
Figure 3.9: 2AFC simulation for the Polat Sagi display. 99
Figure 3.10: Fit of CINNIC to observer AM. 101
Figure 3.11: Interaction of element size and enhancement. 103
Figure 3.12: CINNIC working on Make Snake contours. 105
Figure 3.13: Performance of CINNIC on Make Snake contours. 106
Figure 3.14: The subjective perception of contours and element separation. 108
Figure 3.15: Accounting for performance of CINNIC with kernel size. 110
Figure 3.16: CINNIC sensitivity to junctions. 113
Figure 3.17: Explaining sensitivity of junctions by CINNIC. 115
Figure 3.18: CINNIC sensitivity to salient locations and face features. 120
xiii
Figure 3.19: CINNIC and fast plasticity. 127
Figure 4.1: Overview of the surprise system. 138
Figure 4.2: The surprise map over sequence frames. 141
Figure 4.3: Peaks of surprise seem predictive. 144
Figure 4.4: Mean surprise and visual features. 148
Figure 4.5: Standard deviation of surprise and visual features. 150
Figure 4.6: Spatial location of max surprise and visual features. 151
Figure 4.7: The surprise prediction system. 155
Figure 4.8: How surprise prediction was analyzed. 158
Figure 4.9: Performance of surprise prediction. 162
Figure 4.10: Theoretical aspects of surprise prediction. 171
Figure 5.1: Surprise peaks at flankers for hard targets. 179
Figure 5.2: Attention gating and the contents of working memory. 180
Figure 5.3: From RSVP to attention gate computation. 182
Figure 5.4: Computation of the attention gate. 186
Figure 5.5: Computing the overlap ratio. 189
Figure 5.6: Surprise attention gate quantitative results. 191
Figure 5.7: Subjective results on Transportation Targets. 192
Figure 5.8: Subjective results on Animal Targets. 193
Figure 5.9: Explaining past results for Easy-to-Hard. 195
Figure 5.10: Attention gating and detecting multiple targets. 199
Figure 6.1: Which of the two models is better or worse? 202
Figure 6.2: Pretty fisher information graph 205
xiv
Figure 6.3: The MAG, an overview. 207
Figure 6.4: A general color space overview. 208
Figure 6.5: RGBY Color space example. 210
Figure 6.6: CIE Lab color space example. 211
Figure 6.7: H2SV2 color space example. 212
Figure 6.8: MAG and color space results. 213
Figure 6.9: MAG and junction filter results. 215
Figure 6.10: MAG and statistical model results. 217
Figure 6.11: The asymptotic behavior of beta. 220
Figure 6.12: MAG performance for different values of beta. 223
Figure B.1: Different views on the Gamma PDF. 247
Figure B.2: Surprise in Wows! 248
Figure B.3: The DoG Filter. 251
Figure C.1: From a PDF to the integrated KL region. 254
Figure C.2: The Joint gamma-gamma KL. 257
Figure D.1: The junction filter. 262
Figure E.1: CIE 1931 XYZ color space. 269
Figure E.2: Map of the CIE Lab gamut space. 270
Figure F.1: HSV color space. 273
Figure G.1: H2SV color space. 281
xv
Abbreviations
AI Artificial Intelligence
AIP Anterior Interparietal Sulcus
AMD Advanced Micro Devices
BPNN Back Propagation Neural Network
CIE International Commission on Illumination
CINNIC Carefully Implemented Neural Network for
Integrating Contours
CRT Cathode Ray Tube (monitor)
DoG Difference of Gaussian
EPSP Excitatory Post Synaptic Potential
EQ Equation
ERF Error Function
ERFC Complementary Error Function
fMRI Functional (Nuclear) Magnetic Resonance Imaging
FS Fast Spiking
GABA Gamma Aminobutyric Acid
GB Gigabyte (1 billion bytes)
GCC GNU C++ Compiler
GIMP GNU Image Manipulation Program
GNU GNU's Not Unix [sic] (An open source, free
software consortium)
GPL GNU General Public License
xvi
HSV Hue/Saturation/Value
H2SV HSV Variant with two hue components
H2SV2 H2SV with Red/Green Blue/Yellow opponents
Hz Hertz (cycles per second)
ICA Independent Component Analysis
INVT iLab Neuromorphic Vision Toolkit
IPSP Inhibitory Post Synaptic Potential
IT Inferior Temporal Cortex
KL Kullback-Liebler Divergence (sometimes called the
KL distance)
Lab CIE Lab Color (Luminance with two
opponents, a Red/Green b Blue/Yellow)
MAG Metric of Attention Gate
MHz Megahertz (1,000,000 cycles per second)
ms Milliseconds (1/1000 of a second)
O Worst Case Asymptotic Complexity (called
the big “O” notation)
OpenCV Open Computer Vision (Intel Toolkit)
PCA Principal Component Analysis
PDF Probability Distribution Function
PFC Pre-Frontal Cortex
POMDP Partially Observable Markov Decision Process
RAM Random Access Memory
RGB Red, Green and Blue Color
xvii
RGBY Red/Green and Blue/Yellow Color
RMSE Root Mean Squared Error
RSVP Rapid Serial Vision Presentation
SMA Supplementary Motor Area
SQRT Square Root
T Terrell
TD Temporal Difference
V1 Primary Visual Cortex
V2 – V5 Regions of Extrastriate Cortex
VFAT Vision Feature Analysis Toolkit
WTA Winner Take All
xviii
Abstract
What draws in human attention and can we create computational models of it
which work the same way? Here we explore this question with several attentional models
and applications of them. They are each designed to address a missing fundamental
function of attention from the original saliency model designed by Itti and Koch. These
include temporal based attention and attention from non-classical feature interactions.
Additionally, attention is utilized in an applied setting for the purposes of video tracking.
Attention for non-classical feature interactions is handled by a model called CINNIC. It
faithfully implements a model of contour integration in visual cortex. It is able to
integrate illusory contours of unconnected elements such that the contours “pop-out” as
they are supposed to and matches in behavior the performance of human observers.
Temporal attention is discussed in the context of an implementation and extensions to a
model of surprise. We show that surprise predicts well subject performance on natural
image Rapid Serial Vision Presentation (RSVP) and gives us a good idea of how an
attention gate works in the human visual cortex. The attention gate derived from surprise
also gives us a good idea of how visual information is passed to further processing in
later stages of the human brain. It is also discussed how to extend the model of surprise
using a Metric of Attention Gating (MAG) as a baseline for model performance. This
allows us to find different model components and parameters which better explain the
attentional blink in RSVP.
xix
Preface
About this thesis
This thesis is about the computational modeling of visual attention and surprise.
The aspects that will be covered in this work include:
• Utilization of the computation of attention in engineering.
• Extensions to the computational model of attention and surprise.
• Explaining human visual attention and cognition from simulation using
computational models.
This work is integrative and based on the philosophy that computer vision is aided
by better understanding of the human brain and it’s already developed exquisite
mechanisms for dealing with the visual world as we know it. At the same time,
development of biologically inspired computer vision techniques, when done correctly,
yields insight into the theoretical workings of the human brain. Thus, the integration of
engineering, neuroscience and cognitive science gives rise to useful synergy.
The second chapter covers the utilization of saliency as an engineering topic. This
is an example of applying what we have learned from the human brain towards an
engineering goal pursued with real world applications in mind. It is somewhat more
applied and as a result, many components are not biologically motivated. The reader
should keep in mind that project goals placed constraints on what can be done. In this
xx
case, a real time system able to process images very quickly was needed. Additionally,
the project as is typical for engineering endeavors required “deliverables”.
Chapters three and six cover methods for extending or changing the way in which
surprise is computed. In the case of the former, a model of contour integration is created
and examined. This allowed the creation of an extension to the basic saliency model for
non-local interactions. Its primary contribution however turned out to be gainful
knowledge of the human visual mechanisms involved.
The fourth and fifth chapters deal with temporal dimensions of attention using
surprise. The goals are to test and extend the model to see if predictions can be made of
observer performance. Thus, it is suggested that a better fit model, which is improved in
its ability to predict human performance, is closer to the actual mechanisms which the
human brain uses. This also has reciprocal engineering applications since it can be used
to help determine what humans will attend to in a dynamic scene.
Graduate works not included in this thesis
I have tried to keep all work included in this document constrained to the topic of
visual attention and to work with salient results. As such, much of the work I have done
in pursuit of my doctorate is not included. These works include, but are not limited to (in
chronological order):
• The Beobot Project (Mundhenk, Ackerman, Chung, Dhavale, Hudson,
Hirata, Pichon, Shi, Tsui & Itti, 2003a)
• Schizophrenia and the Mirror Neuron System (Arbib & Mundhenk, 2005)
• Estimation of missing data in acoustic samples (Mundhenk, 2005)
xxi
• Surprise Reduction and Control (Mundhenk & Itti, 2006)
• Three Dimensional Saliency (Mundhenk & Itti, 2007)
Of interest in particular is the work on Schizophrenia and Mirror Neuron system which
has been cited 45 times according to Google scholar. Also of interest is the Beobot
project paper which was the most downloaded paper from iLab for three years straight,
and it is still in the top five downloads to this day.
Other works of interest not included in this thesis
Also not included is the large amount of educational materials created and posted
online. These include:
• http://www.cool-ai.com –AI homeworks, projects and lecture notes for usage in
AI courses.
• Wikipedia and Wikiversity – contributions including:
o http://en.wikiversity.org/wiki/Learning_and_Neural_Networks - Created
self-guided teaching page on Neural Networks.
o http://en.wikipedia.org/wiki/Cicadidae - Contributed Wikipedia featured
picture of the day and written content.
o http://en.wikipedia.org/wiki/Gamma_distribution - contributed graphics
and corrections.
o http://en.wikipedia.org/wiki/Kullback-Leibler_divergence - contributed
graphics and corrections.
o http://en.wikipedia.org/wiki/Methods_of_computing_square_roots -
Added algorithms and analysis.
xxii
• http://www.cinnic.org/CINNIC-primer.htm – Contour Integration Primer.
• http://www.nerd-cam.com/how-to/ - Detailed Instructions on how to build your
own robotic camera.
Don’t read the whole thesis
This thesis uses the standard “stapled papers” framework. While each chapter has
been integrated into a coherent work, they each will stand on their own. As a result, the
reader is advised to get what they want and get out. That is, go ahead and read a chapter
which interests you, but don’t bother to read other parts. However, there tends to be more
information here than in the authors papers cited. As such, this thesis may be of use in
getting some of the model details not covered in the authors published materials due to
space constraints in peer reviewed journals.
Have fun
T. Nathan Mundhenk
1
Chapter 1: A Brief Introduction to Vision and Attention
You got to work today without running over any pedestrians. How did you do
that? To be sure this is a good thing. You can pick up items without even thinking about
it; you can thumb through a magazine until you get to a favorite advertisement; You can
tell a shoe from a phone and you can tell if that giant grizzly bear is in fact gunning for
your ice cream cone. You do all sorts of things like this every day and frequently they
seem utterly simple. To be certain, sometimes you cannot find your keys to save your
life, but even while searching for them, you don’t bang into the furniture in your
apartment, at least to too much. How did you do this? I ask, because like just about every
person on earth, I’m not totally sure.
OK, true, you’ll be glad to know I have some ideas. However, the pages that
follow will only scratch the surface of how human beings such as ourselves view the
world. To this day, much of human vision still remains a mystery. However, many things
about human vision are well established. For instance, we do in fact see from our eyes
and the information from them does travel to our brain. The brain itself is where what we
see is processed and it turns out that its job it not merely to cool our blood as Aristotle
believed it to be.
However, there is a place between seeing and understanding which resides within
human brain itself, and how it takes the items in the world and places them into your
mind is a complicated story. In this work, we will focus on an important part of this
process, the notion of selection and attention. The idea as it were is that not everything
presented to our eyes makes its way from the retina in the eyes to the seat of
2
consciousness. Instead, it seems that most of what we perceive is just a fraction of what
we could. The brain is picky, and it only selects some things to present to us, but many
other things simply fade from being.
Figure 1.1: Retinotopy has been demonstrated repeatedly over the years in the visual cortex. Thus, its
existence is well founded. An early example is given by Gordon Holmes who studied brain injuries in
solders after the first world war (Holmes, 1917, Holmes, 1945) and traced visual deficits to specific injury
cites in visual cortex. Then with primate (Macaque) cortex experiments using Deoxyglucose (Tootell,
Silverman, Switkes & De Valois, 1982) it was shown that a pattern activated a region of visual cortex with
the same shape. However, this method was limited due to the fact that the animal had to be sacrificed
immediately after viewing the pattern in order to reveal and photograph the pattern on the cortex. Later in
2003, with fMRI using sophisticated moving target displays, (Dougherty, Koch, Brewer, Fischer,
Modersitzki & Wandell, 2003) regions in the human brain were shown to correspond to locations in the
visual cortex in much the same way. However, fMRI allows observation in healthy human volunteers,
which is a distinct advantage since more advanced experiments such as those involving motion can be
conducted.
What then does the brain do to select the things it wants to see? One could
suppose that a magic elf sits in a black box in the brain with a black magic marker
3
looking at photos of the world sent to it by the eyes. The elf inspects each photo and
decides if it’s something it believes you should see. Otherwise it marks it with an ‘x’
which means that another magic elf should throw the image away. The idea of magic
elves as a brain process is intriguing, however the evidence does not bear it out. Then
again, the brain is in some sense a black box. Thus, while we do not think that magic
elves are the basis for cognition, we still must make inferences about the brains basic
workings from a variety of frequently indirect evidence.
For instance, we can probe the brain of other primates. In figure 1.1 it is shown
that we know that the visual cortex receives information from the eyes in retinotopic
coordinates. We know this from experiments on primates where briefly flashed visual
patterns caused a similar pattern to form on the visual cortex (Inouye, 1909, Holmes,
1917, Holmes, 1945, Tootell et al., 1982). Does the same thing happen in the human
brain? The general consensus is yes, many pieces of visual information from the eye line
up on the back of the brain somewhat like a movie projecting onto a screen. Newer
studies with functional magnetic resonance imaging (fMRI) on humans reinforces this
idea (Horton & Hoyt, 1991, Hadjikhani, Liu, Dale, Cavanagh & Tootell, 1998,
Dougherty et al., 2003, Whitney, Goltz, Thomas, Gati, Menon & Goodale, 2003). Still,
the evidence is indirect. No one has seen the movie on the back of the brain, but
fortunately, the evidence is satisfying.
Retinotopy in the visual cortex is an example of something which is well founded
even if the evidence is sometimes indirect. However, do we have such a good notion
about how the brain selects what it wants to see from input coming from the eyes? It
turns out sort of, but not completely. However, this is not without good reason. What
4
captures ones attention is quite complex (Shiffrin & Schneider, 1977, Treisman &
Gormican, 1988, Mack & Rock, 1998). So for instance, things which are colorful tend to
get through the brain much easier than things which are dull. This is for instance why
stop signs are red and not gray. This is also why poisonous snakes or monarch butterflies
(which are also poisonous) have such vivid colors. Interestingly, it is not just the colors
which attract our attention it is how the colors interact. For instance, something which is
blue attracts more attention when it is next to something yellow while something red
tends to get more attention when it is next to something green. So it’s not just the color of
something that makes it more salient, it’s how the colors interact as opponents.
Figure 1.2: What does the brain find visually interesting? There are many things (from left to right). Good
continuation of objects which form a larger percept is interesting. Conspicuous colors, particularly the
opponents red/green and blue/yellow stand out. Objects with unique features and special asymmetries
(Treisman & Souther, 1985) compared with surrounding objects can stand out. Also motion is a very
important cue.
Ok, seems pretty simple, but that was just one piece of a rather gigantic puzzle.
Just a sampling of what is visually interesting is shown in figure 1.2. It turns out that
edges, bright patches, things which are moving and things which are lined up like cars in
traffic and … well many things all can attract your attention and control what it is that
your brain deems interesting. Still it gets even more complex, your brain itself can decide
to change the game and shift the priority on certain visual features. As an example, if you
5
are looking for a red checker, your brain could decide to turn up the juice on the red color
channel. That is, your brain can from the top-down change the importance of visual items
making some things which were less interesting more interesting and vice versa (Shiffrin
& Schneider, 1977, Wolfe, 1994a, Navalpakkam & Itti, 2007, Olivers & Meeter, 2008).
So just on the front, we can see that the notion of visual attention and what gets
from the retina in the eyes to the seat of thought is quite complex. It involves a great deal
of things which interact in rather complex and puzzling ways. However, as mentioned we
do know many things, and we are discovering new properties every day. Hopefully this
work will help to illuminate some of the processes by which the visual world can pass
through brain into the realm of thought.
1.1 What Does our Brain Want to Look For?
Imagine that the world was not in color. Further, imagine that all you could see
was the outlines of the stuff that makes up the world. You would still need to move
around without tipping over chairs and be able to eat and recognize food. What then
would draw your attention? You can still tell how to identify many things. After all, it is
the world of lines which makes up the Sunday comic strips. You might not be able to tell
something’s apart which you could back in our colorful world, but for the most part you
could tell a table from a chair or an apple from a snake. In this case, what would your
brain look for?
6
Figure 1.3: Why does the brain look for so many different types of features? It depends on what it needs to
find. Some images are defined by lines, others by colors and some by the arrangement of objects. All of the
images shown are interpretable even though typical natural scene information of one type or another seems
lacking. Shown from Top Left: Atari’s Superman, Picaso’s La Femme Qui Pleure, Gary Larson’s Far Side;
Bottom Left: Liquid Television’s Stick Figure Theater, Van Gogh’s Starry Night Over the Rhone.
In basic terms, what your brain wants to look for is information. Figure 1.3 shows
several different scenes which one can interpret even though the information is presented
very differently with typical information components such as color, lines or texture
missing. As will be reviewed later, images are comprised of features, which are the
essential bits of information for an image. These can include all of the above as well as
more complex features such as junctions and motion. Not all features are necessary for
object identification. A typical example is that people were able to enjoy television before
it was in color.
7
Figure 1.4: (Left) Features gain increasing complexity and their responses become more and more task
dependent. Additionally, visual information is sent down multiple pathways for different kinds of
processing (Fagg & Arbib, 1998). Here the task of grasping a mug will prime features related to a mug top-down.
These features in turn will be processed in different ways depending on whether we are trying to
identify the mug (Ventral: What) or if we are trying to understand its affordances (Dorsal: How) (Ingle,
Schneider, Trevathen & Held, 1967). How the brain splits visual information in this way and then
reintegrates it, is still not completely understood.1 (Right) The connection diagram of Felleman and Van
Essen (Felleman & Van Essen, 1991) of the primate visual cortex demonstrates that elegant models such as
the one by Fagg and Arbib still only scratch the surface of the full complexity of the workings of the brain.
In addition to the essential bits of an image which are important, what the brain
wants to see is also based on the task at hand. Figure 1.4 illustrates a model for the task
of grasping an object (Fagg & Arbib, 1998). Initially the object to be grasped must be
spotted. If a person has some idea of what they are looking for, then they can attempt to
try and focus their attention towards something that matches the expected features of the
object. For instance, if the object to be grasped is a red mug, then the initial search for it
should bias one to look for red and round things. Such a bias becomes even more
1 This is a reconceptualiztion of the original Fagg & Arbib figure which appears in:
[44] Fellous, J.-M., & Arbib, M.A. (2005). Who Needs Emotions? The Brain Meets the Robot. Oxford:
Oxford University Press.
8
important in a cluttered scene where many simple salient items may be a distraction.
Otherwise, finding a red mug in a plain white room would be more simple.
Once the object has been spotted, appropriate features must be further extracted
such as geometric land marks (Biederman, 1987, Biederman & Cooper, 1991). So the
brain will need to find essential characteristics of the object for the task. In this case, we
want to grasp or pick up the object. If a portrait of Weird Al Yankovic is painted on the
side of the mug, it might grab our attention, but it is unimportant for the task of acquiring
the mug. Instead, we should ignore the portrait and just scan the geometry. The task
might be entirely different if we had another action we wanted to execute. For instance, if
someone asks us whose face is on the mug, we would want to scan for face like features
and perhaps ignore the geometric properties completely.
In the mug example, we can imagine that many other factors might come into
play. For instance the scene might change unexpectedly. As an example, our clumsy
relative might have knocked over the mug. This sudden change in the scene would come
as a surprise and should initiate a change in attention priorities. If the coffee is flowing
towards my notebook computer I should notice that as soon as possible. Then I should
perhaps cancel my grasping action and search for paper towels or maybe make a grasp
for my computer.
The brain also sometimes has very little choice in what it looks for. Some things
are highly salient such as a stop sign or an attractive person. It can be hard to override the
innate bottom-up search system at times. Thus, many things are attended to fairly quickly
and automatically. This is a rather important trait, a rock hurling towards you at great
speed demands your attention more than a cup of coffee. As such, we can see that what
9
the brain wants to see also depends on automatic bottom-up systems which can preempt
current task demands.
1.2 How Does our Brain Search for What it Wants?
1.2.1 What’s a Feature?
What the brain wants to see is based on what is useful for it to see. Early on, after
the invention of photography in the 19th century, many artists began to rethink what it
was that they were doing. Up until then, artists created the essence of photographs with a
paint brush, but since a machine could do the same thing faster and cheaper, direct
photographic style artistry seemed like it would become archaic. This helped to bring
about the Impressionist style of art. What is notable to our discussion is that artists began
to experiment with imagery where fundamental features of a painting could be altered,
but the scene could still be interpreted. As structure and form of art was changed and
experimented with, it became more obvious that the brain did not need a direct
photograph of a scene in order to understand it. Instead, it merely needed some form of
structure which resembled the original scene.
Partially as a result of this new way of looking at the world, early 20th century
cognitive scientists began to think about how objects and components of an image could
be linked together to create something which the brain could understand. Both
Structuralists such as William James (James, 1890) and in particular Gestalt
psychologists such as Max Wertheimer (Wertheimer, 1923) and Kurt Koffka (Koffka,
1935) began to think about how the brain can take in parts of a scene and assemble them
10
into something the brain understands. They believed that perception was a sum of the
parts, but at the time they lacked the scientific abilities to prove their ideas.
That the visual world was composed of parts which the brain assembles had been
proposed. However, what these parts looked like or what form they took was far from
certain. Several theories came forward over the years to refine what kind of parts the
brain uses to create the whole. A popular term for the elementary parts of an image was
features. Several scientists in the 1950’s such as Gibson (Gibson, 1950), Barlow (Barlow,
1961) and Attneave (Attneave, 1954) began to note that prior information about shapes,
line and textures could be collected and used to interpret abstracted scenes statistically. In
particular, Fred Attneave proposed that much of the visual world is redundant and
unnecessary for the task of recognition. A cat for instance could be represented as points
(or perhaps better as junctions) which are connected by the brain to form the perception
of a cat. Under this assumption, a large mass of visual information presented to the
retinal, for instance all the parts of the image which are not junctions are extraneous.
Partially as a result of such assertions, several theories were put forward claiming that
there should be a bottleneck in attention (Broadbent, 1958, Deutsch & Deutsch, 1963).
As such, the picture of the visual world was still hazy, but several theories were now
giving an idea of how the brain sees the world and what it wants to find. First, the brain
compiles images from parts to create a whole. Second, features of an image as simple as
points, lines, textures or junctions scattered about a scene may be sufficient in order for
the brain to understand an image, but that there may be limits on how much the brain can
process at one time. However, several questions remained. First, what kind of features is
11
the brain looking for and second how does the brain look for and process these features
keeping in mind that it has some limitations on capacity?
Figure 1.5: (Left) Early visual processing by the brain looks for simple features. For instance the retinal
begins by sorting out color opponents such as red/green and blue/yellow (Kuffler, 1953, Meyer, 1976).
While the understanding of the center surround mechanism is somewhat recent, knowledge of the
arrangement of color opponents is very old and its theory can be traced at least as far back as to the German
physiologist Ewald Hering in 1872 (Hering, 1872) but was first described physiologically in the goldfish
(Daw, 1967). We can simulate these mechanisms using the filters shown. Here we see DoG (Difference of
Gaussian) filters which give the On Center / Off Surround response (von Békésy, 1967, Henn & Grüsser,
1968) for colors (Luschow & Nothdurft, 1993). (Right) Later, the visual cortex utilizes hyper columns
(Hubel & Wiesel, 1974) to find lines in an image. We can use wavelets like the one on the right to give a
response to lines in an image (Leventhal, 1991). The type of wavelets used are typically called Gabor
wavelets in honor of the Hungarian engineer Gábor Dénes (Dennis Gabor). (Bottom) The bottom row
shows a cross section of the filters on the top.
The answers to these questions began to congeal with the development of
improved psychometric instrumentation in the 1960s that could better time and control
the reaction of human subjects with a wide variety of stimulus. [For instance see
(Sperling, 1960, Raab, 1963, Sperling, 1965, Weisstein & Haber, 1965)]. This was
accompanied by improved psychophysical instrumentation capable of direct
12
measurement of neural activity in animals [For instance (Daw, 1968, Henn & Grüsser,
1968)]. By the 1970’s combined with the seminal work by David Hubel and Torsten
Wiesel (Hubel & Weisel, 1977) we were starting to get a pretty good idea of what kind
of elementary features the brain is looking for. In figure 1.5 we see some of the features
which we knew the brain to be sensitive to by the mid 1970’s. The brain has literal
detectors for lines and color opponents such as red/green and blue/yellow. It should be
noted however, that this is still the beginning of the story. We knew that there was a set
of simple features which the visual cortex would pick up on, but there was no idea how
these features could be assembled into larger objects. Additionally, were there more
features or was this the full basis set?
1.2.2 How do we Integrate These Features?
By the 1970’s two important concepts were beginning to emerge. One was the
notion of focused attention. That is, if Attneave and his contemporaries are correct, the
brain might be wise to only spend time processing parts of a scene and not the whole
thing. Second, features such as lines and colors integrate and bind in the brain. For
instance, it had been known since the 1930’s that the brain can bind colors and words.
John Stroop (Stroop, 1935) showed that flashing a word such as “blue” but coloring it red tended to trip up and slow down observers asked to name the color. Would such a
mechanism also apply at the level of feature integration?
Figure 1.6: Three generations of models of feature based attention are shown in succession. Starting with
Treisman, Gelade & Gormican (Treisman & Gelade, 1980, Treisman & Gormican, 1988)2 it was
hypothesized that the way visual features such as lines and colors integrate in parallel controls the serial
components of attention. This model itself is a refinement of earlier theories of attention, for instance
Shiffrin and Schneider's theory of automatic and controlled attention (Shiffrin & Schneider, 1977) and the
pre-attentive and focal attention model of Neisser (Neisser, 1967). Later Koch and Ullman (Koch &
Ullman, 1985) expanded this with the notion of having a saliency map which controls the spotlight of
attention with a winner-take-all network. Following this, it was made into a fully functional computational
model by Itti and Koch (Itti & Koch, 2001b).
Several theoretical constructs were advanced and lead to increasing understanding
on the question of attention (Figure 1.6). It was discovered that attention seems to be
focal and that only parts of an image actually reach what many people would call
consciousness. In 1967, this hypothesis was put forward by Ulric Neisser (Neisser, 1967)
who suggested that there was a pre-attentive phase to visual processing when features
were gathered together in parallel, but that later the features combined and were inspected
serially by focal attention. This was further expanded by Richard Shiffrin and Walter
Schneider (Shiffrin & Schneider, 1977) who saw a second dimension to attention. They
suggested that some parts of attention are automatic and some parts are controlled. That
2 This drawing is from Treisman and Gormican 1988. It is based on the feature integration theory given in
Treisman and Gelade 1980. However, Treisman and Souther 1985 gives a very similar figure.
is, some features in an image grab our attention automatically and almost reflexively.
However, we are also consciously able to control some of the things we attend to. This is
what is now thought of in broader terms as bottom-up and top-down attention.
In 1980, Anne Treisman and Gerry Gelade further refined these ideas into a
Feature Integration theory of attention (Treisman & Gelade, 1980). Their idea was that
the parallel computation of Neisser could be split into different features which could be
processed separately in the pre-attentive stage and then brought together. Thus, the brain
would compute its interest in colors, lines and intensities at the same time, and it is the summed integration of the different features which determines the locus of attention. That is, attention is driven simultaneously by each type of feature, but the conjunction or independent dominance of a feature can draw in attention. However, the question was left
open as to how the features could combine to create a master map of attention. A possible
answer was given by Christof Koch and Shimon Ullman (Koch & Ullman, 1985) who
gave the idea that the brain maintained a saliency map for the visual world and that a max-selector process (Didday, 1976, Amari & Arbib, 1977) would refine the saliency map
so that only a single location in the visual field would stick out. This allowed for many
things in the world to be salient at the same time, but suggested that the most salient item
of all is that one which the brain will attend to.
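A toy sketch of this winner-take-all idea is given below: the single most salient location wins and is attended, and is then suppressed (inhibition of return) so that the next most salient location can win. This is only a caricature of the Koch and Ullman proposal; the neural max-selector dynamics of Didday and of Amari and Arbib are not modeled, and the inhibition radius and fixation count are arbitrary illustrative parameters.

    import numpy as np

    def winner_take_all_scan(saliency_map, n_fixations=3, ior_radius=10):
        # Repeatedly pick the single most salient location, then suppress a
        # neighborhood around it (inhibition of return) so attention moves on.
        s = saliency_map.astype(float).copy()
        yy, xx = np.indices(s.shape)
        fixations = []
        for _ in range(n_fixations):
            y, x = np.unravel_index(np.argmax(s), s.shape)
            fixations.append((y, x))
            s[(yy - y) ** 2 + (xx - x) ** 2 <= ior_radius ** 2] = -np.inf
        return fixations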
The theories of attention put forward by Treisman et al. as well as Koch and
Ullman gained further support over the next decade due to a variety of experimental
results [For examples see (Nothdurft, 1991b, Nothdurft, 1991a, Nothdurft, 1992,
Luschow & Nothdurft, 1993)]. In 1998 Laurent Itti, Christof Koch and Ernst Niebur
further refined the model of Koch and Ullman and created a comprehensive
computational model that allowed direct testing of it (Itti, Koch & Niebur, 1998). It also
included a comprehensive set of feature detectors as well as a Gaussian/Laplacian
pyramid to detect features at many different scales (Figure 1.7).
Figure 1.7: Gabor wavelet filters give a response to lines in an image. One way to do this is to create four
or more wavelet filters each with its own directional orientation (Itti et al., 1998). On the left this can be
seen as filters sensitive to lines at 0, 45, 90 and 135 degrees. On the right is an image which has been
convolved by the filters at 0 and 90 degrees and the lines that were extracted by the filters. Since lines have
different sizes we can convolve each image at a different scale to increase our chances of discovering lines
of different widths (Tanimoto & Pavlidis, 1975, Burt & Adelson, 1983, Greenspan, Belongie, Goodman,
Perona, Rakshit & Anderson, 1994)3.
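The sketch below illustrates the same idea in Python: a small bank of Gabor kernels at 0, 45, 90 and 135 degrees convolved with an image across a simple 2x-downsampled pyramid. The wavelength, envelope width, kernel size and number of scales are arbitrary illustrative choices and are not the parameters of Itti et al. (1998).

    import numpy as np
    from scipy.ndimage import convolve, zoom

    def gabor_kernel(theta_deg, size=15, wavelength=6.0, sigma=3.0):
        # Real (cosine) Gabor: a plane wave at the preferred orientation
        # under an isotropic Gaussian envelope.
        ax = np.arange(size) - (size - 1) / 2.0
        xx, yy = np.meshgrid(ax, ax)
        t = np.deg2rad(theta_deg)
        xr = xx * np.cos(t) + yy * np.sin(t)
        env = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
        return env * np.cos(2.0 * np.pi * xr / wavelength)

    def orientation_maps(image, thetas=(0, 45, 90, 135), n_scales=4):
        # Convolve the image with each oriented filter at several levels of a
        # 2x-downsampled pyramid to catch lines of different widths.
        maps = {}
        level = image.astype(float)
        for s in range(n_scales):
            for th in thetas:
                maps[(s, th)] = np.abs(convolve(level, gabor_kernel(th)))
            level = zoom(level, 0.5)   # next, coarser pyramid level
        return maps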
The essential gain was that the computer could be treated like a brain in a box. If the
model of Koch and Ullman was correct, then a comprehensive computational model
should have parity with the behavior of humans. Initial results showed that the saliency
3 The cat's name is Ashes.
Figure 1.8: (Top Row) Features that the brain is looking for get increasingly complex. This happens
frequently when simpler features are combined to create new ones (Field, Hayes & Hess, 1993, Kovács &
Julesz, 1993, Polat & Sagi, 1994, Gilbert, Das, Ito, Kapadia & Westheimer, 1996, Li, 1998, Mundhenk &
Itti, 2005). For instance, line fragments which Gabor filters pick up on can then be connected in a
corresponding zone which completes contours. The butterfly pattern on the left will complete a contour
when line fragments lie in the green zone and are aligned. This can be seen on the right where three co-linearly
aligned fragments enhance each other to give a larger response. The graph is somewhat crude, but
the point is that the more elements that are aligned, the stronger the response. (Bottom Row) The elements
aligned into a circle on the left are much more salient than random elements (Kovács & Julesz, 1993,
Braun, 1999). They should produce an activation pattern like the one on the right (Mundhenk & Itti, 2003,
Mundhenk & Itti, 2005). This is discussed at length in chapter 3.
model behaved in a manner that was expected (Itti & Koch, 2001b). The computational
saliency model was able to detect many odd-man-out features, search asymmetries and
conditions for pop-out that would be expected of human observers. Additionally, the
model could be augmented to include top-down attentional effects (Itti, 2000) by adjusting feature weights in a manner similar to the mechanism proposed 25 years
earlier for directed attention by Shiffrin and Schneider (Shiffrin & Schneider, 1977).
Thus, for instance, when looking for a red Coke can, it is almost a simple matter to
weight the red feature more during search.
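A minimal sketch of this kind of top-down weighting is shown below: feature conspicuity maps are combined as a weighted sum, and boosting the weight of a hypothetical red/green channel biases the resulting map toward red targets. The channel names and weight values are invented for illustration and do not correspond to the actual implementation.

    import numpy as np

    def combine_conspicuity(maps, weights=None):
        # maps: dict of per-feature conspicuity maps (e.g. 'red_green', 'intensity').
        # weights: top-down gains; uniform (purely bottom-up) when omitted.
        if weights is None:
            weights = {k: 1.0 for k in maps}
        total = sum(weights[k] * maps[k] for k in maps)
        return total / max(sum(weights.values()), 1e-9)

    # Searching for a red Coke can: boost the red/green channel before combining.
    # saliency = combine_conspicuity(maps, weights={'red_green': 3.0,
    #     'blue_yellow': 1.0, 'intensity': 1.0, 'orientation': 1.0})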
1.2.3 Beyond the Basic Saliency Model
The original saliency model of Itti and Koch lacked three components. One was
the interaction of non-local features. Thus, as can be seen in figure 1.8, contours and line
segments which extend past the classic receptive fields of the basic feature detectors have
been found to be salient (Kovács & Julesz, 1993, Polat & Sagi, 1993b, Gilbert et al.,
1996, Braun, 1999, Geisler, Perry, Super & Gallogly, 2001). The second element missing
was temporal attention. This itself is comprised of three components which may or may
not be independent of each other. They are motion, change and masking. Thus, things
which are in motion tend to draw our attention. However simple changes such as the
appearance or disappearance of an element in a video can draw or attention as well
(Mack & Rock, 1998). The third element of temporal attention, masking, has been
studied quite extensively (Breitmeyer & Öğmen, 2006). It is where something at one
instant in a sequence of images is blocked from perception by something spatially
proximal that comes before or after it. It includes both backwards and forwards masking,
the attentional blink (Raymond, Shapiro & Arnell, 1992) and both automatic and
controlled mechanisms (Sperling & Weichselgartner, 1995, Olivers & Meeter, 2008).
Further, the temporal components of attention are hypothesized to be comprised of more
than one processing stage (Chun & Potter, 1995). The third missing element, top-down attention, has been partially implemented since the original model's inception (Itti, 2000,
Navalpakkam & Itti, 2005). However, a complete model of top-down attention is
probably many years away since it requires construction of the “top” component which
may include consciousness itself.
A non-local extension to the saliency model was eventually provided by T Nathan
Mundhenk (Mundhenk & Itti, 2003, Mundhenk & Itti, 2005) and was extensively tested.
This is covered in chapter 3. The extensions to temporal saliency are covered in
chapters 2, 4, 5 and 6. They include extensions by the addition of a motion channel in
chapter 2 (Mundhenk, Landauer, Bellman, Arbib & Itti, 2004b, Mundhenk,
Navalpakkam, Makaliwe, Vasudevan & Itti, 2004c, Mundhenk, Everist, Landauer, Itti &
Bellman, 2005a) and extension by the usage of Bayesian Surprise in chapters 4, 5 and 6
(Itti & Baldi, 2005, Einhäuser, Mundhenk, Baldi, Koch & Itti, 2007b, Mundhenk,
Einhäuser & Itti, 2009).
1.3 The Current State of Attention and Other Models
Many contemporary models of attention are designed to address one or more of
the shortcomings of the original saliency model discussed in the last section, while others
are attempts at general improvements or are different models altogether.
1.3.1 Top-Down Models
Modeling the factors of top-down v. bottom-up attention goes back very far. As
can be seen in figure 1.9 an early model was provided by Shiffrin and Schneider, but that
model lacked a good notion of feature integration as well as an attentional map. Jeremy
Wolfe (Wolfe, 1994a) provided a good synthesis of the model of Shiffrin and Schneider
with the model of Koch and Ullman. Thus, the effects of top-down control were merged
with a feature integration attention model which also included an attention map.
However, this is an example of a static scene top-down model. That is, prior knowledge
is integrated as a top-down mechanism, but not necessarily online. Current extensions of
this model include the integration of task influence (Navalpakkam & Itti, 2005) as well as
an explanation of feature tuning (Navalpakkam & Itti, 2007).
Figure 1.9: (Left) An early example of an attention model with top-down guided search activation is the
attention model of Shiffrin and Schneider (Shiffrin & Schneider, 1977). Here automatic parallel processing
layers that compute attention can be controlled by a more serialized attention director. (Right) The model
by Wolfe (Wolfe, 1994a) is conceptually a synthesis of Shiffrin & Schneider with Koch and Ullman (Koch
& Ullman, 1985). That is, it has added feature integration and a saliency map.
Many other models which integrate top-down attention are concerned with online
handling of features as well as task demands. Sperling et al. (Sperling, Reeves, Blaser, Lu & Weichselgartner, 2001) have provided one such model with a gamma-shaped window function of attention. Task is treated as a spatial cue to certain locations, allowing a
“Quantal” discrete attention window to be opened at that location for a certain amount of
time. It also includes bottom-up attention using the original term “automatic” attention.
However, like the model of Wolfe, it has not been nearly as completely
implemented as the Itti and Koch model. One might consider it a partial implementation
in comparison.
A recent and important contribution to the modeling of top-down attention is
provided by Olivers and Meeter. This is known as the Boost and Bounce theory of
attention (Olivers & Meeter, 2008). In many ways it is an extension of Sperling et al, but
it has more explicit handling of features as well as an improved description of the
interaction of frontal cortical mechanisms with visual cortical processing. Again,
however, the implementation is very computationally limited.
1.3.2 Other Contemporary Models of Saliency
Currently there are a variety of other attention models in existence. Some are
variants of the model of Itti and Koch (Frintrop, 2006, Itti & Baldi, 2006, Gao,
Mahadevan & Vasconcelos, 2008) while others are more distinct (Cave, 1999, Li, 2002,
Bruce & Tsotsos, 2006). The model by Simone Frintrop is known as VOCUS. Its goal is
to use models of saliency to improve computer vision search. It implements top-down
task improvements in a manner similar to Itti and Koch, but adds a top-down
excitation/inhibition mechanism. It also uses the CIE Lab (McLaren, 1976) color space
for color opponents and implements a form of 3D saliency for laser range finders.
Dashan Gao et al (Gao et al., 2008) have implemented an interesting variation on
Itti and Koch which is to change the treatment of center surround interactions. The center
surround response is termed “Discriminant” center surround because it forms a center
surround response based on the strength of a linear discriminant. The crisper the discrimination between the center of a location and its surround, the stronger the response given at that location. However, this is a mechanism very similar to
the way the model of Surprise (Itti & Baldi, 2005, Itti & Baldi, 2006) computes spatial
attention.
The model by Bruce and Tsotsos (Bruce & Tsotsos, 2006) is an information
maximization model. It works by taking in a series of images and forming a basis set of features. The basis set is then used to convolve an image. The response to each basis
feature is competed against the basis features from all other patches. Thus, if a basis
feature gives a unique response at an image location, it is considered salient. The most
notable difference with this model compared with Itti and Koch is the derivation of basis
features from prior images similar to Olshausen and Field (Olshausen & Field, 1996).
However, the rectification using a neural network may compute competition in a way
which is not sufficiently different from a WTA competition, but it may be arguably more
biologically plausible.
The model by Li differs more substantially. Li's model (Li, 2002) is strongly model
theoretic and somewhat neglects the task of image processing. However, it is claimed that
it can provide saliency pre-attentively without the use of separate feature saliency maps.
Thus, the model should compute a singular saliency by combining feature responses
at the same time. This may be a more plausible method for computing saliency, but it is
unclear if it functionally gains much over other models of saliency.
1.3.3 The Surprise Model
There are two notable trends with saliency models. One is the emergence of
information theoretic constructs and the other is the continued divergence between static
saliency models and dynamic models of attention. With the recent exception of Gao (Gao
et al., 2008) attention models were either static feature based models or dynamic, but
primarily theoretical models (Sperling et al., 2001). The introduction of Surprise based
attention (Itti & Baldi, 2005, Itti & Baldi, 2006) created for the first time a statistically
sound and dynamic model of attention. In chapter 4, we will introduce surprise based
attention and show that it does an excellent job of taking into account dynamic attentional
effects seen in rapid serial vision experiments. This is then shown to give a good
framework for a short term attention gate mechanism in chapter 5.
In short, the new framework has some similarities to Bruce and Tsotsos in that
prior images are used to create belief about new images. However, surprise computes
these beliefs online. This means that it does not need to be trained or have strong prior
information about feature prevalence. Instead the sequence provides the needed
information. The extensive testing and validation in chapters 4-6 also demonstrate firmly
that it explains many temporal attention effects. Additionally, we postulate that we have
gained further insight into the attentional window into the brain.
Chapter 2: Distributed Biologically Based Real Time Tracking
with Saliency Using Vision Feature Analysis Toolkit (VFAT)4
In a prior project, we developed a multi agent system for noticing and tracking
different visual targets in a room. This was known as the iRoom project. Several aspects
of this system included both individual noticing and acquisition of unknown targets as
well as sharing that information with other tracking agents (Mundhenk et al., 2003a,
Mundhenk, Dhavale, Marmol, Calleja, Navalpakkam, Bellman, Landauer, Arbib & Itti,
2003b). This chapter is primarily concerned with a combined tracker that uses the
saliency of targets to notice them. It then classifies them without strong prior knowledge
(priors) of their visual features, and passes that information about the targets to a tracker,
which conversely requires prior information about features in order to track them. This
combination of trackers allows us to find unknown, but interesting objects in a scene and
classify them well enough to track them. Additionally, information gathered can be
placed into a signature about objects being tracked and shared with other camera agents.
The signature that can be passed is helpful for many reasons since it can bias other agents
towards a shared target as well as help in creating task-dependent tracking.
2.1.1 Vision, Tracking and Prior Information
For most target acquisition and tracking purposes, prior information about the
target's features is needed in order for the tracker to perform its task. For instance, a basic
color tracker that tracks objects based on color needs to know a priori what the color of
4 For more information see also: http://ilab.usc.edu/wiki/index.php/VFAT_Tech_Doc
the target that it wishes to track is. If one is going to track a flying grapefruit, then one
would set a tracker with a certain color of yellow and some threshold about which the
color can vary. In general, many newer trackers use statistical information about an
object's features, which allows one to define seemingly more natural boundaries for what
features one would expect to find on a target (Lowe, 1999, Mundhenk et al., 2004b,
Mundhenk et al., 2004c, Mundhenk et al., 2005a, Siagian & Itti, 2007). However, in
order to deploy such a tracker, one needs to find the features which describe the object
before tracking it. This creates two interesting problems. The first problem is that the set
of training examples may be insufficient to describe the real world domain of an object.
That is, the trainer leaves out examples from training data, which may hold important
information about certain variants of an object. We might assume, for instance, in our flying grapefruit tracking example, that of the fruits that fly by, oranges never do. As a
result, we would unknowingly let our tracker have some leeway and track grapefruit that
might even be orange in appearance. It might however turn out that we were wrong. At
some point, an orange flies by and our tracker tracks it the same as a flying grapefruit.
This can happen for several reasons. The first is that we had never observed an orange fly
by and as such didn’t realize that indeed, they can fly by. Another reason is that the world
changed. When we set up the tracker, only grapefruits could fly by. However, the thing
that makes them fly, now acts on oranges, which may be an accidental change, for
instance if an orange tree begins to grow in our flying grapefruit orchard. However, it
might also be the case that someone has decided to start throwing oranges in front of our
tracker. As such, the domain of trackable objects can change either accidentally or
intentionally. In such a case, our tracker may now erroneously track flying oranges as
flying grapefruit.
As can be seen from our first example, our tracker might fail if someone tries to
fool it. Someone starts throwing oranges in front of our tracker, or perhaps they might
wrap our grapefruits in a red wrapper so that our tracker thinks they are apples. If we are
selling our flying grapefruits and our tracker is supposed to make sure each one makes it
to a shipping crate, it would fail if someone sneaks them by as another fruit. As such,
once a dishonest person learns what our tracker is looking for, it becomes much easier to
fool. This is seen in the real world in security applications, such as Spam filtering, where
many email security companies have to update information on what constitutes Spam on
a regular basis to deal with spammers who learn simple ways around the filters. It should
be expected that the same problem would go for any other security related application
including a vision-based tracker. In the case of our flying grapefruit tracker, its function
may not be explicitly security related, but as a device related to accounting, it is prone to
tampering.
What is needed then for vision-based tracking is the ability to define its
own priors. It has been proposed that gestalt rules of continuity and motion allow visual
information to be learned without necessarily needing prior information about what
features individual objects possess (von der Malsburg, 1981, Prodöhl, Würtz & von der Malsburg, 2003, Mundhenk et al., 2004b, Mundhenk & Itti, 2005). That is, the human
visual system does not necessarily know what it is looking for, but it knows how to learn
how to look. This itself constitutes a kind of prior information which one might consider
meta-prior information. That is, information about what structure or meta-model is
needed to gather prior information, such as Bayesian information, is itself a type of prior
information. Using meta-prior information, an artificial agent might learn on its own how
to form groups that can be used to create statistical relationships and build new prior
information about what it wishes to track. Thus, abstractly speaking, meta-priors are
concerned with learning about how to learn.
2.1.3 Meta-priors, Bayesian Priors and Logical Inductive Priors
Figure 2.1: It is interesting to note how different AI solutions require different amounts of prior
information in order to function. Additionally, it seems that the more prior information a solution requires
the more certainty it has in its results, but the more biased it becomes towards those results. Thus, we can
place solutions along a spectrum based on the prior information required. Popular solutions such as Back
Propagation Neural Networks and Support Vector Machines seem to fall in the middle of the spectrum in
essence making them green machines and earning them the reputation of being the 2nd best solution for
every problem.
We propose that meta-priors are part of a spectrum of knowledge acquisition and
understanding. At one end of the spectrum are the rigid rules of logic and induction from
which decisions are drawn with great certainty, but with which unknown variables must
be sparse enough to make those reasonable decisions (figure 2.1). In the middle we place
more traditional statistical methods, which either require what we will define as strong
meta-priors in order to work or require Bayesian priors. We place the statistical machines
in the middle, since they allow for error and random elements as part of probabilities and
do not need to know everything about a target. Instead, they need to understand the
variance of information and draw decisions about what should be expected. Typically,
this is gifted to a statistical learner in the form of a kernel or graph. Alternatively, the
meta-prior does not make an inference about knowledge itself, but instead is used to
understand its construction. From this, we then state that meta-priors can lead to
Bayesian priors, which can then lead to logical inductive priors. From meta-priors we
have the greatest flexibility about our understanding of the world and in general terms,
the least amount of bias; whereas on the other end of the spectrum, logical inductive
priors have the least flexibility, but have the greatest certainty.
An ideal agent should be able to reason about its knowledge along this spectrum.
If a probability becomes very strong, then it can become a logical rule. However, if a
logical rule fails, then one should reason about the probability of it doing so.
Additionally, new things may occur which have yet unknown statistical properties. As
such, the meta-priors can be used to promote raw data into a statistical framework or to
re-reason about a statistical framework, which now seems invalid.
Using certain kinds of meta-prior information, many Bayesian systems are able to
find groupings which can serve as prior information to other programs which are unable
to do so themselves. However, most Bayesian models work from meta-priors that require
a variety of strong meta-priors. For instance, the most common requirement is that the
number of object or feature classes must be specified. This can be seen in expectation
maximization, K-means and back-propagation neural networks, which need to have a set
size for how many classes exist in the space they inspect. The number of classes thus,
becomes a strong and rather inflexible meta-prior for these methods. Additionally, other
strong meta-priors may include space size, data distribution types and the choice of
kernel.
The interesting thing about meta-priors is that they can be flexible or rigid. For
instance, specifying you have several classes that are fit by a Gaussian distribution is
semi-flexible in that you have some leeway in the covariance of your data, but the
distribution of the data should be uni-modal and have a generally elliptical shape. An
example of more rigid meta-priors would be specifying a priori the number of classes
you believe you will have. So for instance, going back to our grapefruit example, if you
believe your data to be Gaussian, you suspect that flying grapefruit have a mean color
with some variance in that color. You can make a more rigid assumption that you will
only see three classes such as, flying grapefruit, oranges and apples. All of these are of
course part of the design process, but as mentioned they are prone to their own special
problems. Ideally, an intelligent agent that wishes to reason about the world should have
the ability to reason with flexible weak meta-priors but then use those to define Bayesian
like priors. Here we define weak meta-priors as having flexible parameters that can
automatically adjust to different situations. So for instance, we might set up a computer
vision system and describe for it the statistical features of grapefruit, oranges and apples.
However, the system should be able to define new classes from observation either by
noticing that a mass of objects (or points) seems to be able to form its own category
(Rosenblatt, 1962, Dempster, Laird & Rubin, 1977, Boser, Guyon & Vapnik, 1992, Jain,
Murty & Flynn, 1999, Müller, Mika, Rätsch, Tsuda & Schölkopf, 2001, Mundhenk et al.,
2004b, Mundhenk et al., 2005a) or through violation of expectation and surprise (Itti &
Baldi, 2005, Itti & Baldi, 2006). An example of the first is that if we cluster data points
that describe objects, and if a new object appears such as a kiwi, a new constellation of
points will emerge. An example of the second is that if we expect an apple to fly by, but
see an orange, it suggests something interesting is going on. It might be that new fruit
have entered our domain. In the first case, our learning is inductive, while in the second
case it is more deductive.
We thus define weak meta-priors to be situationally independent. That is, the
meta-prior information can vary depending on the situation and the data. Ideally,
information within the data itself is what drives this flexibility. So for instance, when
selecting what is the most salient object in a scene, we might select a yellow ball.
However, a moving gray ball may be more salient if presented at the same time as the
yellow ball. Thus, the selection feature for what is most salient is not constantly a color,
but can also be motion. So it is the interplay of these features, which can promote the
saliency of one object over the other (Treisman & Gelade, 1980). Yet another example is
that the number of classes is not defined a priori as a strong meta-prior, but instead,
variance between features causes them to coalesce into classes. So as an abstract
example, the number of planets in a solar system is not pre-determined. Instead, the
interplay of physical forces between matter will eventually build a certain number of
planets. Thus, the physical forces of nature are abstractly a weak meta-prior for what kind
of planets will emerge, and how many will be formed.
2.1.4 The iRoom and Meta-prior Information
Here we now review a vision system for following and tracking objects and
people in a room or other spaces that can process at the level of weak meta-priors,
Bayesian priors and even logical inductive priors. From this, we then need artificial
experts, which can use weak meta-priors to process information into more precise
statistical and Bayesian form information. Additionally, once we know things with a
degree of certainty, it is optimal to create rules for how the system should behave. That
is, we input visual information looking for new information from weak meta-priors,
which can be used to augment a vision system that uses Bayesian information. Eventually
strong Bayesian information can be used to create logical rules. We will describe this
process in greater detail in the following pages but give a brief description here.
Using a biological model of visual saliency from the iLab Neuromorphic Vision
Toolkit (INVT) we find what is interesting in a visual scene. We then use it to extract
visual features from salient locations (Itti & Koch, 2001b) and group them into classes
using a non-parametric and highly flexible weak meta-prior classifier NPclassify
(Mundhenk et al., 2004b, Mundhenk et al., 2005a). This creates initial information about
a scene: for instance how many classes of objects seem present in a scene, where they are
and what general features they contain. We then track objects using this statistically
priorless tracker but gain advantage by taking the information from this tracker and
handing it to a simple tracker, which uses statistical adaptation to track a target with
greater effectiveness. In essence, it takes in initial information and then computes its own
statistical information from a framework using weak meta-prior information. That
statistical information is then used as a statistical prior in another simpler and faster
tracker.
2.2 Saliency, Feature Classification and the Complex Tracker
There were several components used in the tracking system in iRoom. As
mentioned, these started by needing less meta-prior information and then gathering
information that allows the tracking of targets by more robust trackers that require more
information about the target. The first step is to notice the target. This is done using
visual saliency. Here very basic gestalt rules about the uniqueness of features in a scene
are used to promote objects as more or less salient (Treisman & Gelade, 1980, Koch &
Ullman, 1985, Itti & Koch, 2001b). This is done by competing image feature locations
against each other. A weak image feature that is not very unique will tend to be
suppressed by other image features, while strong image features that are different will
tend to pop out as they receive less inhibition. In general, the saliency model acts as a kind
of max selector over competing image features. The result from this stage is a saliency
map that tells us how salient each pixel in an image is. Once the saliency of locations in
an image can be computed, we can extract information about the features at those
locations. This is done using a Monte Carlo like selection that treats the saliency map as a
statistical map for these purposes. The more salient a location in an image is, the more
likely we are to select a feature from that location. In the current working version we
select about 600 feature locations from each frame of video. Each of the feature locations
contains information about the image such as color, texture and motion information.
These are combined together and used to
Figure 2.2: The complex feature tracker is a composite of several solutions. It first uses INVT visual
saliency to notice objects of interest in a scene. Independent Component Analysis and Principle Component
Analysis (Jollife, 1986, Bell & Sejnowski, 1995, Hyvärinen, 1999) are used to reduce dimensions and
condition the information from features extracted at salient locations. These are fed to a non-parametric
clustering based classification algorithm called NPclassify, which identifies the feature classes in each
image. The feature classes are used as signatures that allow the complex tracker to compare objects across
frames and additionally share that information with other trackers such as the simple tracker discussed
later. The signatures are also invariant to many view point effects. As such they can be shared with cameras
and agents with different points of view.
classify each of the 600 features into distinct classes. For this we use the non-parametric
classifier NPclassify mentioned above. This classifier classifies each feature location
without needing to know a priori the number of object feature classes or how many
samples should fall into each class. It forms classes by weighting each feature vector
from each feature location by its distance to every other point. It then can link each
feature location to another, which is the closest feature location that has a higher weight.
This causes points to link to more central points. Where a central point links to another
cluster it is not a member of, we tend to find that the link is comparatively rather long.
We can use this to cut links, thus, creating many classes. In essence, feature vectors from
the image are grouped based on value proximity. As an example, two pixels that are close
to each other in an image and are both blue would have a greater tendency to be grouped
together than two pixels in an image that are far apart and are blue and yellow.
Once we have established what classes exist and which feature locations belong to
them, we can statistically analyze them to determine prior information that will be useful
to any tracker, which requires statistical prior information in order to track a target. Thus,
we create a signature for each class that describes the mean values for each feature type
as well as the standard deviation within that class. Additionally, since spatial locations
play a part in weighting feature vectors during clustering, feature vectors that are
classified in the same class tend to lie near each other. Thus, the signature can contain the
spatial location of the class as well. Figure 2.2 shows the flow from saliency to feature
classification and signature creation.
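The sketch below shows one plausible way to represent such a signature; the field names and structure are invented for illustration and are not the data structures used in the VFAT code.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class ClassSignature:
        # Statistical description of one feature class found by NPclassify.
        feature_mean: np.ndarray   # mean of each feature dimension
        feature_std: np.ndarray    # standard deviation of each feature dimension
        location: tuple            # (x, y) centroid of the class in the image
        n_samples: int             # how many feature vectors fell into the class

    def make_signature(feature_vectors, positions):
        fv = np.asarray(feature_vectors, dtype=float)
        xy = np.asarray(positions, dtype=float)
        return ClassSignature(fv.mean(axis=0), fv.std(axis=0),
                              tuple(xy.mean(axis=0)), len(fv))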
The signatures we derive from the feature properties of each class exist to serve
two purposes. The first is that it allows this complex tracker to build its own prior
awareness. When it classifies the next frame of video, it can try to match each of the new objects it finds to the objects from the last frame. Thus, it is not just a
classifier, but it can track objects on its own for short periods. Further, we can use
information about targets to bias the classification process between frames. So for
instance, we would expect that the second frame of video in a sequence should find
objects which are similar to the first frame. As such, each classified object in any given
frame biases the search in the next frame by weighting the classifier towards finding
objects of those types.
While this seems very complex, signature creation is fairly quick. Saliency computation is done in real time on eight 733 MHz Pentium III computers in a Beowulf cluster, and the rest of the code runs in under 60 ms on an Opteron 150 based computer. This means we can do weak meta-prior classification and extraction of signatures at more than 15 frames per second.
2.2.1 Complex Feature Tracker Components
2.2.1.1 Visual Saliency
The first stage of processing is finding which locations in an image are most
salient. This is done using the saliency program created by (Itti & Koch, 2001b), which
works by looking for certain types of uniqueness in an image (Figure 2.3). This simulates
the processing in visual cortex that the human brain performs in looking for locations in
an image, which are most salient. For instance, a red Coke can placed among green
foliage would be highly salient since it contrasts red against green. In essence, each pixel
in an image can be analyzed and assigned a saliency value. From this a saliency map can
be created. The saliency map simply tells us the saliency of each pixel in an image.
2.2.1.2 Monte Carlo Selection
The saliency map is taken and treated as a statistical map for the purpose of
Monte Carlo selection. The currently used method will extract a specified number of
features from an image. Highly salient locations in an image have a much higher
probability of being selected than regions of low saliency. Additionally, biases from other
modules may cause certain locations to be picked over consecutive frames from a video.
For instance, if properties of a feature vector indicate it is very useful, then it makes sense
to select from a proximal location in the next frame. Thus, the saliency map combines
with posterior analysis to select locations in an image which are of greatest interest.
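A minimal sketch of this Monte Carlo selection step is given below, assuming only a non-negative saliency map; the 600-sample figure comes from the text above, while everything else (normalization, random number handling) is an illustrative choice rather than the actual VFAT procedure.

    import numpy as np

    def monte_carlo_select(saliency_map, n_samples=600, rng=None):
        # Treat the (non-negative) saliency map as an unnormalized probability
        # distribution and draw pixel coordinates from it, so highly salient
        # locations are sampled far more often than weakly salient ones.
        rng = np.random.default_rng() if rng is None else rng
        p = np.clip(saliency_map.astype(float), 0, None).ravel()
        p /= p.sum()
        idx = rng.choice(p.size, size=n_samples, p=p)
        ys, xs = np.unravel_index(idx, saliency_map.shape)
        return list(zip(ys, xs))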
Figure 2.3: The complete VFAT tracker is a conglomeration of different modules that select features from
an image, mix them into more complex features and then try to classify those features
meta-priors for what kind of features it should be looking for.
2.2.1.3 Mixing Modules
2.2.1.3.1 Junction and End Stop Extraction
Figure 2.4: Saliency is comprised of several channels which process an image at a variety of different
scales and then combine those results into a saliency map.
During the computation of visual saliency, orientation filtered maps are created.
These are the responses of the image to Gabor wavelet filters. These indicate edges in the
image. Since each filter is tuned to a single preferred orientation, a response from a filter
indicates an edge that is pointed in the direction of preference. The responses from the
filters are stored in individual feature maps. One can think of a feature map as simply an
image which is brightest where the filter produces its highest response. Since the feature
maps are computed as part of the saliency code, re-using them can be advantageous from
an efficiency standpoint. From this we create feature maps to find visual junctions and
end-stops in an image by mixing the orientation maps (Figure 2.4). We believe such new
complex feature maps can also tell us about the texture at image locations which can help
give us the gist of objects to be tracked.
The junction and end-stop maps are computed as follows. Note that this is a different computation than the one used in appendix D and chapter 5 in the attention gate model. At some common point $i,j$ on the orientation maps $P$, the filter responses from the orientation filters are combined. Here the response to an orientation in one orientation map, $p_{ij}$, is subtracted from an orthogonal map's orientation filter output, $p^{\mathrm{orth}}_{ij}$, and divided by a normalizer $n$, which is the maximum value of the numerator. For instance, one orientation map that is selective for 0 degree angles is subtracted from another map selective for 90 degree angles. This yields the lineyness of a location in an image, because where orthogonal maps overlap in their response is at the junctions of lines.
(2.1)   $a^{k}_{ij} = \dfrac{p_{ij} - p^{\mathrm{orth}}_{ij}}{n}\,; \quad k \in \{1, 2\}$
We then compute a term (2.2) which is the orthogonal filter responses summed. This is
nothing more than the sum of the responses in two orthogonal orientation maps.
Figure 2.5: The three images on the right are the results of the complex junction channel after ICA/PCA
processing from the original image on the left. As can be seen it does a reasonable job of finding both
junctions and end stops.
(2.2)   $b^{k}_{ij} = \dfrac{p_{ij} + p^{\mathrm{orth}}_{ij}}{n}\,; \quad k \in \{1, 2\}$
The individual line maps are combined as:
(2.3)   $\alpha_{ij} = \dfrac{a^{1}_{ij} + a^{2}_{ij}}{n}$
This gives the total lineyness for all orientations. We then do a similar thing for our total
response maps:
(2.4)   $\beta_{ij} = \dfrac{b^{1}_{ij} - b^{2}_{ij}}{n}$
The final junction map γ is then computed by subtracting the lineyness term from the
total output of the orientation filters:
(2.5)   $\gamma_{ij} = \alpha_{ij} - \beta_{ij}$
Since the junction map is computed by adding and subtracting orientation maps which
have already been computed during the saliency computation phase, we gain efficiency
we wouldn’t have had if we were forced to convolve a whole new map by a kernel filter.
Thus, this junction filter is fairly efficient since it does not require any further
convolution to compute. Figure 2.5 shows the output and it can be seen that it is effective
at finding junctions and end-stops.
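The sketch below is a literal transcription of equations 2.1–2.5 for the two orthogonal pairs (0°, 90°) and (45°, 135°), under the assumption that $k$ indexes those pairs and that the normalizer $n$ is the maximum value of the corresponding numerator. The sign conventions follow the equations as printed; the actual VFAT implementation may differ.

    import numpy as np

    def junction_map(p0, p45, p90, p135):
        # p0..p135: orientation maps (same shape) from the saliency computation.
        # Each a_k measures "lineyness" (one orientation responding without its
        # orthogonal partner); each b_k is the total response of an orthogonal pair.
        pairs = [(p0, p90), (p45, p135)]
        a, b = [], []
        for p, p_orth in pairs:
            diff, summ = p - p_orth, p + p_orth
            a.append(diff / max(np.abs(diff).max(), 1e-9))    # eq. 2.1
            b.append(summ / max(np.abs(summ).max(), 1e-9))    # eq. 2.2
        alpha = a[0] + a[1]
        alpha /= max(np.abs(alpha).max(), 1e-9)               # eq. 2.3
        beta = b[0] - b[1]
        beta /= max(np.abs(beta).max(), 1e-9)                 # eq. 2.4
        return alpha - beta                                   # eq. 2.5 (junction map)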
2.2.1.3.2 ICA/PCA
We decrease the dimensionality of each feature vector by using a combination of
Independent Component Analysis (ICA) (Bell & Sejnowski, 1995) and Principle
Component Analysis (PCA) (Jollife, 1986). This is done using FastICA (Hyvärinen,
1999) to create ICA un-mixing matrices offline. The procedure for training this is to
extract a large number of features from a large number of random images. We generally
use one to two hundred images and 300 points from each image using the Monte Carlo
selection process just described. FastICA first determines the PCA reduction matrix
and then determines the matrix that maximizes the mutual information using ICA.
Unmixing matrices are computed for each type of feature across scales. So as an
example, the red-green opponent channel is computed at different scales, usually six.
PCA/ICA will produce a reduced set of two opponent maps from the six original scale
maps (This is described in detail later and can be seen in figure 2.7). Using ICA with
PCA helps to ensure that we not only reduce the dimension of our data set, but that the
information sets are fairly unique. From the current data, we reduce the total number of
dimensions with all channels from 72 to 14, which is a substantial efficiency gain, especially given the fact that some modules have complexity O(d²) for d feature channels (dimensions).
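A sketch of this offline reduction step using scikit-learn's FastICA (which whitens the data with a PCA step internally) is shown below; it stands in for the Matlab FastICA procedure described in the text, with the six-scale input layout and two-component output taken from the description above.

    import numpy as np
    from sklearn.decomposition import FastICA

    def train_channel_reduction(feature_samples, n_components=2):
        # feature_samples: (n_samples, 6) array, one column per pyramid scale of a
        # single channel (e.g. red/green), gathered offline from random images.
        ica = FastICA(n_components=n_components, random_state=0)
        ica.fit(feature_samples)
        return ica

    def reduce_channel(ica, feature_samples):
        # Project new six-dimensional samples down to the reduced representation.
        return ica.transform(feature_samples)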
Figure 2.6: NPclassify works by (A) first taking in a set of points (feature vectors) (B) then each point is
assigned a density which is the inverse of the distance to all other points (C) Points are then linked by
connecting a point to the nearest point which has a higher density (D) Very long links (edges) are cut if
they are for instance statistically longer than most other links. This creates separate classes.
2.2.1.4 Classification Modules
2.2.1.4.1 Classification of Features with NPclassify5
Features are initially classified using a custom non-parametric clustering
algorithm called NPclassify6. The idea behind the design of NPclassify is to create a
5 This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/NPclassify2.C/.H
6 A description and additional information on top of what will be discussed can be found at:
http://www.nerd-cam.com/cluster-results/.
clustering mechanism which has soft parameters that are learned and are used to classify
features. We define here soft parameters as values which define the shape of a meta-prior.
This might be thought of as being analogous to a learning rate parameter or a Bayesian
hyperparameter. For instance, if we wanted to determine at which point to cut off a
dataset and decided on two standard deviations from the mean, two standard deviations
would be a soft parameter since the actual cut off distance depends on the dataset.
NPclassify (Figure 2.2, 2.6 and 2.7) (Mundhenk et al., 2004b, Mundhenk et al.,
2005a) works by using a kernel to find the density at every sample point. The currently
used kernel does this by computing the inverse of the sum of the Euclidean distance from
each point to all other points. After density has been computed the sample points are
linked together. This is done by linking each point to the closest point which has a higher
density. This creates a path of edges which ascends acyclically along the points to the
point in the data set which has the highest density of all. Classes are created by figuring
out which links need to be cut. For instance, if a link between two sample points is much
longer than most links, it suggests a leap from one statistical mode to another. This then
may be a good place to cut and create two separate classes. Additionally, classes should
be separated based upon the number of members the new class will have. After classes
have been created, they can be further separated by using interclass statistics.
The advantage to using NPclassify is that we are not required to have a prior
number of classes or any prior information about the spatial or sample sizes of each class.
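The sketch below re-implements the clustering steps as described (inverse summed distance as density, links to the nearest denser point, cutting unusually long links). It is not the actual NPclassify2 code; in particular, the mean-plus-two-standard-deviations cutoff stands in for the learned soft parameter discussed below.

    import numpy as np

    def npclassify_sketch(points, cut_z=2.0):
        # points: (n, d) feature vectors.  Returns an integer class label per point.
        pts = np.asarray(points, dtype=float)
        n = len(pts)
        dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        density = 1.0 / (dist.sum(axis=1) + 1e-9)      # inverse summed distance

        # Link every point to the nearest point that has a higher density.
        parent = np.arange(n)
        link_len = np.zeros(n)
        for i in range(n):
            higher = np.flatnonzero(density > density[i])
            if higher.size:
                j = higher[np.argmin(dist[i, higher])]
                parent[i], link_len[i] = j, dist[i, j]

        # Cut links that are unusually long; the learned cutoff is stood in for
        # here by mean + cut_z standard deviations of the link lengths.
        linked = link_len > 0
        if linked.any():
            cutoff = link_len[linked].mean() + cut_z * link_len[linked].std()
            cut = np.flatnonzero(link_len > cutoff)
            parent[cut] = cut                          # cut points become new roots

        # A point's class is the root reached by ascending its links.
        def root(i):
            while parent[i] != i:
                i = parent[i]
            return i
        _, labels = np.unique([root(i) for i in range(n)], return_inverse=True)
        return labels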
Figure 2.7: On the left are samples of features points with the class boundaries NPclassify has discovered.
Some of the classes have large amounts of noise while others are cramped together rather than being
separated by distance. On the right are the links NPclassify drew in order to create the clusters. Red links
are ones which are too long and were clipped by the algorithm to create new classes.
Instead, the modal distribution of the dataset combined with learned notions of feature
connectedness determine whether a class should be created. So long as there is some
general statistical homogeneity between training and testing datasets we should expect
good performance for clustering based classification. The training results are discussed
later in the section on training results.
Figure 2.8: The results using NPclassify are shown next to the same results for k-means on some sham
data. The derived clusters are shown with the Gaussian eigenmatrix bars (derived using the eigenmatrix
estimation in section 2.2.1.4.2). In general, NPclassify creates more reliable clusters particularly in the
presence of noise. Additionally, it does so without needing to know a priori how many classes one has.
As such, we do have a few meta-priors still present. The first is a basic kernel
parameter for density. In this case, the Euclidian distance factor makes few assumptions
about the distribution other than that related features should clump together. The second
meta-prior is learned as a hyperparameter for a good cutoff. This can be derived using
practically any gradient optimization technique. So it is notable that NPclassify is not without some type of prior, but the assumptions on the data are quite relaxed and only
assumes that related feature samples will be close to each other in feature space. An
example of NPclassify working on somewhat arbitrary data points can be seen in figure
2.8.
2.2.1.4.2 Gaussian Generalization and Approximation7
In order to store classes for future processing it is important to generalize them.
Gaussian ellipsoids are used since their memory usage for any class is O(d²) for d dimensions. Since d is fairly low for us, this is an acceptable complexity. Additionally, by using Gaussians we gain the power of Bayesian inference when trying to match feature classes to each other. However, the downside is that computing the eigenmatrix necessary for Gaussian fitting scales minimally as d³ for dimensions and s² for the number of samples. That is, it is O(d³ + s²). This is due to the fact that computing such elements using the pseudo-inverse method (or QR decomposition) involves matrix inversion and multiplication. In order to avoid such large complexity we have implemented an approximation technique that scales minimally as d² for dimensions and s for the number of samples, i.e., O(sd²). This means that a net savings happens if the number of samples is much larger than the number of dimensions. So for
7 This component is implemented in the iLab Neuromorphic Vision Toolkit as VFAT/covEstimate.C/.H
instance, if there are more than 100 samples and only 10 dimensions, this will produce a
savings over traditional methods.
Figure 2.9: After NPclassify has grouped feature samples together they can be fit with Gaussian
distributions. This helps to determine the probability that some new feature vector belongs to a given class
or that two classes computed in consecutive frames using NPclassify are probably the same class. If the
distributions overlap greatly as on the left figure, then two classes are probably the same class.
The approximation method works by using orthogonal rotations to center and
remove covariance from the data. By recording the process, we can then compute the probability of new data points by translating and transforming them in the same way to align with the data set. What we want to be able to do is to tell the probability of data points belonging to some class, as well as to tell if two classes derived in consecutive frames are probably the same class (see figure 2.9).
The first step is to center the data about the origin. This is done by computing the
mean and then subtracting that number from each feature vector. Next we compute
approximate eigenvectors by trying to find the average vector from the origin to all
feature vector coordinates. So for the $k$th feature vector, we first compute the ratio between its distances $l$ from the origin along dimensions $j$ and $i$. This yields the ratio $r_{ijk}$. That is, after aligning the feature vector with the origin, we take the ratio of two features in the same vector (we will do this for all possible feature pairs in the vector).
(2.6)   $r_{ijk} = \dfrac{l_{jk}}{l_{ik}}$
Next we find the Euclidean distance $u_{ijk}$ from the origin along dimensions $j$ and $i$.
(2.7)   $u_{ijk} = \sqrt{l_{ik}^{2} - l_{jk}^{2}}$
By summing the ratio of $r_{ijk}$ and $u_{ijk}$ over all $k$ feature vectors, we obtain a mean ratio that describes the approximated eigenvector along the dimensions $i$ and $j$.
(2.8)   $m_{ij} = \displaystyle\sum_{k} \dfrac{r_{ijk}}{u_{ijk}}$
A normalizer is computed as the sum of all the distances for all samples k.
(2.9)   $n_{ij} = \displaystyle\sum_{k} u_{ijk}$
Next we determine the actual angle of the approximated eigenvector along the
dimensions i and j.
(2.10)   $\theta_{ij} = \tan^{-1}\!\left(\dfrac{m_{ij}}{n_{ij}}\right)$
Once we have that, we can rotate the data set along that dimension and measure the length of the ellipsoid using a basic sum of squares operation. Thus, we compute $\rho_{ik}$ and $\rho_{jk}$, which are the data set rotated by $\theta_{ij}$. Here $\xi$ is the position of the $k$th feature vector along the $i$ dimension and $\psi$ is its position along the $j$ dimension. What we are doing here is rotating covariance out along each dimension so that we can measure the length of the eigenvalue. Thus, we iterate over all data points $k$, over all dimensions $i$, and over dimensions $j$ with $i+1 \leq j$, summing up $\sigma$ as we go. We only sum $j$ from $i+1$ since we only need one triangle of the eigenvector matrix, which is symmetric along the diagonal.
(2.11)   $i + 1 \leq j$

(2.12)   $\rho_{ik} = \xi \cdot \cos(\theta_{ij}) + \psi \cdot \sin(\theta_{ij})$

(2.13)   $\rho_{jk} = -\xi \cdot \sin(\theta_{ij}) + \psi \cdot \cos(\theta_{ij})$
What we have done is figure out how much we need to rotate the set of feature vectors in
order to align the least squares slope with the axis. Once this is done, we can rotate the
data set and remove covariance. Since the mean is zero because we translated the data set
by the mean to the origin, variance for the sum of squares is computed simply as:
(2.14)   $s_{iij} = \dfrac{\sum_{k} \rho_{ik}^{2}}{n}$

(2.15)   $s_{jji} = \dfrac{\sum_{k} \rho_{jk}^{2}}{n}$
Each sum of squares is used to find the eigenvalue estimate by computing Euclidean distances. That is, by determining the travel distance of each eigenvector during rotation and combining that number with the computed sum of squares, we can determine an estimate of the eigenvalue by triangulation. The conditional here is used because $\sigma_{ii}$ is computed more than once with different values of $\theta_{ij}$. Thus, $\sigma_{ii}$ is the sum of all the products of $\theta_{ij}$ and $s_{iij}$.
(2.16)   $\sigma_{ii} = \begin{cases} s_{iij} & \text{iff } \sigma_{ii} = 0 \\ \sigma_{ii} + s_{iij} \cdot \left(\cos\theta_{ij}\right)^{-2} & \text{otherwise} \end{cases}$

(2.17)   $\sigma_{jj} = \begin{cases} s_{jji} & \text{iff } \sigma_{jj} = 0 \\ \sigma_{jj} + s_{jji} \cdot \left(\cos\theta_{ij}\right)^{-2} & \text{otherwise} \end{cases}$
The end result is a non-standard eigenmatrix which can be used to compute the probability that a point lies in a Gaussian region. We do this by performing the same procedure on any new feature vector. That is, we take any new feature vector and replay the computed translation and rotations to align it with the covariance-neutral eigenmatrix approximation. Probability for the feature vector is then computed independently along each dimension, thus eliminating further matrix multiplication during the probability computation.
To summarize, by translating and rotating the feature set, we have removed
covariance so we can compute probabilities assuming dimensions do not interact. In
essence this removes the need for complex matrix operations. While the complexity is
high, it is one order lower than the standard matrix operations as was mentioned earlier.
Examples of fits created using this method can be seen in figure 2.7 where NPclassify
has created classes and the eigenmatrix has been estimated for the classes created.
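The sketch below captures the idea in two dimensions only: center the class, rotate so that the least-squares slope (which the rotation step above is meant to align) lies along an axis, record per-axis variances, and score new vectors by replaying the translation and rotation and evaluating an independent Gaussian per dimension. The full d-dimensional pairwise-rotation procedure and the triangulated eigenvalue estimate of equations 2.16 and 2.17 are not reproduced here.

    import numpy as np

    def fit_axis_aligned_gaussian(points):
        # points: (n, 2) feature vectors from one class.  Returns the parameters
        # needed to replay the centering and rotation on new vectors.
        pts = np.asarray(points, dtype=float)
        mean = pts.mean(axis=0)
        c = pts - mean
        # Angle of the least-squares slope of dimension 1 on dimension 0.
        theta = np.arctan2((c[:, 0] * c[:, 1]).sum(), (c[:, 0] ** 2).sum())
        # Rotate by -theta so that slope lies along the first axis.
        rot = np.array([[np.cos(-theta), -np.sin(-theta)],
                        [np.sin(-theta),  np.cos(-theta)]])
        r = c @ rot.T
        var = r.var(axis=0) + 1e-9          # independent per-axis variances
        return mean, rot, var

    def class_log_prob(x, mean, rot, var):
        # Replay the translation and rotation on a new vector, then evaluate an
        # independent Gaussian along each axis (no matrix inversion needed).
        r = rot @ (np.asarray(x, dtype=float) - mean)
        return float((-0.5 * r ** 2 / var - 0.5 * np.log(2 * np.pi * var)).sum())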
2.2.1.4.3 Feature Contiguity, Biasing and Memory
Once features have been classified we want to use them to perform various tasks.
These include target tracking, target identification and feature biasing. Thus from a
measurement of features from time t, we would like to know if a collection of features at
time t+1 is the same, and as such either the same object or a member of the same object.
By using Bayesian methods we can link classes of features in one frame of a
video to classes in the next frame by tying a class to another which is its closest
probabilistic match. Additionally, we use the probability to bias how the non-parametric
classifier and saliency work over consecutive frames. For NPclassify we add a sink into
the density computation. That is, we create a single point whose location is the mean of a
class with the mass of the entire class. Think of this as dropping a small black hole in a
galaxy that represents the mass of the other class. By inserting this into the NPclassify
computation, we skew the density computation towards the prior statistics in the last
iteration. This creates a Kalman-filter-like effect that smooths the computation of classes
between frames. This is a reasonable action since the change in features from one frame
to the next should be somewhat negligible.
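A minimal sketch of this frame-to-frame matching is given below, using the hypothetical ClassSignature fields sketched earlier: each class in the new frame is linked to the previous-frame signature that assigns its mean feature vector the highest independent-per-dimension Gaussian log probability, with an arbitrary acceptance threshold for declaring a new object. The biasing "sink" inserted into the NPclassify density computation is not reproduced here.

    import numpy as np

    def match_classes(prev_signatures, new_signatures, min_log_prob=-50.0):
        # Link each new class to the previous-frame class whose Gaussian signature
        # gives its mean feature vector the highest probability; unmatched classes
        # are treated as newly appearing objects (mapped to None).
        links = {}
        for i, new in enumerate(new_signatures):
            best_j, best_lp = None, -np.inf
            for j, old in enumerate(prev_signatures):
                var = old.feature_std ** 2 + 1e-9
                d = new.feature_mean - old.feature_mean
                lp = float((-0.5 * d ** 2 / var - 0.5 * np.log(2 * np.pi * var)).sum())
                if lp > best_lp:
                    best_j, best_lp = j, lp
            links[i] = best_j if best_lp > min_log_prob else None
        return links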
2.2.1.5 Complex Feature Tracker Methods and Results
2.2.1.5.1 Complexity and Speed
One of the primary goals of VFAT is that it should be able to run in real time.
This means that each module should run for no more than about 30 ms. Since we are
using a Beowulf cluster, we can chain together modules such that even if we have several
steps that take 30 ms each, by running them on different machines we can create a vision
pipeline whereby a module finishes a job and hands the results to another machine in a
Beowulf cluster that is running the next process step.
In time trials the modules run at real-time speeds. Using a Pentium 4 2.4 GHz Mobile Processor with 1 GB of RAM, each module of VFAT runs in about 30 ms or less. The longest running module is the NPclassify feature classifier. If given only 300 features it runs in 23 ms; for 600 features it tends to take as long as 45 ms. On a newer
system it should be expected to run much faster.
2.2.1.5.2 Training for Classification
Table 2.1: Following PCA the amount of variance accounted for was computed for each type of feature
channel. Each channel started with six scales (dimensions). For many channels, 90% of variance is
accounted for after a reduction to two dimensions. For all others, no more than three dimensions are needed
to account for 90% of variance.
Two modules in VFAT need to be trained prior to usage. These include ICA/PCA
and NPclassify. Training for both has been designed to be as simple as possible in order
to maintain the ease of use goal of the iRoom project. Additionally and fortunately,
training of both modules is relatively quick with ICA/PCA taking less than a minute
using the FastICA algorithm under Matlab and NPclassify taking around two hours using gradient descent training. Since we only ever need to train once, this is not a prohibitive amount of time.
2.2.1.5.3 Training ICA/PCA
Figure 2.10: The various conspicuity maps of the feature channels from the saliency model are shown here after ICA/PCA reduction.
Training was completed by using 145 randomly selected natural images from a
wide range of different image topics. Images were obtained as part of generic public
domain CD-ROM photo packages, which had the images sorted by topic. This enabled us
to ensure that the range of natural images used in training had a high enough variety to
prevent bias towards one type of scene or another. For each image, 300 features were
extracted using the Monte Carlo / visual saliency method described earlier. In all, this gave us 43,500 features on which to train ICA/PCA. The results are shown in table 2.1. For most channels, a reduction from six channels to two still allowed for over 90% of
variance to be accounted for. However, directional channels that measure direction of
motion and orientation of lines in an image needed three dimensions to still account for
more than 90% of all variance. Assuming that the data is relatively linear and a good
candidate for PCA reduction, this suggests that we can effectively reduce the number of
dimensions to less than half while still retaining most of the information obtained from
feature vectors.
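For illustration, the per-channel dimensionality figures of table 2.1 could be computed along the lines of the sketch below, which applies scikit-learn's PCA to one channel's six scale values; the synthetic data at the bottom merely stands in for the 43,500 sampled features.

import numpy as np
from sklearn.decomposition import PCA

def dims_for_variance(channel_features, target=0.90):
    """Number of principal components needed to account for `target`
    of the variance in one feature channel (six scales per sample)."""
    pca = PCA(n_components=channel_features.shape[1])
    pca.fit(channel_features)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, target) + 1)

# Synthetic stand-in for one channel's samples (e.g. red/green):
rng = np.random.default_rng(0)
fake_channel = rng.normal(size=(1000, 6)) @ rng.normal(size=(6, 6))
print(dims_for_variance(fake_channel))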
Visual inspection of the ICA/PCA results seems to show the kind of output one would expect (Figures 2.10 and 2.11). For instance, when two channels are created from six, they are partially negations of each other. On the red/green channel, one of the outputs seems to show a preference for red. However, the other channel does not necessarily show an anti-preference for red. This suggests that color preferences may also depend on the scales of the images. That is, since what makes the six input
images to each channel different is the scale at which they are processed, scale is the
most likely other form of information processed by ICA/PCA. This might mean for
instance that the two channels of mutual information contain information about scaling.
We might guess that, of the two outputs from the red/green channel, one might be a measure of small red things and the other of large green things. If this is the case it makes sense
since in nature, red objects tend to be small (berries, nasty animals, etc.) while green
things tend to be much more encompassing (trees, meadows, ponds).
Figure 2.11: From the original image we see the results of ICA/PCA on the red/green and blue/yellow
channels. As can be seen some parts of the outputs are negations of each other which makes sense since
ICA maximizes mutual information. However, close examination shows they are not negatives. It is
possible that scale information applies as a second input type and prevents obvious negation.
2.2.1.5.4 Training NPclassify
To hone the clustering method we use basic gradient descent with sequential quadratic programming, using the method described by Powell (1978). This was done offline using the Matlab Optimization Toolbox. For this study, error was defined as the number of classes found versus the number it was expected to find (see Figure 2.12).
Figure 2.12: In this image there are basically three objects. NPclassify has found two (colors represent the class of each location). This is used as the error to train it: across the 80 training images it should find the expected number of objects, and the closer it gets to that number, the better. Notice that the points are clustered in certain places. This is due to the saliency/Monte Carlo method used for feature selection.

Thus, we presented the clustering algorithm with 80 natural training images. Each image had a certain number of objects in it. For instance, an image with a ball and a wheel in it
would be said to have two objects. The clustering algorithm would state how many
classes it thought it found. If it found three classes in an image with two objects then the
error was one. The error was computed as average error from the training images. The
training program was allowed to adjust any of several hard or soft parameters for
NPclassify during the optimization.
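The training loop can be summarized as in the sketch below, where scipy's SLSQP routine stands in for the Matlab Optimization Toolbox actually used; `npclassify_count` is a hypothetical wrapper around the clustering code that returns how many classes it found for a given parameter vector.

import numpy as np
from scipy.optimize import minimize

def training_error(params, training_sets, expected_counts, npclassify_count):
    """Mean absolute difference between the number of classes NPclassify
    finds and the number of objects actually present in each image."""
    errors = [abs(npclassify_count(features, params) - n_objects)
              for features, n_objects in zip(training_sets, expected_counts)]
    return float(np.mean(errors))

# Sequential quadratic programming over the hard and soft NPclassify parameters:
# result = minimize(training_error, x0=initial_params,
#                   args=(training_sets, expected_counts, npclassify_count),
#                   method="SLSQP")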
The training data was comprised of eight base objects of varying complexity, such as balls and a wheel on the simple side or a mini tripod and a web cam on the more complex side. Objects were placed on a plain white table in different configurations. Images contained different numbers of objects as well. For instance, some images contained only one object at a time, while others contained all eight. A separate set of
validation images was also created. These consisted of a set of eight different objects
under different lighting, created by altering the f-stop on the camera. Thus, the training images were taken with an f-stop of 60 while the 83 validation images were taken with an f-stop of 30. Additionally, the viewing angle and distance are not the same between the training and validation sets. The validation images were not used until after optimal parameters had been obtained from the training images. Then the exact same parameters were used for the validation phase.
Our first test was to examine if we could at the very least segment images such
that the program could tell which objects were different from each other. For this test
spatial interaction was taken into account. We did this by adding spatial coordinates as two more features alongside the new set of 14 ICA/PCA-reduced features in each feature vector. The sum total of the spatial features was weighted about the same as the sum total of the non-spatial features. As such, the membership of an object in one segmented class or another was based half on its location in space and half on its base feature vector composition. Reliability was measured by counting the number of times objects were classified as single objects, the number of times separate objects were merged into one object and the number of times a single object was split into two unique objects. Additionally, there was a fourth category for when objects were split into more than three objects. This category was small and contained only four instances.
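The spatial augmentation can be sketched as below; the weighting rule shown (scaling the two coordinate columns so their total magnitude roughly matches that of the 14 appearance columns) is an assumption standing in for whatever weighting the actual system used.

import numpy as np

def add_spatial_features(features, xy, spatial_weight=None):
    """Append image coordinates to each 14-D ICA/PCA-reduced feature vector.

    features : (N, 14) reduced feature vectors
    xy       : (N, 2)  pixel coordinates of each sampled feature
    The coordinate columns are scaled so that, summed, they carry about the
    same weight as the appearance columns, making class membership depend
    roughly half on position and half on base features.
    """
    if spatial_weight is None:
        spatial_weight = np.abs(features).sum() / (np.abs(xy).sum() + 1e-9)
    return np.hstack([features, spatial_weight * xy])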
The results were generally promising in that based upon simple feature vectors
alone, the program was able to segment objects correctly with no splits or merges in 125
out of the 223 objects it attempted to segment. In 40 instances an object was split into two
objects. Additionally, 54 objects were merged into one object. While on the surface these numbers might seem discouraging, there are several important factors to take into account. The first is that the program was segmenting based solely on simple feature vectors with a spatial cue. As such, it could frequently merge one shiny black object into another shiny black object. In 62% of the merger cases, it was obvious that the merged objects were very similar with respect to features.
2.2.1.5.5 NPclassify v. K-Means
NPclassify was also tested on its general ability to classify feature clusters. In this
case it was compared with K-means. However, since K-means requires the number of
classes to be specified a priori, this was provided to it. So in essence, the K-means
experiment had the advantage of knowing how many classes it would need to group,
while NPclassify did not.
The basic comparison test was similar to the test presented in the previous section.
In this case, several Gaussian-like clusters of arbitrary two-dimensional features were created, with between 1 and 10 classes in each data set. Fifty of the sets were clean, with no noise, such that all feature vectors belonged explicitly to a ground-truth class. In the other 50 sets, small amounts of random noise were added. The comparison
metric for K-means and NPclassify was how often classes were either split or merged
when they should have not been. The mean error for both conditions is shown below in
figure 2.13. It should be noted that while K-means may be sensitive to noise in the data, it is used here since it is well known and can serve as a good baseline for any clustering algorithm.
Figure 2.13: NPclassify is compared with K-means for several data sets. The classification error for the different sets is the same when there is little noise in the data. However, after injecting some noise, NPclassify performs better.
The general conclusion is that compared with K-means, NPclassify is superior
particularly when there is noise in the data. This is not particularly surprising since, as a spanning-tree-style algorithm, NPclassify can ignore non-proximal data points much more easily. That is, K-means is forced to take all data points into account and has no innate ability to determine that an outlying data point should be thrown away. NPclassify, however, will detect the jump in distance from the central density of the real class to an outlier or noise point.
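The contrast can be illustrated with the toy comparison below, which is not NPclassify itself: single-linkage clustering cut at a distance jump behaves like a spanning-tree method and leaves outliers in tiny clusters that can be discarded, while K-means must fold every noise point into one of its K classes.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two tight 2-D clusters plus a few scattered noise points.
data = np.vstack([rng.normal([0.0, 0.0], 0.1, (50, 2)),
                  rng.normal([3.0, 3.0], 0.1, (50, 2)),
                  rng.uniform(-2.0, 5.0, (5, 2))])

# Spanning-tree style: cut single-link merges at a large distance jump so
# outliers end up in tiny clusters that can simply be thrown away.
labels_tree = fcluster(linkage(data, method="single"), t=0.5, criterion="distance")
sizes = np.bincount(labels_tree)
kept = [lab for lab in np.unique(labels_tree) if sizes[lab] >= 10]
print("tree-style classes kept:", len(kept))

# K-means is told the true number of classes but must still place every
# point, noise included, into one of them.
labels_km = KMeans(n_clusters=2, n_init=10).fit(data).labels_
print("k-means classes:", len(np.unique(labels_km)))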
2.2.1.5.6 Contiguity
Figure 2.14: Tracking from frame 299 to frame 300 the shirt on the man is tracked along with the head
without prior knowledge of what is to be tracked. It should be noted that while the dots are drawn in during simulation, the ellipses are drawn in by hand to aid illustration in gray-scale printing.
Contiguity has been tested but not fully analyzed (Figure 2.14). Tracking in video
uses parameters for NPclassify obtained in section 2.2.1.5.4. Thus, the understanding of how to track over consecutive frames is based on the computer's subjective understanding of good continuity for features. In general, classes of features can be tracked for 15 to 30 frames before the program loses track of the object. This is not an impressive result in and of itself, but several factors should be noted. First, each object is tracked by VFAT without priors for what its features should be. Thus, the program is tracking an object without having been told either to track that object or what the object it is tracking should look like. The tracking is free form and, in general, without feature-based priors. The major limiter on the contiguity of tracking is that an object may lose saliency as a scene evolves. As such, if an object becomes too low in saliency, far fewer features will be selected from it for processing, which destroys the track given the current feature qualities. However, as will be noted in the next
section, this is not a problem since this tracker is used to hand off trackable objects to a
simple tracker which fixates much better on objects to be tracked.
2.3 The Simple Feature Based Tracker
For more information see also: http://ilab.usc.edu/wiki/index.php/Color_Tracker_How_To
Figure 2.15: The Simple tracker works by taking in initial channel values such as ideal colors. These are
used to threshold an image and segment it into many candidate blobs. This is done by connecting pixels
along scan lines that are within the color threshold. The scan lines are then linked which completes a
contiguous object into a blob. The blobs can be weeded out if they are for instance too small or too far from
where the target last appeared. Remaining blobs can then be merged back and analyzed. Finding the center
of mass of the left over blobs gives us the target location. By finding the average color values in the blob,
we can define a new adapted color for the next image frame. Thus, the threshold color values can move
with the object.
Once a signature is extracted using the complex tracker described in the previous section, it can be fed to a faster and simpler tracking device. We use a multi-channel tracker, which uses color thresholding to find candidate pixels and then links them together. This allows it to not only color threshold an image, but also to segregate blobs and analyze them separately. So, for instance, if it is tracking a yellow target and another yellow target appears, it can distinguish between the two. Additionally, the tracker also computes
color adaptation as well as adaptation over any channel it is analyzing. We compute, for instance, a new average channel value $\bar{c}$ in eq. (2.18) as the sum of the pixel values $c_{ip}$ in this channel over all $N_i$ pixels $p$ in tracked 'OK' blobs (as seen in figure 2.15) from the current frame $t$ back to some past frame $t'$. In basic terms, this is just the average channel value for all the trackable pixels in several consecutive past frames. Additionally, we compute $\sigma$, which is just the basic standard deviation over the same pixels.
(2.18)
\[
\bar{c} \;=\; \frac{\sum_{i=t'}^{t}\sum_{p=0}^{N_i} c_{ip}}{\sum_{i=t'}^{t} N_i}
\qquad \text{and} \qquad
\sigma \;=\; \sqrt{\frac{\sum_{i=t'}^{t}\sum_{p=0}^{N_i}\left(c_{ip}-\bar{c}\right)^{2}}{\sum_{i=t'}^{t} N_i - 1}}
\]
Currently, we set a new pixel as a candidate for tracking if, for every channel, its pixel value $p_c$ satisfies:
(2.19)
\[
\bar{c} - \alpha \cdot \sigma \;\le\; p_c \;\le\; \bar{c} + \alpha \cdot \sigma
\]
Thus, a pixel is thresholded and selected as a candidate if, for each channel, it falls within the mean value $\bar{c}$ computed from eq. (2.18) plus or minus the product of the standard deviation $\sigma$ and a constant $\alpha$. Forgetting is accomplished in the adaptation by simply windowing the sampling interval.
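The adaptation in eqs. (2.18) and (2.19) can be sketched for a single channel as below; the window length and the value of alpha are illustrative choices, not the values used in the actual tracker.

from collections import deque
import numpy as np

class AdaptiveChannel:
    """Adaptive threshold for one channel, in the spirit of eqs. (2.18)-(2.19):
    pool the pixel values from the 'OK' blobs of the last few frames and
    accept new pixels that lie within mean +/- alpha * standard deviation."""

    def __init__(self, window=5, alpha=2.0):
        self.frames = deque(maxlen=window)   # windowing implements forgetting
        self.alpha = alpha
        self.mean, self.std = 0.0, 0.0

    def update(self, ok_blob_pixels):
        """ok_blob_pixels: 1-D array of this channel's values inside the
        blobs kept by the tracker in the current frame."""
        self.frames.append(np.asarray(ok_blob_pixels, dtype=float))
        pooled = np.concatenate(self.frames)
        self.mean = pooled.mean()
        self.std = pooled.std(ddof=1) if pooled.size > 1 else 0.0

    def candidates(self, channel_image):
        """Boolean mask of pixels that pass the adapted threshold (2.19)."""
        lo = self.mean - self.alpha * self.std
        hi = self.mean + self.alpha * self.std
        return (channel_image >= lo) & (channel_image <= hi)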
This method allows the tracker to track a target even if its color changes due to
changes in lighting. It should be noted that the simple tracker can track other features in
addition to color so long as one can create a channel for it. That is, an RGB image can be
separated into three channels, which are each gray scale images. In this case, we create
one for red, one for green and one for blue. We can also create images that are, for instance, the responses of edge orientation filters or motion filters. These can be added as extra channels in the simple tracker in the same manner. However, to preserve luminance invariance we use the H2SV color scheme described in appendix G. This is just an augmentation of HSV color space that resolves the singularity at red by converting hue into Cartesian coordinates.
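One plausible form of that hue-to-Cartesian conversion is sketched below; the exact H2SV encoding is given in appendix G, so the scaling here should be read as an assumption rather than the definition.

import colorsys
import math

def rgb_to_h2sv(r, g, b):
    """Convert RGB (floats in 0..1) to an H2SV-style tuple (h1, h2, s, v).

    Standard HSV hue is an angle, so reds just below 1.0 and just above 0.0
    are numerically far apart even though they are perceptually adjacent.
    Mapping the hue angle to Cartesian components removes that wrap-around
    singularity at red.
    """
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    angle = 2.0 * math.pi * h
    h1 = 0.5 * (math.cos(angle) + 1.0)   # both components rescaled to 0..1
    h2 = 0.5 * (math.sin(angle) + 1.0)
    return h1, h2, s, v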
In addition to the basic vision functional components of the simple tracker, its
code design is also important. The tracker is object oriented which makes it easy to create
multiple independent instances of the simple tracker. That is, we can easily run several
simple trackers on the same computer each tracking different objects from the same video
feed. The computational work for each tracker is fairly low and four independent trackers
can simultaneously process 30 frames per second on an AMD Athlon 2000 XP processor
based machine. This makes it ideal for the task of tracking multiple targets at the same
time.
2.4 Linking the Simple and Complex Tracker
In order for the simple tracker and the complex tracker to work together they have
to be able to share information about a target. As such the complex tracker must be able
to extract information about objects that is useful to the simple tracker (Figure 2.16).
Additionally, linking the simple tracker with the complex tracker creates an interesting
problem with resource allocation. This is because each simple tracker we instantiate
tracks one target at a time while the complex tracker has no such limit. A limited number
of simple trackers can be created and there must be some way to manage how they are
allocated to a task based on information from the complex tracker.
Figure 2.16: The simple and complex trackers are linked by using the complex tracker to notice and
classify features. The complex tracker then places information about the feature classes into object feature
class signatures. The complex tracker uses these signatures to keep track of objects over several frames or
to bias the way in which it classifies objects. The signatures are also handed to simple trackers, which track
the objects with greater proficiency. Here we see two balls have been noticed and signatures have been
extracted and used to assign each ball to its own tracker. The smaller target boxes on the floor show that the
simple tracker was handed an object (the floor), which it does not like and is not tracking. Thus, the simple
tracker has its own discriminability as was mentioned in section 2.3 and figure 2.15.
We address the first problem by making sure both trackers work with similar
feature sets. So, for example, when the complex tracker runs it will examine the H2SV color of all the classes it creates. It then computes the mean color value for each class. This mean color value, along with the standard deviation of the color, can then be handed
to the simple tracker, which uses it as the statistical prior color information for the object
it should track.
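A signature of this kind can be sketched as a per-channel mean and standard deviation over a class's pixels, as below; the dictionary layout is purely illustrative.

import numpy as np

def class_signature(class_pixels_h2sv):
    """Summarize one complex-tracker class as per-channel statistics.

    class_pixels_h2sv : (N, 4) array of (h1, h2, s, v) values for the pixels
                        belonging to one class.
    The mean and standard deviation form the color prior handed to a simple
    tracker, e.g. to seed its adaptive thresholds.
    """
    return {"mean": class_pixels_h2sv.mean(axis=0),
            "std": class_pixels_h2sv.std(axis=0, ddof=1)}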
Figure 2.17: This is a screen grab from a run of the combined tracker. The lower left two images show the
complex tracker noticing objects, classifying and tracking them. The signature is handed to the simple
tracker, which is doing the active tracking in the upper left window. The combined tracker notices the man
entering the room and tracks him without a priori knowledge of how he or the room looks. Once he walks
off the right side, the tracker registers a loss of track and stops tracking. The bars on the right side show the
adapted actively tracked colors from the simple tracker in H2SV color. The lower right image shows that
many blobs can fit the color thresholds in the simple tracker, but most are selected out for reasons such as
expected size, shape and position.
The second issue of resource allocation is addressed less easily. However, there
are simple rules for keeping resource allocation under control. First, don’t assign a simple
tracker to track an object that overlaps with a target another simple tracker is tracking in
the same camera image. Thus, don’t waste resources by tracking the same target with two
or more trackers. Additionally, since the trackers are adaptive, we may discover that two trackers were assigned to the same target without knowing it earlier. For instance, if one simple tracker is accidentally set to track the bottom of a ball and another the top, after a few iterations of adaptation both trackers will envelop the whole ball. It is
thus advantageous to check for overlap later. If we find this happening, we can dismiss
one of the simple trackers as redundant. Additionally, our finite resources mean we do
not assign every unique class from the complex tracker to a simple tracker. Instead, we try to quantify how interesting a target is. For instance, potential targets for the simple
tracker may be more interesting if they are moving, have a reasonable mass or have been
tracked by the complex tracker for a long enough period of time.
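These allocation rules can be summarized in a sketch like the one below; the box representation, the interest score and the thresholds are all hypothetical stand-ins for whatever the actual system used.

def boxes_overlap(a, b):
    """Axis-aligned overlap test for two (x0, y0, x1, y1) target boxes."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def allocate_trackers(candidates, trackers, max_trackers, min_interest):
    """Assign simple trackers to complex-tracker classes.

    candidates : list of dicts with 'box' and 'interest' (a score built, for
                 example, from motion, mass and how long the class persisted)
    trackers   : list of dicts with 'box' for currently active simple trackers
    """
    # Dismiss trackers whose adapted boxes have converged onto the same target.
    active = []
    for t in trackers:
        if not any(boxes_overlap(t["box"], kept["box"]) for kept in active):
            active.append(t)
    # Spend the remaining tracker slots on interesting, non-overlapping targets.
    for c in sorted(candidates, key=lambda c: c["interest"], reverse=True):
        if len(active) >= max_trackers or c["interest"] < min_interest:
            break
        if not any(boxes_overlap(c["box"], t["box"]) for t in active):
            active.append({"box": c["box"]})
    return active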
2.5 Results
On the test videos used, the system described seems to work very well. A video of
a man entering and leaving a room (Figure 2.17) was shown five times to the combined
complex and simple tracker. In each run, the man was noticed within a few frames of
entering the camera's view. This was done without prior knowledge of how the target
should appear and without prior knowledge of the room’s appearance. The features were
extracted and a simple tracker was automatically assigned to track the man, which did so,
until he left the room, at which point the simple tracker registered a loss of track.
Interestingly enough, the tracker extracted a uniform color over both the man's shirt and his skin. It was thus able, in several instances, to track the man as both his shirt and his skin. Thus, even though the shirt was burgundy and the skin reddish, the combined tracker was able to find a statistical distribution for H2SV color that encompassed the color of both objects as distinct from the colors of objects in the rest of the room.
The tracker was also tested on a video where a blue and yellow ball both swing on
a tether in front of the camera. In five out of five video runs, both balls are noticed and
their features extracted. Each ball is tracked as a separate entity by being assigned by the
program its own simple tracker. Each ball is tracked until it leaves the frame, at which
point the simple trackers register a loss of track. The balls even bounce against each
other, which demonstrates that the tracker will trivially discriminate between objects even
when they are touching or overlapping.
In both video instances, objects are tracked without the program knowing the
features of the object to be tracked a priori. Instead, saliency is used to notice different
possible targets and the complex tracker is used to classify possible targets into classes.
This was then used to hand target properties to the simple trackers as automatically
generated prior information about the targets to be tracked. Additionally, the simple
tracker will register a loss of track when the target leaves the field of view. This allows us to notice not only when a new target enters our field of view, but also when it leaves.
The tracking was also aided by the use of H2SV color. Prior to using the H2SV
color scheme, the purple shirt the man is wearing was split into two objects since the color of many of the pixels bordered on, and even crossed into, the red part of the hue spectrum. Thus, standard HSV created a bi-modal distribution for hue. The use of H2SV allowed us to track the purple shirt as well as objects that are reddish in hue, such as skin. H2SV color also works for tracking objects in the center of the spectrum, which we observed by tracking objects that are green, yellow and blue.
In addition to tracking using a static camera, the same experiment was done using a moving camera. This is much less trivial since the common method of eigen background subtraction cannot be used to distinguish new things in a scene from the original scene. Again the tracker was able to track a human target without prior
knowledge of features even as the camera moved. This is a distinct advantage for our
tracker and illustrates the advantage of using saliency to extract and bind features since it
can compensate for global motion.
2.6 Discussion
2.6.1 Noticing
The most notable and important aspect of the current work is that we are able to
track objects or people without knowing what they will look like a priori and we are able
to do so quickly enough for real time applications. Thus, we can notice, classify and track
a target fairly quickly. This has useful applications in many areas and in particular
security. This is because we track something based on how interesting it is and not based
on complete prior understanding of its features. Potentially, we can then track any object
or person even if they change their appearance. Additionally, since we extract a viewpoint-invariant signature describing a target, this information can be used to share target information with other agents.
2.6.2 Mixed Experts
Additionally, we believe we are demonstrating a better paradigm in the
construction of intelligent agents, one that uses a variety of experts to accomplish the
task. The idea is to use a variety of solutions that work on flexible weak meta-prior
information, but then use their output as information for a program that is more biased.
This is founded on the idea that there is no perfect tool for all tasks and that computer
vision is comprised of many tasks such as identification, tracking and noticing. To
accomplish a complex task of noticing and tracking objects or people, it may be optimal to utilize many different types of solutions and have them interact. Additionally, by mixing experts in this way, no one expert necessarily needs to be perfect at its job. If the experts have some ability to monitor one another, then when one expert makes a mistake, it can possibly be corrected by another expert. It should be noted that this tends to follow a biological approach in which the human brain may be made up of interacting experts, all of which depend on other expert regions in order to complete a variety of tasks.
Another important item to note in the mixed experts paradigm is that while it may
make more intuitive sense to use such an approach, new difficulties arise as our system
becomes more abstractly complex. So as an example, if one works with support vector
machines only, then one has the advantage of a generally well-understood mathematical
framework. It is easier to understand a solution's convergence, complexity and stability in a system that is relatively homogeneous. When one mixes experts, particularly if the experts act very differently, the likelihood of the system doing something unexpected or even catastrophic tends to increase. Thus, when one designs an intelligent agent with mixed experts, system complexity should be managed carefully.
2.6.3 System Limitations and Future Work
The system described has its own set of limitations. The work up to this point has
concentrated on being able to notice and track objects in a scene quickly and in real time.
However, its identification abilities are still somewhat limited. It does not contain a memory that would let it store and identify old targets over the long term. Such an ability is in the works, and it should be aided by the tracking system's capacity to narrow the area of the image that needs to be inspected, which should increase the speed of visual recognition.
Chapter 3: Contour Integration and Visual Saliency
In the visual world there are many things we can see, but certain features, sets of features and other image properties tend to draw our visual attention towards them more strongly. A very simple example is a stop sign, in which the red color and the angular features of an octagon combine with the strong word "stop" to create something that, hopefully, we would not miss if we came upon it. Such a propensity of some visual features to attract attention defines in part the phenomenon of visual saliency. Here we assert, as others have (James, 1890, Treisman & Gelade, 1980, Koch & Ullman, 1985, Itti & Koch, 2001b), that saliency is drawn from a variety of factors. At the lowest levels, color opponents, unique orientations and luminance contrasts create the effect of visual pop-out (Treisman & Gelade, 1980, Wolfe, O'Neill & Bennett, 1998). Importantly, these studies have highlighted the role of competitive interactions in determining saliency; hence, a single stop sign on a natural scene backdrop is usually highly salient, but the saliency of that same stop sign, and its ability to draw attention, is strongly reduced when many similar signs surround it. At the highest levels it has been proposed