Originally the only way to look inside the human body without opening it up was by means of two dimensional (2D)
images obtained using X-ray equipment. The fact that human anatomy is inherently three dimensional leads to
ambiguities in interpretation and problems of occlusion. Three dimensional (3D) imaging modalities such as CT, MRI
and 3D ultrasound remove these drawbacks and are now part of routine medical care. While most hospitals have 'gone
digital', meaning that the images are no longer printed on film, the images are still viewed on 2D screens. In this
way valuable depth information is lost, and some interactions become unnecessarily complex or even unfeasible. Using a
virtual reality (VR) system to present volumetric data means that depth information is presented to the viewer and 3D
interaction is made possible. At the Erasmus MC we have developed V-Scope, an immersive volume visualization
system for visualizing a variety of (bio-)medical volumetric datasets, ranging from 3D ultrasound, via CT and MRI, to
confocal microscopy, OPT and 3D electron-microscopy data. In this talk we will address the advantages of such a system
for both medical diagnostics and (bio)medical research.

The egg-rolling behavior of the graylag goose is an often quoted example of a fixed-action pattern. The bird will
even attempt to roll a brick back to its nest! Despite excellent visual acuity, it apparently takes a brick for an
egg. Evolution optimizes utility, not veridicality. Yet textbooks take it for a fact that human vision evolved
so as to approach veridical perception. How do humans manage to dodge the laws of evolution? I will show
that they don't, but that human vision is an idiosyncratic user interface. By way of an example I consider the
case of pictorial perception. Gleaning information from still images is an important human ability and is likely
to remain so for the foreseeable future. I will discuss a number of instances of extreme non-veridicality and
huge inter-observer variability. Despite their importance in applications (information dissemination, personnel
selection,...) such huge effects have remained undocumented in the literature, although they can be traced to
artistic conventions. The reason appears to be that conventional psychophysics-by design-fails to address the
qualitative, that is the meaningful, aspects of visual awareness whereas this is the very target of the visual arts.

Some people are born with an intuitive sense of good composition. They do not need to be taught
composition, and their work is immediately perceived as good by other people. In an attempt to help
others learn composition, art critics, scientists and psychologists have analyzed well-composed works in the hope of
recognizing patterns and trends that anyone could employ to achieve similar results.
Unfortunately, the identified patterns are by no means universal. Moreover, since a compositional rule is
useful only as long as it enhances the idea that the artist is trying to express, there is no objective standard to
judge whether a given composition is "good" or "bad". As a result, the study of composition seems to be full
of contradictions. Nevertheless, there are several basic "low level" rules supported by physiological studies in
visual perception that artists and photographers intuitively obey.
Regardless of image content, a prerequisite for all good images is that their composition be
balanced. In a balanced composition, factors such as shape, direction, location and color are determined in a
way that is pleasant to the eye. An unbalanced composition looks accidental, transitory and its elements show
a tendency to change place or shape in order to reach a state that better reflects the total structure. Under these
conditions, the artistic statement becomes incomprehensible and confusing.

In the design of professional luminaires, improving visibility has always been a core target. Recently, it has become
clearer that especially for consumer lighting, generating an appropriate atmosphere and pleasant feeling is of almost
equal importance. In recent studies it has been shown that the perception of an atmosphere can be described by four
variables: cosiness, liveliness, tenseness, and detachment. In this paper we compare the perception of these lighting
characteristics when viewed in reality with the perception when viewing a simulated picture. Replacing reality by a
picture on a computer screen such as an LCD monitor, or a piece of paper, introduces several differences. These include
a reduced dynamic range, reduced maximum brightness and quantization noise in the brightness levels, but also a
different viewing angle and a different adaptation of the human visual system. Research has been done before to compare
simulations with photographs, and simulations with reality. These studies have focused on 'physical variables', such as
brightness and sharpness, but also on naturalness and realism. We focus on the accuracy of a simulation for the
prediction of the actual goal of a lot of luminaires: atmosphere creation. We investigate the correlation between
perceptual characteristics of the atmosphere of a real-world scene and a simulated image of it. The results show that for
all four tested atmosphere words, similar main effects and similar trends (over color temperature, fixtures, intensities) can be
found in both the real life experiments and the simulation experiments. This implies that it is possible to use simulations
on a screen or printout for the evaluation of atmosphere characteristics.

A key aspect of image effectiveness is how well the image visually communicates the main subject. In consumer images,
two important features that impact viewer appreciation of the main subject are the amount of clutter and the main subject
placement within the image. Two subjective experiments were conducted to assess the relationship between aesthetic
and technical quality and perception of clutter and image center. For each experiment, 30 participants evaluated the same
70 images, on 0 to 100-point scales for aesthetic and technical quality. For the clutter experiment, participants also
evaluated the images, on 0 to 100-point scales for amount of clutter and main subject emphasis. For the center
experiment, participants pointed directly onto the image to mark the center of interest. Results indicate that aesthetic
quality, technical quality, amount of clutter, and main subject emphasis are strongly correlated. Based on 95%
confidence ellipses and mean-shift clustering, expert main subject maps are consistent with observer identification of
main subject location. Further, the distribution of the observer identification of the center of interest is related to the
object class (e.g., person, scenery). Additional features related to image composition can be used to explain clusters formed by patterns of mean ratings.
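The mean-shift clustering of observers' center-of-interest marks mentioned above can be illustrated with a minimal sketch; the click coordinates, bandwidth, and cluster locations below are synthetic assumptions, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical "center of interest" clicks from 30 observers on one image,
# in normalized coordinates; two clusters simulate two salient objects.
clicks = np.vstack([
    rng.normal([0.3, 0.4], 0.02, size=(15, 2)),
    rng.normal([0.7, 0.6], 0.02, size=(15, 2)),
])

def mean_shift(points, bandwidth=0.1, n_iter=50):
    """Flat-kernel mean shift: move every point to the mean of its neighbours."""
    modes = points.copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            nbrs = points[np.linalg.norm(points - modes[i], axis=1) < bandwidth]
            modes[i] = nbrs.mean(axis=0)
    # Merge modes closer than the bandwidth into one cluster centre.
    centres = []
    for m in modes:
        if not any(np.linalg.norm(m - c) < bandwidth for c in centres):
            centres.append(m)
    return np.array(centres)

centres = mean_shift(clicks)
```

Each converged mode is a candidate main-subject location; the number of surviving centres indicates how many distinct centers of interest the observers agreed on.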

Visual cognition is of significant importance in certain imaging applications, such as security and surveillance. In
these applications, an important issue is to determine the cognition threshold, which is the maximum distortion
level that can be applied to the images while still ensuring that enough information is conveyed to recognize
the scene. The cognition task is usually studied with images that represent the scene in the visible part of the
spectrum. In this paper, our goal is to evaluate the usefulness of another scene representation. To this end, we
study the performance of near-infrared (NIR) images in cognition. Since surface reflection in the NIR part of the
spectrum is material dependent, an object made of a single material is more likely to have a uniform response
in NIR images. Consequently, edges in NIR images are likely to correspond to the physical boundaries
of the objects, which are considered to be the most useful information for cognition. This feature of the NIR
images leads to the hypothesis that NIR is a better scene representation than the visible spectrum for cognition
tasks. To test this hypothesis, we compared the cognition thresholds of NIR and visible images by performing a
subjective study on 11 scenes. The images were compressed with different compression factors using JPEG2000
compression. The results of this subjective test show that recognizing 8 out of the 11 scenes is significantly easier
based on the NIR images when compared to their visible counterparts.
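The notion of a cognition threshold can be sketched as interpolation on a recognition-accuracy curve; the compression factors, accuracy values, and the 75% criterion below are hypothetical illustrations, not the study's results.

```python
import numpy as np

# Hypothetical recognition rates for one scene at several JPEG2000
# compression factors (higher factor = stronger compression).
factors = np.array([50.0, 100.0, 150.0, 200.0, 250.0])
acc_nir = np.array([1.0, 0.95, 0.85, 0.60, 0.30])
acc_vis = np.array([1.0, 0.90, 0.60, 0.35, 0.15])

def cognition_threshold(factors, acc, criterion=0.75):
    """Highest compression factor at which accuracy still reaches the
    criterion, by linear interpolation on the (monotone) accuracy curve."""
    return float(np.interp(criterion, acc[::-1], factors[::-1]))

thr_nir = cognition_threshold(factors, acc_nir)
thr_vis = cognition_threshold(factors, acc_vis)
# A higher threshold means the scene tolerates stronger compression
# while remaining recognizable.
```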

The use of gesture as a natural interface plays a crucial role in achieving intelligent Human Computer
Interaction (HCI). Human gestures comprise different components of visual actions, such as the motion of the hands, facial
expression, and torso, to convey meaning. So far, in the field of gesture recognition, most previous work has
focused on the manual component of gestures. In this paper, we present an appearance-based multimodal gesture
recognition framework, which combines the different groups of features such as facial expression features and
hand motion features, which are extracted from image frames captured by a single web camera. We consider 12 classes
of human gestures with facial expressions conveying neutral, negative and positive meanings from American Sign
Language (ASL). We combine the features at two levels by employing two fusion strategies. At the feature level,
an early feature combination can be performed by concatenating and weighting different feature groups, and
LDA is used to choose the most discriminative elements by projecting the feature on a discriminative expression
space. The second strategy is applied on decision level. Weighted decisions from single modalities are fused in
a later stage. A condensation-based algorithm is adopted for classification. We collected a data set with three
to seven recording sessions and conducted experiments with the combination techniques. Experimental results
showed that facial analysis improves hand gesture recognition, and that decision-level fusion performs better than
feature-level fusion.
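The two fusion strategies can be contrasted in a minimal sketch; the weights, feature dimensions, and per-modality class scores are illustrative assumptions (the paper's LDA projection and condensation-based classifier are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(1)

# Feature-level fusion: weight, then concatenate feature groups before
# classification (hypothetical hand-motion and facial-expression features).
f_hand = rng.normal(size=30)
f_face = rng.normal(size=20)
w_hand, w_face = 0.7, 0.3          # assumed modality weights
fused_features = np.concatenate([w_hand * f_hand, w_face * f_face])

# Decision-level fusion: each modality first produces its own class
# posteriors over the 12 gesture classes; the decisions are then combined.
p_hand = rng.dirichlet(np.ones(12))
p_face = rng.dirichlet(np.ones(12))
fused_scores = w_hand * p_hand + w_face * p_face
predicted_class = int(np.argmax(fused_scores))
```

In the feature-level scheme a single classifier sees the concatenated vector; in the decision-level scheme each modality is classified separately and only the weighted scores are merged.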

Common controls for photographic editing can be difficult to use and have a significant learning curve. Often, a
user does not know a direct mapping from a high-level concept (such as "soft") to the available parameters or
controls. In addition, many concepts are subjective in nature, and the appropriate mapping may vary from user
to user. To overcome these problems, we propose a system that can quickly learn a mapping from a high-level
subjective concept onto low-level image controls using machine learning techniques. To learn such a concept, the
system shows the user a series of training images that are generated by modifying a seed image along different
dimensions (e.g., color, sharpness), and collects the user ratings of how well each training image matches the
concept. Since it is known precisely how each modified example is different from the original, the system can
determine the correlation between the user ratings and the image parameters to generate a controller tailored
to the concept for the given user. The end result - a personalized image controller - is applicable to a variety
of concepts. We have demonstrated the utility of this approach to relate low-level parameters, such as color
balance and sharpness, to simple concepts, such as "lightness" and "crispness," as well as more complex and
subjective concepts, such as "pleasantness." We have also applied the proposed approach to relate subband
statistics (variance) to perceived roughness of visual textures (from the CUReT database).
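The core idea, correlating user ratings with known parameter offsets to learn a concept direction, can be sketched as a least-squares fit; the parameter dimensions, offsets, and simulated ratings below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Each training image differs from the seed by known parameter offsets
# (hypothetical dimensions: colour balance, sharpness, contrast).
offsets = rng.uniform(-1, 1, size=(40, 3))
# Simulated ratings of how well each variant matches a concept like
# "crispness": this user responds mostly to sharpness (column 1).
ratings = 2.0 * offsets[:, 1] + 0.1 * rng.normal(size=40)

# A least-squares fit recovers the direction in parameter space that
# best explains the ratings: the personalized concept controller.
weights, *_ = np.linalg.lstsq(offsets, ratings, rcond=None)

def apply_concept(params, amount):
    """Move the low-level parameters along the learned concept direction."""
    return params + amount * weights / np.linalg.norm(weights)
```

A single high-level slider then maps `amount` onto all low-level controls at once, tailored to how this particular user rated the training images.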

In this paper, we present the results of a study designed to investigate the visual factors which
contribute to the perceived quality of synthesized textures. A psychophysical experiment was
performed in which subjects rated the quality of textures synthesized from a variety of modern
texture-synthesis algorithms. The ratings were given in terms of how well each synthesized texture
represented a sample from the same material from which the original texture was obtained. The
results revealed that the most detrimental artifact was lack of structural details. Other pronounced
artifacts included: (1) misalignment of the texture patterns; (2) blurring introduced in the texture
patterns; and (3) repeating the same patch again and again (tiling). Based on these results, we present
an analysis of the efficacy of various measurable parameters at predicting the ratings. We show how
a linear combination of the parameters from a parametric texture-synthesis algorithm demonstrates
better performance at predicting the ratings compared to traditional quality-assessment algorithms.
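Predicting ratings from a linear combination of synthesis parameters amounts to a least-squares fit followed by a correlation check; the parameter values and ratings below are simulated stand-ins, not the study's measurements.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical per-texture parameters from a parametric synthesis model
# (e.g. structure, alignment, blur, and tiling scores) for 60 textures.
params = rng.normal(size=(60, 4))
true_w = np.array([1.5, 0.8, -0.6, -1.0])   # assumed ground-truth weighting
ratings = params @ true_w + 0.2 * rng.normal(size=60)

# Fit the linear combination and evaluate it by its correlation with
# the observer ratings, as a quality predictor would be evaluated.
w, *_ = np.linalg.lstsq(params, ratings, rcond=None)
pred = params @ w
corr = np.corrcoef(pred, ratings)[0, 1]
```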

It is widely believed that the phase spectrum of an image contributes much more to the image's visual appearance
than the magnitude spectrum. Several researchers have also shown that this phase information can
be computed indirectly from local magnitude information, a theory which is consistent with the physiological
evidence that complex cells respond to local magnitude (and are insensitive to local phase). Recent studies have
shown that tasks such as image recognition and categorization can be performed using only local magnitude
information. These findings suggest that the human visual system (HVS) uses local magnitude to infer global
phase (image-wide phase spectrum) and thereby determine the image's appearance. However, from a signal-processing
perspective, both local magnitude and local phase are related to global phase. Moreover, in terms
of image quality, distorting the local phase can result in a severely degraded image. These latter facts suggest
that the HVS uses both local magnitude and local phase to determine an image's appearance. We conducted
an experiment to quantify the contributions of local magnitude and local phase toward image appearance as a
function of spatial frequency. Hybrid images were created via a complex wavelet transform in which the
low frequency magnitude, low frequency phase, high frequency magnitude, and high frequency phase were taken
from 2-4 different images. Subjects were then asked to rate how much each of the 2-4 images contributed to
the appearance of the hybrid image. We found that local magnitude is indeed an important factor for image
appearance; however, local phase can play an equally important role, and in some cases, local phase can dominate
the image's appearance. We discuss the implication of these results in terms of image quality and visual coding.

We use the principles of information visualization to guide the design of systems that best meet the needs of
a specific target group of users, namely biologists whose tasks involve the visual exploration of
biological networks. For many biologists who explore networks of interacting proteins and genes, the topological
structure of these node-link graphs is only one part of the story. The Cerebral system supports graph layout in
a style inspired by hand-drawn pathway diagrams, where the location of a protein within the cell constrains its
location within the drawing, and functional groups of proteins are visually apparent as clusters. It also supports
exploration of expression data using linked views, to show these multiple attributes at each node in the graph.
The Pathline system attacks the problem of visually encoding the biologically interesting relationships between
multiple pathways, multiple genes, and multiple species. We propose new methods based on the principle that
perception of spatial position is the most accurate visual channel for all data types. The curvemap view is an
alternative to heatmaps, and linearized pathways support the comparison of quantitative data as a primary
task while showing topological information at a secondary level.

The systems approach to biological research emphasises understanding of complete biological systems, rather than a
reductionist focus on tightly defined component parts. Systems biology is naturally interdisciplinary; research groups
active in this area typically contain experimental and theoretical biologists, mathematicians, statisticians, computer
scientists and engineers. A wide range of tools are used to generate a variety of data types which must be integrated,
presented to and analysed by researchers from any and all of the contributing disciplines. The goal here is to create
predictive models of the system of interest; the models produced must also be analysed, and in the context of the data
from which they were generated. Effective, integrated data and model visualisation methods are crucial if scientifically
appropriate judgments are to be made.
The Nottingham Centre for Plant Integrative Biology (CPIB) takes a systems approach to the study of the root of the
model plant Arabidopsis thaliana. A rich mixture of data types, many extracted via automatic analysis of individual and
time-ordered sequences of standard CCD and confocal laser microscope images, is used to create models of different
aspects of the growth of the Arabidopsis root. This talk briefly reviews the data sets and flow of information within
CPIB, and discusses issues raised by the need to interpret images of the Arabidopsis root and integrate and present the
resulting data and models to an interdisciplinary audience.

Cellular networks are graphs of molecular interactions within the cell. Thanks to the confluence of genome sequencing
and bioinformatics, scientists are now able to reconstruct cellular network models for more than 1,000
organisms. A variety of bioinformatics tools have been developed to support the visualization and navigation
of cellular network data. Another important application is the use of cellular network diagrams to visualize
and interpret large-scale datasets, such as gene-expression data. We present the Cellular Overview, a network
visualization tool developed at SRI International (SRI) to support visualization, navigation, and interpretation
of large-scale datasets on metabolic networks. Different variations of the diagram have been generated algorithmically
for more than 1,000 organisms. We discuss the graphical design of the diagram and its interactive
capabilities.

The data explosion in the biological sciences has led to many novel challenges for the individual researcher. One of
these is to interpret the sheer mass of data at hand. Typical high-throughput transcriptomic data sets can easily
comprise a hundred thousand data points. It is thus necessary to provide tools to interactively visualize these data sets in a
way that aids in their interpretation. To this end we have developed the MAPMAN application. This application renders
individual data points from different domains as glyphs that are color coded to reflect changes in the
magnitude or abundance of the underlying data. To make the data more comprehensible to biologist domain
experts, they are organized on meaningful pathway diagrams that biologists have encountered numerous times.
Using these representations together with a high-level organization helps to quickly grasp the main outcome of
such a high-throughput study and to decide on additional tasks that should be performed to explore the data.

The explosion of online scientific data from experiments, simulations, and observations has given rise to an avalanche of
algorithmic, visualization and imaging methods. There has also been enormous growth in the introduction of tools that
provide interactive interfaces for exploring these data dynamically. Most systems, however, do not support the real-time
exploration of patterns and relationships across tools, and do not provide guidance on which colors, colormaps or
visual metaphors will be most effective. In this paper, we introduce a general architecture for sharing metadata between
applications and a "Metadata Mapper" component that allows the analyst to decide how metadata from one component
should be represented in another, guided by perceptual rules. This system is designed to support "brushing" [1], in which
highlighting a region of interest in one application automatically highlights corresponding values in another, allowing
the scientist to develop insights from multiple sources. Our work builds on the component-based iPlant
Cyberinfrastructure [2] and provides a general approach to supporting interactive exploration across independent
visualization and visual analysis components.

Recent advances in video technology and digital cinema have made it possible to produce entertaining 3D stereoscopic
content that can be viewed for an extended duration without necessarily causing extreme fatigue, visual strain and
discomfort. Viewers naturally focus their attention on specific areas of interest in their visual field. Visual attention is an
important aspect of perception and its understanding is therefore an important aspect for the creation of 3D stereoscopic
content. Most of the studies on visual attention have focused on the case of still images or 2D video. Only a very few
studies have investigated eye movement patterns in 3D stereoscopic moving sequences, and how these may differ from
viewing 2D video content. In this paper, we present and discuss the results of a subjective experiment that we conducted
using an eye-tracking apparatus to record observers' gaze patterns. Participants were asked to watch the same set of
video clips in a free-viewing task. Each clip was shown in a 3D stereoscopic version and a 2D version. Our results indicate
that the extent of areas of interest is not necessarily wider in 3D. We found a very strong content dependency in the
difference of density and locations of fixations between 2D and 3D stereoscopic content. However, we found that
saccades were overall faster and that fixation durations were overall shorter when observers viewed the 3D stereoscopic
version.

This paper studies the influence of blur, a monocular depth cue, on the apparent depth of stereoscopic scenes.
When 3D images are shown on a planar stereoscopic display, binocular disparity becomes a pre-eminent depth cue. But
it induces simultaneously the conflict between accommodation and vergence, which is often considered as a main reason
for visual discomfort. If we limit this visual discomfort by decreasing the disparity, the apparent depth also decreases.
We propose to decrease the (binocular) disparity of 3D presentations, and to reinforce (monocular) cues to compensate
for the loss of perceived depth and keep the apparent depth unaltered. We conducted a subjective experiment using a
two-alternative forced choice task. Observers were required to identify the larger perceived depth in a pair of 3D images
with/without blur. By fitting the results to a psychometric function, we obtained points of subjective equality in terms of
disparity. We found that when blur is added to the background of the image, the viewer perceives larger depth
compared to images without any blur in the background. The increase in perceived depth can be considered as a
function of the relative distance between the foreground and background, while it is insensitive to the distance between
the viewer and the depth plane at which the blur is added.
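Fitting a psychometric function to two-alternative forced-choice proportions and reading off the point of subjective equality can be sketched as follows; the disparity values and response proportions are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: proportion of trials on which the blurred-background
# image was judged deeper, as a function of its disparity (arbitrary units)
# relative to a fixed no-blur reference.
disparity = np.array([-8.0, -6.0, -4.0, -2.0, 0.0, 2.0, 4.0, 6.0])
p_deeper = np.array([0.05, 0.10, 0.20, 0.45, 0.70, 0.85, 0.95, 0.98])

def logistic(x, mu, s):
    """Logistic psychometric function; mu is the 50% point, s the slope."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

(mu, s), _ = curve_fit(logistic, disparity, p_deeper, p0=(0.0, 2.0))
pse = mu  # point of subjective equality: p = 0.5
```

A negative PSE here would mean the blurred version matches the reference's depth at a smaller disparity, i.e. blur adds perceived depth, consistent with the effect reported above.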

We designed a series of experiments to measure user preference for the noise-detail tradeoff, including tests of the
assumption that all true image detail is preferred. We generated samples with noise-detail tradeoff by designing a
sequence of coring filters of increasing strength. A user study method was developed using a magnitude estimation
approach. In the first experiment the coring filter sequence is applied to original video samples without any additional
noise. It is observed that the subjective quality score increases as coring strength is increased, reaches a peak and then
decreases. Thus users prefer slightly cored images compared to original images. In the second experiment the coring
filter sequence is applied to video samples with additive noise of different strength. It is observed that the most preferred
coring strength increases as the amount of noise in the image increases. The results from our experiments can be used to
design parameters for various image/ video post-processing and noise removal algorithms.

Temporal pooling and temporal defects are the two differences between image and video quality assessment.
Whereas temporal pooling has been the object of two recent studies, this paper focuses on the rarely addressed topic of compression-induced temporal artifacts, such as mosquito noise. To study temporal aspects in subjective quality assessment, we compared the perceived quality of two versions of a mosquito noise corrector: one purely
spatial and the other spatio-temporal. We set up a paired-comparison experiment and chose videos whose compression mainly creates temporal artifacts. The results proved the existence of a purely temporal aspect in video quality perception. We investigate the correlation between subjective results from the experiment and three video
metrics (VQM, MOVIE, VQEM), as well as two temporally-pooled image metrics (SSIM and PSNR). The SSIM and PSNR metrics find the corrected sequences to be of better quality than the compressed ones but do not distinguish between the spatial and spatio-temporal processing. The confrontation of these results with the VQM and MOVIE objective
metrics shows that they do not account for this type of defect. A detailed study highlights that either they do not detect the defects or the response of their temporal component is masked by that of their spatial component.
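Temporal pooling of an image metric can be sketched with per-frame PSNR averaged over the sequence (SSIM would be pooled the same way); the synthetic "video" here is random noise, purely for illustration.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio between two frames, in dB."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(peak**2 / mse)

rng = np.random.default_rng(5)
ref_video = rng.integers(0, 256, size=(10, 32, 32))          # 10 frames
test_video = np.clip(ref_video + rng.normal(0, 5, ref_video.shape), 0, 255)

# Temporal pooling: score every frame, then average over the sequence.
frame_scores = [psnr(r, t) for r, t in zip(ref_video, test_video)]
pooled = np.mean(frame_scores)
```

Mean pooling like this captures average fidelity but, as the study above notes, is blind to purely temporal artifacts that fluctuate from frame to frame.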

It has been observed that electronic magnification of imagery results in a decrease in the apparent contrast of the
magnified image relative to the original. The decrease in perceived contrast might be due to a combination of image blur
and of sub-sampling the larger range of contrasts in the original image. In a series of experiments, we measured the
effect on apparent contrast of magnification in two contexts: either the entire image was enlarged to fill a larger display
area, or a portion of an image was enlarged to fill the same display area, both as a function of magnification power and
of viewing distance (visibility of blur induced by magnification). We found a significant difference in the apparent
contrast of magnified versus unmagnified video sequences. The effect on apparent contrast was found to increase with
increasing magnification, and to decrease with increasing viewing distance (or with decreasing angular size). Across
observers and conditions the reduction in perceived contrast was reliably in the range of 0.05 to 0.2 log units (89% to
63% of nominal contrast). These effects are generally consistent with expectations based on both the contrast statistics
of natural images and the contrast sensitivity of the human visual system. It can be demonstrated that 1) local areas
within larger images or videos will usually have lower physical contrast than the whole; and 2) visibility of 'missing
content' (e.g. blur) in an image is interpreted as a decrease in contrast, and this visibility declines with viewing distance.

This paper studies the impact of freezing of video on quality as experienced by users. Two types of freezes are
investigated: first, a freeze where the image pauses and no frames are lost (frame halt); second, a freeze where the
image pauses and then skips that part of the video (frame drop). Mean Opinion Scores (MOS) were measured in
subjective tests. Video sequences of 20 seconds were displayed, for four types of content, to a total of 23 test subjects.
We conclude there is no difference in the perceived quality between frame drops and frame halts. Therefore one model
for single freezes was constructed. According to this model the acceptable freezing time (MOS>3.5) is 0.36 seconds.
Pastrana-Vidal et al. (2004) suggested a relationship between the probability of detection and the duration of the
dropped frames. They also found that it is important to consider not only the duration of the freeze but also the number
of freeze occurrences. Using their relationship between the total duration of the freeze and the number of occurrences,
we propose a model for multiple freezes, based upon our model for single freeze occurrences. A subjective test was
designed to evaluate the performance of the model for multiple freezes. Good performance was found on these data, i.e. a
correlation higher than 0.9.

Public safety practitioners increasingly use video for object recognition tasks. These end users need guidance regarding
how to identify the level of video quality necessary for their application. The quality of video used in public safety
applications must be evaluated in terms of its usability for specific tasks performed by the end user.
The Public Safety Communication Research (PSCR) project performed a subjective test as one of the first in a series to
explore visual intelligibility in video: a user's ability to recognize an object in a video stream given various conditions.
The test sought to measure the effects on visual intelligibility of three scene parameters (target size, scene motion, scene
lighting), several compression rates, and two resolutions (VGA (640x480) and CIF (352x288)). Seven similarly sized
objects were used as targets in nine sets of near-identical source scenes, where each set was created using a different
combination of the parameters under study. Viewers were asked to identify the objects via multiple choice questions.
Objective measurements were performed on each of the scenes, and the ability of the measurement to predict visual
intelligibility was studied.

The subjective tests used to evaluate image and video quality estimators (QEs) are expensive and time consuming.
More problematic, the majority of subjective testing is not designed to find systematic weaknesses in the evaluated
QEs. As a result, a motivated attacker can take advantage of these systematic weaknesses to gain unfair monetary
advantage. In this paper, we draw on some lessons of software testing to propose additional testing procedures
that target a specific QE under test. These procedures supplement, but do not replace, the traditional subjective
testing procedures that are currently used. The goal is to motivate the design of objective QEs which are better
able to accurately characterize human quality assessment.

In this paper we propose a new dataset for evaluation of image/video quality metrics with emphasis on applications in computer graphics. The proposed dataset includes LDR-LDR, HDR-HDR, and HDR-LDR reference-test video pairs with various types of distortions. We also present an example evaluation of recent image and video quality metrics that were applied in the field of computer graphics. In this evaluation all video sequences were shown on an HDR display, and subjects were asked to mark the regions where they saw differences between test and reference videos. As a result, we capture not only the magnitude of distortions, but also their spatial distribution. This has two advantages: on one hand the local quality information is valuable for computer graphics applications, on the other hand the subjectively obtained distortion maps are easily comparable to the maps predicted by quality metrics.

Several attempts to integrate visual saliency information in quality metrics have been described in the literature, albeit with
contradictory results. The way saliency is integrated in quality metrics should reflect the mechanisms underlying the
interaction between image quality assessment and visual attention. This interaction is actually two-fold: (1) image
distortions can attract attention away from the Natural Scene Saliency (NSS), and (2) the quality assessment task in itself
can affect the way people look at an image. A subjective study was performed to analyze the deviation in attention from
NSS as a consequence of being asked to assess the quality of distorted images, and, in particular, whether, and if so how,
this deviation depended on the kind and/or amount of distortion. Saliency maps were derived from eye-tracking data
obtained while subjects scored distorted images, and they were compared to the corresponding NSS, derived from
eye-tracking data obtained while subjects freely viewed high-quality images. The study revealed some structural differences between the
NSS maps and the ones obtained during quality assessment of the distorted images. These differences were related to the
quality level of the images; the lower the quality, the higher the deviation from the NSS was. The main change was
identified as a shrinking of the region of interest, being most evident at low quality. No evident role for the kind of
distortion in the change in saliency was found. Especially at low quality, the quality assessment task seemed to prevail
over natural attention, forcing gaze to deviate in order to better evaluate the impact of artifacts.

We tracked the points-of-gaze of human observers as they viewed videos drawn from foreign films while engaged
in two different tasks: (1) Quality Assessment and (2) Summarization. Each video was subjected to one of three
distortion severities (no compression, i.e. pristine; low compression; and high compression) using the H.264
compression standard. We have analyzed these eye-movement locations in detail. We extracted local statistical
features around points-of-gaze and used them to answer the following questions: (1) Are there statistical differences
in the variances of points-of-gaze across videos between the two tasks? (2) Does the variance in eye movements
indicate a change in viewing strategy with a change in distortion severity? (3) Are statistics at points-of-gaze different
from those at random locations? (4) How do local low-level statistics vary across tasks? (5) How do
point-of-gaze statistics vary across distortion severities within each task?

Utility estimators predict the usefulness or utility of a distorted natural image when used as a surrogate for a
reference image. They differ from quality estimators in that they should provide accurate estimates even when
images are extremely visibly distorted relative to the original, yet are still sufficient for the task. Our group has
previously proposed the Natural Image Contour Evaluation (NICE) utility estimator. NICE estimates perceived
utility by comparing morphologically dilated binary edge maps of the reference and distorted images using the
Hamming distance.
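As a rough illustration of this comparison step, the core of NICE can be sketched as follows. The edge detector, the dilation radius, and any normalization used by the actual estimator are not specified here and are assumptions; only the dilate-then-Hamming structure comes from the description above.

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation with a (2r+1) x (2r+1) square structuring
    element, implemented with array shifts (no SciPy dependency)."""
    out = mask.copy()
    h, w = mask.shape
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            src = mask[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
            out[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] |= src
    return out

def nice_distance(edges_ref, edges_dist, r=2):
    """Hamming distance between dilated binary edge maps, the core
    comparison in NICE (sketch; the radius r is an illustrative choice)."""
    return int(np.count_nonzero(dilate(edges_ref, r) ^ dilate(edges_dist, r)))
```

The dilation makes the comparison tolerant to small edge displacements: identical maps yield zero, and an edge shifted by less than the dilation radius produces only a small boundary disagreement rather than a full mismatch.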
This paper investigates perceptually inspired approaches to evaluating the degradation of image contours in
natural images for utility estimation. First, the distance transform is evaluated as an alternative to the Hamming
distance measure in NICE. Second, we introduce the image contour fidelity (ICF) computational model that is
compatible with any block-based quality estimator. The ICF pools weighted fidelity degradations across image
blocks with weights based on the local contour strength of an image block, and allows quality estimators to be
repurposed as utility estimators.
The performances of these approaches were evaluated on the CU-Nantes and CU-ObserverCentric databases,
which provide perceived utility scores for a collection of distorted images. While the distance transform provides
an improvement over the Hamming distance, the ICF model shows greater promise. The performances of common
fidelity estimators for utility estimation are substantially improved when they are used in the ICF computational
model. This suggests that the utility estimation problem can be recast as a problem of fidelity estimation on
image contours.
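The pooling step of the ICF model described above might be sketched like this; the block scores and the contour-strength weighting are illustrative, and the actual ICF definition may differ in how weights are derived.

```python
import numpy as np

def icf_pool(block_fidelities, contour_strengths):
    """Pool per-block fidelity degradations with weights proportional
    to local contour strength (sketch of the ICF pooling idea)."""
    w = np.asarray(contour_strengths, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, np.asarray(block_fidelities, dtype=float)))
```

Blocks with strong contours dominate the pooled score, which is how a block-based quality estimator is repurposed toward utility.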

Early visual processing as a method to speed up computations on visual input data has long been
discussed in the computer vision community. The general aim of such approaches is to filter irrelevant
information out before it reaches the costly higher-level visual processing algorithms. By inserting this
additional filter layer, the overall approach can be sped up without actually changing the visual
processing methodology.
Being inspired by the layered architecture of the human visual processing apparatus, several approaches
for early visual processing have been recently proposed. Most promising in this field is the
extraction of a saliency map to determine regions of current attention in the visual field. Such saliency
can be computed in a bottom-up manner, i.e. the theory claims that static regions of attention emerge
from a certain color footprint, and dynamic regions of attention emerge from connected blobs of textures
moving in a uniform way in the visual field. Top-down saliency effects are either unconscious through
inherent mechanisms like inhibition-of-return, i.e. within a period of time the attention level paid to
a certain region automatically decreases if the properties of that region do not change, or volitional
through cognitive feedback, e.g. if an object moves consistently in the visual field. These bottom-up
and top-down saliency effects have been implemented and evaluated in a previous computer vision
system for the project JAST.
In this paper an extension applying evolutionary processes is proposed. The prior vision system utilized
multiple threads to analyze the regions of attention delivered from the early processing mechanism.
Here, in addition, multiple saliency units are used to produce these regions of attention. All of these
saliency units have different parameter-sets. The idea is to let the population of saliency units create
regions of attention, then evaluate the results with cognitive feedback and finally apply the genetic
mechanism: mutation and cloning of the best performers and extinction of the worst performers with respect to
their computed regions of attention. A fitness function can be derived by evaluating whether
relevant objects are found in the regions created.
Various experiments show that the approach significantly speeds up visual processing,
especially for robust real-time object recognition, compared to an approach that does not use saliency-based
preprocessing. Furthermore, the evolutionary algorithm improves the overall performance of
the preprocessing system in terms of quality, as the system automatically and autonomously tunes
the saliency parameters. The computational overhead produced by periodical clone/delete/mutate
operations can be handled well within the real-time constraints of the experimental computer vision
system. Nevertheless, limitations apply whenever the visual field does not contain any significant
saliency information for some time while the population still tries to tune the parameters; in this case
overfitting prevents generalization, and the evolutionary process may need to be reset by manual intervention.
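The clone/mutate/extinguish cycle over saliency units described above amounts to one generation of a simple evolutionary loop. In this sketch, the fitness and mutation functions are placeholders for the system's cognitive-feedback evaluation and parameter perturbation; none of the names come from the paper.

```python
import random

def next_generation(population, fitness, mutate, survive_frac=0.5):
    """One generation over saliency parameter sets: keep the best
    performers, drop the worst, and refill the population with
    mutated clones of survivors (illustrative sketch)."""
    ranked = sorted(population, key=fitness, reverse=True)
    n_keep = max(1, int(len(ranked) * survive_frac))
    survivors = ranked[:n_keep]
    children = [mutate(random.choice(survivors))
                for _ in range(len(population) - n_keep)]
    return survivors + children
```

Because the best performers are cloned rather than replaced, the maximum fitness in the population never decreases from one generation to the next.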

The saliency of an image or video region indicates how likely it is that the viewer of the image or video fixates
that region due to its conspicuity. An intriguing question is how we can change the video region to make it more
or less salient. Here, we address this problem by using a machine learning framework to learn from a large set
of eye movements collected on real-world dynamic scenes how to alter the saliency level of the video locally. We
derive saliency transformation rules by performing spatio-temporal contrast manipulations (on a spatio-temporal
Laplacian pyramid) on the particular video region. Our goal is to improve visual communication by designing
gaze-contingent interactive displays that change, in real time, the saliency distribution of the scene.

The relationship between attention and consciousness is a close one, leading many scholars to
conflate the two. However, recent research has slowly corroded a belief that selective attention and
consciousness are so tightly entangled that they cannot be individually examined.

Contrast sensitivity has been extensively studied over the last decades and there are well-established models of
early vision that were derived by presenting the visual system with synthetic stimuli such as sine-wave gratings
near threshold contrasts. Natural scenes, however, contain a much wider distribution of orientations, spatial
frequencies, and both luminance and contrast values. Furthermore, humans typically move their eyes two to
three times per second under natural viewing conditions, but most laboratory experiments require subjects to
maintain central fixation. We here describe a gaze-contingent display capable of performing real-time contrast
modulations of video in retinal coordinates, thus allowing us to study contrast sensitivity when dynamically
viewing dynamic scenes. Our system is based on a Laplacian pyramid for each frame that efficiently represents
individual frequency bands. Each output pixel is then computed as a locally weighted sum of pyramid levels to
introduce local contrast changes as a function of gaze. Our GPU implementation achieves real-time performance
with more than 100 fps on high-resolution video (1920 by 1080 pixels) and a synthesis latency of only 1.5ms.
Psychophysical data show that contrast sensitivity is greatly decreased in natural videos and under dynamic
viewing conditions. Synthetic stimuli therefore only poorly characterize natural vision.
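A minimal single-frame sketch of this kind of band-wise synthesis is shown below. For simplicity it uses a Laplacian *stack* built from crude box blurs rather than the paper's subsampled pyramid with proper filters, but it preserves the key property: each output pixel is a weighted sum of band-pass levels, and with all gains equal to one the input frame is reconstructed exactly.

```python
import numpy as np

def box_blur(img, r):
    """Crude (2r+1)^2 box blur with edge padding."""
    h, w = img.shape
    pad = np.pad(img, r, mode='edge')
    out = np.zeros((h, w), dtype=float)
    k = 2 * r + 1
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + h, dx:dx + w]
    return out / (k * k)

def laplacian_stack(img, levels=3):
    """Band-pass levels plus a low-pass residual; by construction the
    levels sum back to the original image (telescoping sum)."""
    bands, cur = [], img.astype(float)
    for i in range(levels):
        low = box_blur(cur, 2 ** i)
        bands.append(cur - low)
        cur = low
    bands.append(cur)
    return bands

def synthesize(bands, gains):
    """Weighted recombination; gains may be scalars or per-pixel maps
    (the gaze-contingent case, with gains computed from gaze position)."""
    return sum(g * b for g, b in zip(gains, bands))
```

In the gaze-contingent setting each gain would be a per-pixel map derived from the current gaze position, so contrast in a chosen frequency band can be raised or lowered locally in retinal coordinates.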

The history of eye-movement research extends back at least to 1794, when Erasmus Darwin (Charles' grandfather)
published Zoonomia, including descriptions of eye movements due to self-motion. But research on eye movements was
restricted to the laboratory for 200 years, until Michael Land built the first wearable eyetracker at the University of
Sussex and published the seminal paper "Where we look when we steer" [1]. In the intervening centuries, we learned a
tremendous amount about the mechanics of the oculomotor system and how it responds to isolated stimuli, but virtually
nothing about how we actually use our eyes to explore, gather information, navigate, and communicate in the real world.
Inspired by Land's work, we have been working to extend knowledge in these areas by developing hardware, algorithms,
and software that have allowed researchers to ask questions about how we actually use vision in the real world. Central
to that effort are new methods for analyzing the volumes of data that come from the experiments made possible by the
new systems. We describe a number of recent experiments and SemantiCode, a new program that supports assisted
coding of eye-movement data collected in unrestricted environments.

What is the representation in early vision? Considerable research has demonstrated that the representation is not equally
faithful throughout the visual field; representation appears to be coarser in peripheral and unattended vision, perhaps as a
strategy for dealing with an information bottleneck in visual processing. In the last few years, a convergence of evidence
has suggested that in peripheral and unattended regions, the information available consists of local summary statistics.
Given a rich set of these statistics, many attributes of a pattern may be perceived, yet precise location and configuration
information is lost in favor of the statistical summary. This representation impacts a wide range of visual tasks,
including peripheral identification, visual search, and visual cognition of complex displays. This paper discusses the
implications for understanding visual perception, as well as for imaging applications such as information visualization.

David Marr famously defined vision as "knowing what is where by seeing". In the framework described here, attention is
the inference process that solves the visual recognition problem of what is where. The theory proposes a computational
role for attention and leads to a model that performs well in recognition tasks and that predicts some of the main properties
of attention at the level of psychophysics and physiology. We propose an algorithmic implementation, a Bayesian network,
that can be mapped onto the basic functional anatomy of attention involving the ventral stream and the dorsal stream. This
description integrates bottom-up, feature-based as well as spatial (context based) attentional mechanisms. We show that
the Bayesian model predicts well human eye fixations (considered as a proxy for shifts of attention) in natural scenes, and
can improve accuracy in object recognition tasks involving cluttered real world images. In both cases, we found that the
proposed model can predict human performance better than existing bottom-up and top-down computational models.

This study aims to promote the cubic effect by reproducing images with depth perception using chromostereopsis in
human visual perception. From psychophysical experiments based on the theory that the cubic effect depends on the
lightness of the background in the chromostereoptic effect and the chromostereoptic reversal effect, it was found that the
luminous cubic effect differs depending on the lightness of the background and the hue combination of the neighboring
colors.
Based on the results of the experiment, the cubic-effect-enhancing algorithm classifies the input image into
foreground, middle, and background layers according to depth. For each classified layer, the color factors that
were identified through the psychophysical experiments are adaptively controlled to produce an enhanced cubic
effect appropriate to the properties of human visual perception and the characteristics of the input image.

Field-Sequential Color (FSC) displays have been discussed for a long time. Their main concept is to remove the color
filter so as to increase the light transmittance of an LCD panel. However, FSC displays have a major problem: color
break-up (CBU). Moreover, it is difficult to quantify CBU during saccadic eye movements, because the phenomenon
occurs as quickly as a flash during a saccade, and there are individual variations in perceiving CBU.
Some previous studies have presented assessments of saccadic CBU, but have not indicated the detection and allowance
thresholds of target size in horizontal saccadic eye movements. We therefore conducted psychophysical experiments
on an FSC display driven at sub-frame frequencies of 240 Hz-1440 Hz (each frame consists of red, green, and blue
sub-frames). We employed a simple stimulus, a static white bar of variable width. We tasked ten
subjects with a fixed saccade length of 58.4 visual degrees in horizontal eye movements, at a fixed target luminance of
15.25 cd/m2. We used the PEST method to find detection and allowance thresholds of white-bar width for saccadic
CBU. This paper provides correlations between target sizes and sub-frame frequencies of an FSC display device, and
proposes an easy method for evaluating perceived saccadic CBU on FSC displays.

The issue of reading on electronic devices is becoming increasingly important as the popularity of mobile devices, such as cell phones or
PDAs, increases. In this study, we used the spatial summation paradigm to measure the spatial constraints for text detection.
Four types of stimuli (real characters, non-characters, Jiagu and scrambled lines) were used in the experiments. All
characters we used had two components in a left-right configuration. A non-character was constructed by swapping the left
and right components of a real character in position to render it unpronounceable. The Jiagu characters were ancient texts
and have the same left-right configuration as the modern Chinese characters, but contain no familiar components. Thus,
the non-characters keep the components while destroying the spatial configuration between them, and the Jiagu characters
have no familiar components while keeping the spatial configuration intact. The detection thresholds for the same stimulus size
and the same eccentricity were the same for all types of stimuli. When the text size is small, the detection threshold of a
character decreased with increasing size, with a slope of -1/2 on log-log coordinates, up to a critical size, at all
eccentricities and for all stimulus types. Sensitivity for all types of stimuli increased from peripheral to central
vision. In conclusion, detectability is based on local feature analysis regardless of character type. The cortical
magnification parameter, E2, is 0.82 degrees of visual angle. With this information, we can estimate the detectability of
a character from its size and eccentricity.
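Under the conventional E2 formulation (an assumption here: the foveal critical size doubles at eccentricity E2, i.e. F(E) = F(0) * (1 + E/E2)), the size scaling implied by the reported E2 = 0.82 degrees can be computed directly:

```python
def size_scaling(ecc_deg, e2=0.82):
    """Factor by which stimulus size must grow at eccentricity ecc_deg
    (in degrees) to match foveal detectability, using the conventional
    F(E) = F(0) * (1 + E/E2) form with the study's E2 = 0.82 deg."""
    return 1.0 + ecc_deg / e2
```

For example, at 8.2 degrees eccentricity a character would need to be about 11 times its foveal critical size to be equally detectable, under this formulation.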

We have developed a mobile vision assistive device based on a head mounted display (HMD) with a video camera,
which provides image magnification and contrast enhancement for patients with central field loss (CFL). Because the
exposure level of the video camera is usually adjusted according to the overall luminance of the scene, the contrast of
sub-images (to be magnified) may be low. We found that at high magnification levels, conventional histogram
enhancement methods frequently result in over- or under-enhancement due to the irregular histogram distribution of
sub-images. Furthermore, the histogram range of the sub-images may change dramatically when the camera moves, which
may cause flickering. A piece-wise histogram stretching method based on a center-emphasized histogram is proposed
and evaluated by observers. The center-emphasized histogram minimizes histogram fluctuation due to image changes
near the image boundary when the camera moves slightly, and therefore reduces flickering after enhancement. A
piece-wise histogram stretching function is implemented by including a gain turnaround point to deal with very low
contrast images and reduce the possibility of over-enhancement. Six normally sighted subjects and a CFL patient were
tested for their preference of images enhanced by the conventional and proposed methods as well as the original images.
All subjects preferred the proposed enhancement method over the conventional method.
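The two-segment idea, a higher gain below a turnaround point and then a reduced slope so the output range is not exceeded, might look like the sketch below. The specific turnaround value and gain are illustrative choices, not the paper's parameters.

```python
import numpy as np

def piecewise_stretch(v, turn=64, gain=2.0, vmax=255.0):
    """Piece-wise linear stretching with a gain turnaround point:
    slope `gain` up to input level `turn`, then a reduced slope chosen
    so that vmax still maps to vmax (illustrative parameters)."""
    v = np.asarray(v, dtype=float)
    knee = gain * turn  # output level reached at the turnaround point
    high = knee + (v - turn) * (vmax - knee) / (vmax - turn)
    return np.clip(np.where(v <= turn, gain * v, high), 0.0, vmax)
```

Capping the second segment at vmax is what limits over-enhancement for very low contrast sub-images: dark values get the full gain, while the bright tail is compressed instead of clipped.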

Real-time videoconferencing using cellular devices provides natural communication to the Deaf community. For
this application, compressed American Sign Language (ASL) video must be evaluated in terms of the intelligibility
of the conversation and not in terms of the overall aesthetic quality of the video. This work presents a paired
comparison experiment to determine the subjective preferences of ASL users in terms of the trade-off between
intelligibility and quality when varying the proportion of the bitrate allocated explicitly to the regions of the
video containing the signer. A rate-distortion optimization technique, which jointly optimizes a quality criterion
and an intelligibility criterion according to a user-specified parameter, generates test video pairs for the subjective
experiment. Experimental results suggest that at sufficiently high bitrates, all users prefer videos in which the
non-signer regions in the video are encoded with some nominal rate. As the total encoding bitrate decreases,
users generally prefer video in which a greater proportion of the rate is allocated to the signer. The specific
operating points preferred in the quality-intelligibility trade-off vary with the demographics of the users.
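The quality-intelligibility trade-off can be illustrated with a brute-force search over the fraction of bits allocated to the signer region, minimizing a weighted sum of the two distortions. The toy rate-distortion curves and the weighting parameter alpha below are placeholders for the paper's actual criteria.

```python
def best_signer_fraction(total_rate, d_signer, d_background, alpha, steps=200):
    """Fraction of the total bitrate allocated to the signer region
    that minimizes alpha * D_signer + (1 - alpha) * D_background
    (illustrative sketch, not the paper's optimizer)."""
    def cost(i):
        f = i / steps
        return (alpha * d_signer(total_rate * f)
                + (1 - alpha) * d_background(total_rate * (1 - f)))
    return min(range(steps + 1), key=cost) / steps
```

With identical convex distortion curves for both regions, weighting the signer more heavily (larger alpha) shifts the optimum toward giving the signer a larger share of the bits, matching the preference reported at low total bitrates.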

This paper discusses a simulation of a model, presented three years ago at this conference, of a neuron as a
micro-machine for doing metaphor by cognitive blending. The background of the model is given, the difficulties of
building such a model are discussed, and a description of the simulation is given based on texture-synthesis
structures and texture patches. These are glued together using Formal Concept Analysis. Because of this, and
because of the intertwining of hyperbolic and Euclidean geometry and local activation, an interesting
fundamental connection between analogical processing and glial and neural processing is discovered.

A picture" is a at object covered with pigments in a certain pattern. Human observers, when looking "into" a
picture (photograph, painting, drawing, . . . say) often report to experience a three-dimensional "pictorial space."
This space is a mental entity, apparently triggered by so called pictorial cues. The latter are sub-structures of
color patterns that are pre-consciously designated by the observer as "cues," and that are often considered to
play a crucial role in the construction of pictorial space. In the case of the visual arts these structures are
often introduced by the artist with the intention to trigger certain experiences in prospective viewers, whereas
in the case of photographs the intentionality is limited to the viewer. We have explored various methods to
operationalize geometrical properties, typically relative to some observer perspective. Here perspective" is
to be understood in a very general, not necessarily geometric sense, akin to Gombrich's beholder's share".
Examples include pictorial depth, either in a metrical, or a mere ordinal sense. We nd that different observers
tend to agree remarkably well on ordinal relations, but show dramatic differences in metrical relations.