Significance

Where do people look in images? Predicting eye movements from images is an active field of study, with more than 50 quantitative prediction models competing to explain scene viewing behavior. Yet the rules for this competition are unclear. Using a principled metric for model comparison (information gain), we quantify progress in the field and show how formulating the models probabilistically resolves discrepancies in other metrics. We have also developed model assessment tools to reveal where models fail on the database, image, and pixel levels. These tools will facilitate future advances in saliency modeling and are made freely available in an open source software framework (www.bethgelab.org/code/pysaliency).

Abstract

Learning the properties of an image associated with human gaze placement is important both for understanding how biological systems explore the environment and for computer vision applications. There is a large literature on quantitative eye movement models that seeks to predict fixations from images (sometimes termed “saliency” prediction). A major problem known to the field is that existing model comparison metrics give inconsistent results, causing confusion. We argue that the primary reason for these inconsistencies is that different metrics and models use different definitions of what a “saliency map” entails. For example, some metrics expect a model to account for image-independent central fixation bias whereas others will penalize a model that does. Here we bring saliency evaluation into the domain of information by framing fixation prediction models probabilistically and calculating information gain. We jointly optimize the scale, the center bias, and spatial blurring of all models within this framework. Evaluating existing metrics on these rephrased models produces almost perfect agreement in model rankings across the metrics. Model performance is separated from center bias and spatial blurring, avoiding the confounding of these factors in model comparison. We additionally provide a method to show where and how models fail to capture information in the fixations on the pixel level. These methods are readily extended to spatiotemporal models of fixation scanpaths, and we provide a software package to facilitate their use.

Humans move their eyes about three times/s when exploring the environment, fixating areas of interest with the high-resolution fovea. How do we determine where to fixate to learn about the scene in front of us? This question has been studied extensively from the perspective of “bottom–up” attentional guidance (1), often in a “free-viewing” task in which a human observer explores a static image for some seconds while his or her eye positions are recorded (Fig. 1A). Eye movement prediction is also applied in domains from advertising to efficient object recognition. In computer vision the problem of predicting fixations from images is often referred to as “saliency prediction,” while to others “saliency” refers explicitly to some set of low-level image features (such as edges or contrast). In this paper we are concerned with predicting fixations from images, taking no position on whether the features that guide eye movements are “low” or “high” level.

Evaluation of fixation prediction models in terms of information. (A, Upper) Two example images with fixation locations (black points) and scanpaths (red). (A, Lower) Corresponding fixation predictions from an example model (AIM). Warmer colors denote more expected fixations. (B) Model rankings by seven metrics on the MIT Saliency Benchmark. Models are arranged along the x axis, ordered by “AUC-Judd” performance (highest-performing model to the right). Relative performance (y axis) shows each metric rescaled by baseline (0) and gold standard (1; higher is better). If the metrics gave consistent rankings, all colored lines would monotonically increase. (C) Different model comparison metrics evaluated on the raw model predictions (as in the MIT Benchmark), compared with information gain explained. Each color corresponds to a different metric (see key); each model forms a distinct column. The gray diagonal line shows where a metric would lie if it was linear in information. Many metrics are nonmonotonically related to information, explaining ranking inconsistencies in B. (C, Inset) Pearson (below diagonal) and Spearman (above diagonal) correlation coefficients in relative performance under the different metrics. (D) The same as C but for model predictions converted to probability densities, accounting for center bias and blurring. All metrics are now approximately monotonically related to information gain explained; correlations in relative performance between metrics are now uniformly high (D, Inset). Note that information gain is the only metric that is linear, because all metrics must converge to the gold standard model at (1, 1). (E) How close is the field to understanding image-based fixation prediction? Each model evaluated in the current study is arranged on the x axis in order of information gain explained. The best-performing model (eDN) explains about one-third of the information in the gold standard.

The field of eye movement prediction is quite mature: Beginning with the influential model of Itti et al. (1), there are now over 50 quantitative fixation prediction models, including around 10 models that seek to incorporate “top–down” effects (see refs. 2–4 for recent reviews and analyses of this extensive literature). Many of these models are designed to be biologically plausible whereas others aim purely at prediction (e.g., ref. 5). Progress is measured by comparing the models in terms of their prediction performance, under the assumption that better-performing models must capture more information that is relevant to human behavior.

How close are the best models to explaining fixation distributions in static scene eye guidance? How close is the field to understanding image-based fixation prediction? To answer this question requires a principled distance metric, yet no such metric exists. There is significant uncertainty about how to compare saliency models (3, 6–8). A visit to the well-established MIT Saliency Benchmark (saliency.mit.edu) allows the reader to order models by seven different metrics. These metrics can vastly change the ranking of the models, and there is no principled reason to prefer one metric over another. Indeed, a recent paper (7) compared 12 metrics, concluding that researchers should use 3 of them to avoid the pitfalls of any one. Following this recommendation would mean comparing fixation prediction models is inherently ambiguous, because it is impossible to define a unique ranking if any two of the considered rankings are inconsistent.

Because no comparison of existing metrics can tell us how close we are, we instead advocate a return to first principles. We show that evaluating fixation prediction models in a probabilistic framework can reconcile ranking discrepancies between many existing metrics. By measuring information directly we show that the best model evaluated here (state of the art as of October 2014) explains only 34% of the explainable information in the dataset we use.

Results

Information Gain.

Fixation prediction is operationalized by measuring fixation densities. If different people view the same image, they will place their fixations in different locations. Similarly, the same person viewing the same image again will make different eye movements than they did the first time. It is therefore natural to consider fixation placement as a probabilistic process.

The performance of a probabilistic model can be assessed using information theory. As originally shown by Shannon (9), information theory provides a measure, information gain, to quantify how much better a posterior predicts the data than a prior. In the context of fixation prediction, this quantifies how much better an image-based model predicts the fixations on a given image than an image-independent baseline.

Information gain is measured in bits. To understand this metric intuitively, imagine a game of 20 questions in which a model is asking yes/no questions about the location of a fixation in the data. The model’s goal is to specify the location of the fixation. If model A needs one question less than model B on average, then model A’s information gain exceeds model B’s information gain by one bit. If a model needs exactly as many questions as the baseline, then its information gain is zero bits. The number of questions the model needs is related to the concept of code length: Information gain is the difference in the average code length between a model and the baseline. Finally, information gain can also be motivated from the perspective of model comparison: It is the logarithm of the Bayes factor of the model and the baseline, divided by the number of data points. That is, if the information gain exceeds zero, then the model is more likely than the baseline.

Formally, if $\hat{p}_A(x_i, y_i \mid I_i)$ is the probability that model A assigns to a fixation in location $(x_i, y_i)$ when image $I_i$ is viewed, and $p_{bl}(x_i, y_i)$ is the probability of the baseline model for this fixation, then the information gain of model A with respect to the image-independent baseline is $\frac{1}{N}\sum_{i=1}^{N}\left[\log \hat{p}_A(x_i, y_i \mid I_i) - \log p_{bl}(x_i, y_i)\right]$ (to be precise, this value is the estimated expected information gain). Although information gain can be rewritten in terms of Kullback–Leibler (KL) divergence, our approach is fundamentally different from how KL divergence has previously been used to compare saliency models (SI Text, Kullback–Leibler Divergence).
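The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the pysaliency implementation: the function name `information_gain` and the toy probabilities are our own, and the inputs are assumed to be the probabilities that the model and the baseline assign to each observed fixation.

```python
import math


def information_gain(model_probs, baseline_probs):
    """Estimated information gain of a model over a baseline, in bits/fixation.

    model_probs[i] and baseline_probs[i] are the probabilities that the model
    and the baseline assign to the i-th observed fixation.
    """
    if not model_probs or len(model_probs) != len(baseline_probs):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(math.log2(p) - math.log2(q)
                for p, q in zip(model_probs, baseline_probs))
    return total / len(model_probs)


# Toy data (made up): the model gives every fixation twice the baseline
# probability, so it saves about one yes/no question per fixation.
toy_model = [0.02, 0.04, 0.01]
toy_baseline = [0.01, 0.02, 0.005]
toy_ig = information_gain(toy_model, toy_baseline)
```

A model that assigns exactly the baseline probabilities yields zero bits, and a model that assigns lower probabilities than the baseline yields a negative value.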

For image-based fixation prediction, information gain quantifies the reduction in uncertainty (intuitively, the scatter of predicted fixations) in where people look, given knowledge of the image they are looking at. To capture the image-independent structure in the fixations in a baseline model, we use a 2D histogram of all fixations cross-validated between images: How well can the fixations on one image be predicted from fixations on all other images?

In addition to being principled, information gain is an intuitive model comparison metric because it is a ratio scale. First, like the distance between two points, in a ratio-scaled metric “zero” means the complete absence of the quantity (in this case, no difference in code length from baseline). Second, a given change in the scale means the same thing no matter the absolute values. That is, it is meaningful to state relationships such as “the difference in information gain between models A and B is twice as big as the difference between models C and D.” Many existing metrics, such as the area under the ROC curve (AUC), do not meet these criteria.

To know how well models predict fixation locations, relative to how they could perform given intersubject variability, we want to compare model information gain to some upper bound. To estimate the information gain of the true fixation distribution, we use a nonparametric gold standard model: How well can the fixations of one subject be predicted by all other subjects’ fixations? This gold standard captures the explainable information gain for image-dependent fixation patterns for the subjects in our dataset, ignoring additional task- and subject-specific information (we examine this standard further in SI Text, Gold Standard Convergence and Fig. S1). By comparing the information gain of models to this explainable information gain, we determine the proportion of explainable information gain explained. Like variance explained in linear Gaussian regression, this quantity tells us how much of the explainable information gain a model captures. Negative values mean that a model performs even worse than the baseline.
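As a small sketch of this normalization, the proportion explained is the ratio of two information gains measured against the same baseline. The function name and the numbers below are hypothetical and chosen only for illustration.

```python
def information_gain_explained(ig_model, ig_gold):
    """Fraction of the explainable information gain that a model captures."""
    if ig_gold <= 0:
        raise ValueError("the gold standard must beat the baseline")
    return ig_model / ig_gold


# Made-up values in bits/fixation, both relative to the same baseline.
explained = information_gain_explained(0.5, 2.0)
worse_than_baseline = information_gain_explained(-0.25, 2.0)
```

The first call gives one quarter of the explainable information gain; the second is negative because that hypothetical model performs worse than the baseline.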

Dependence of gold standard performance on the number of subjects used to predict one subject’s data.

Reconciling the Metrics.

Now that we have defined a principled and intuitive scale on which to compare models we can assess to what extent existing metrics align with this scale. In Fig. 1B we show the relative performance on all metrics for all saliency models listed on the MIT Saliency Benchmark website as of February 25, 2015. If all metrics gave consistent rankings, all colored lines would monotonically increase. They clearly do not, highlighting the problem with existing metrics.

Fig. 1C shows how the fixation prediction models we evaluate in this paper perform on eight popular fixation prediction metrics (colors) and information gain explained. As in Fig. 1B, the metrics are inconsistent with one another. This impression is confirmed in Fig. 1C, Inset, showing Pearson (below the diagonal) and Spearman (above the diagonal) correlation coefficients. If the metrics agreed perfectly, this plot matrix would be red. When considered relative to information gain explained, the other metrics are generally nonmonotonic and inconsistently scaled.

Why is this the case? The primary reason for the inconsistencies in Fig. 1 B and C is that both the models and the metrics use different definitions of the meaning of a saliency map (the spatial fixation prediction). For example, the “AUC wrt. uniform” metric expects the model to account for the center bias (a bias in free-viewing tasks to fixate near the center of the image), whereas “AUC wrt. center bias” expects the model to ignore the center bias (10). Therefore, a model that accounts for the center bias is penalized by AUC wrt. center bias whereas a model that ignores the center bias is penalized by AUC wrt. uniform. The rankings of these models will likely change between the metrics, even if they had identical knowledge about the image features that drive fixations.

To overcome these inconsistencies we phrased all models probabilistically, fitting three independent factors. We transformed the (often arbitrary) model scale into a density, accounted for the image-independent center bias in the dataset, and compensated for overfitting by applying spatial blurring. We then reevaluated all metrics on these probabilistic models. This yields far more consistent outcomes between the metrics (Fig. 1D). The metrics are now monotonically related to information gain explained, creating mostly consistent model rankings (compare the correlation coefficient matrices in Fig. 1 C and D, Insets).

Nevertheless, Fig. 1D also highlights one additional, critical point. All model relative performances must reconverge to the gold standard performance at (1, 1). That all existing metrics diverge from the unity diagonal means that these metrics remain nonlinear in information gain explained. This creates problems in comparing model performance. If we are interested in the information that is explained, then information gain is the only metric that can answer this question in an undistorted way.

How Close Is the Field to Understanding Image-Based Fixation Prediction?

We have shown above that a principled definition of fixation prediction serves to reconcile ranking discrepancies between existing metrics. Information gain explained also tells us how much of the information in the data is accounted for by the models. That is, we can now provide a principled answer to the question, “How close is the field to understanding image-based fixation prediction?”.

Fig. 1E shows that the best-performing model we evaluate here, ensemble of deep networks (eDN), accounts for about 34% of the explainable information gain, which is 1.21 bits per fixation (bits/fix) in this dataset (SI Text, Model Performances as Log-Likelihoods and Fig. S2). These results highlight the importance of using an intuitive evaluation metric: As of October 2014, there remained a significant amount of information that image-based fixation prediction models could explain but did not.

Average log-likelihoods of all tested models as differences in log-likelihood compared with the maximum-entropy model predicting a uniform fixation distribution. Model performance indicates the model performance if only the nonlinearity has been fitted. Centerbias and blur+centerbias indicate the model performances if the centerbias alone or the blur and centerbias have been fitted together with the nonlinearity.

Information Gain in the Pixel Space.

The probabilistic framework for model comparison we propose above has an additional advantage over existing metrics: The information gain of a model can be evaluated at the level of pixels (Table S1). We can examine where and by how much model predictions fail.

This procedure is schematized in Fig. 2. For an example image, the model densities show where the model predicts fixations to occur in the given image (Fig. 2A). This prediction is then divided by the baseline density, yielding a map showing where and by how much the model believes the fixation distribution in a given image is different from the baseline (“image-based prediction”). If the ratio is greater than one, the model predicts there should be more fixations than the center bias expects. The “information gain” images in Fig. 2 quantify how much a given pixel contributes to the model’s performance relative to the baseline (code length saved in bits/fix). Finally, the difference between the model’s information gain and the possible information gain, estimated by the gold standard, is shown in “difference to real information gain”: It shows where and how much (bits) the model wastes information that could be used to describe the fixations more efficiently.

Calculation of information gain in the pixel space. (A) For the hypothetical example image shown (Left), hypothetical fixation densities of the gold standard (“true”) and model predictions are shown in the “density” column. These are divided by the baseline model (prior) to get the “image-based prediction” map. Both maps are then log-transformed and multiplied by the gold standard density to calculate information gain for each pixel. Subtracting the gold standard information gain from the model’s information gain yields a difference map of the possible information gain: that is, where and by how much the model’s predictions fail. In this case, the model overestimates (blue contours) the fixation density in the left (red) spot in the image, underestimates (red contours) the center (green) spot, and predicts the rightmost (yellow) spot almost perfectly. (B) For an example image from the Judd dataset (Left), the pixel space information gains are shown as in A for the gold standard (first row), eDN (second row), BMS (third row), and AIM (fourth row). eDN performs best for the image overall (3.12 bits/fix compared with 2.59 bits/fix and 2.28 bits/fix). By examining the pixel space information gains, we see this is because it correctly assigns large density to the boat, whereas the other models both underestimate the saliency of the boat. For the eDN model, the difference plot shows that it slightly overestimates the saliency of the front of the boat relative to the back.

The advantage of this approach is that we can see not only how much a model fails (on an image or dataset level), but also exactly where it fails, in individual images. This can be used to make informed decisions about how to improve fixation prediction models. In Fig. 2B, we show an example image and the performance of the three best-performing models [eDN, Boolean map-based saliency (BMS), and attention based on information maximization (AIM)]. The pixel space information gains show that the eDN model correctly assigns large density to the boat, whereas the other models both underestimate the saliency of the boat.

To extend this pixel-based analysis to the level of the entire dataset, we display each image in the dataset according to its possible information gain and the percentage of that information gain explained by the eDN model (Fig. 3). In this space, points to the bottom right represent images that contain a lot of explainable information in the fixations that the model fails to capture. Points show all images in the dataset, and for a subset of these we have displayed the image itself. The images in the bottom right of the plot tend to contain human faces. See SI Text, Pixel-Based Analysis on Entire Dataset for an extended version of this analysis including pixel-space information gain plots and a model comparison.

Distribution of information gains and explained information (both relative to a uniform baseline model) over all images in the dataset for the eDN model. Each black circle represents an image from the dataset. These plots allow model performance to be assessed on all images in the dataset. Points in the lower right of the scatterplots are images where a lot of information could be explained but is not; these are where the model could be best improved for a given dataset. See Fig. S5 for an extended version of this plot, including an additional model and pixel-space information gain plots showing where the model predictions fail in individual images.

Discussion

Predicting where people look in images is an important problem, yet progress has been hindered by model comparison uncertainty. We have shown that phrasing fixation prediction models probabilistically and appropriately evaluating their performance cause the disagreement between many existing metrics to disappear. Furthermore, bringing the model comparison problem into the principled domain of information allows us to assess the progress of the field, using an intuitive distance metric. The best-performing model we evaluate here (eDN) explains about 34% of the explainable information gain. More recent model submissions to the MIT Benchmark have significantly improved on this number (e.g., ref. 11). This highlights one strength of information gain as a metric: As model performance begins to approach the gold standard, the nonlinear nature of other metrics (e.g., AUC) causes even greater distortion of apparent progress. The utility of information gain is clear.

To improve models it is useful to know where in images this unexplained information is located. We developed methods not only to assess model performance on a database level, but also to show where and by how much model predictions fail in individual images, on the pixel level (Figs. 2 and 3). We expect these tools will be useful for the model development community, and we provide them in our free software package.

Many existing metrics can be understood as evaluating model performance on a specific task. For example, the AUC is the performance of a model in a two-alternative forced-choice (2AFC) task, “Which of these two points was fixated?”. If this is the task of interest to the user, then AUC is the right metric. Our results do not show that any existing metric is wrong. The metrics do not differ because they capture fundamentally different properties of fixation prediction, but mainly because they do not agree on the definition of “saliency map.” The latter case requires only minor adjustments to move the field forward. This also serves to explain the three metric groups found by Riche et al. (7): One group contains among others AUC with uniform nonfixation distribution (called AUC-Judd by Riche), another group contains AUC with center bias nonfixation distribution (AUC-Borji), and the last group contains image-based KL divergence (KL-Div). We suggest that the highly uncorrelated results of these three groups are due to the fact that one group penalizes models without center bias, another group penalizes models with center bias, and the last group depends on the absolute saliency values. Compensating for these factors appropriately makes the metric results correlate almost perfectly.

Although existing metrics are appropriate for certain use cases, the biggest practical advantage in using a probabilistic framework is its generality. First, once a model is formulated in a probabilistic way many kinds of “task performance” can be calculated, depending on problems of applied interest. For example, we might be interested in whether people will look at an advertisement on a website or whether the top half of an image is more likely to be fixated than the bottom half. These predictions are a simple matter of integrating over the probability distribution. This type of evaluation is not well defined for other metrics that do not define the scale of saliency values. Second, a probabilistic model allows the examination of any statistical moments of the probability distribution that might be of practical interest. For example, Engbert et al. (12) examine the properties of second-order correlations between fixations in scanpaths. Third, information gain allows the contribution of different factors in explaining data variance to be quantified. For example, it is possible to show how much the center bias contributes to explaining fixation data independent of image-based saliency contributions (10) (SI Text, Model Performances as Log-Likelihoods and Fig. S2). Fourth, the information gain is differentiable in the probability density, allowing models to be numerically optimized using gradient techniques. In fact, the optimization is equivalent to maximum-likelihood estimation, which is ubiquitously used for density estimation and fulfills a few simple desiderata for density metrics (13). In some cases other loss functions may be preferable.
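The "top half versus bottom half" question can be sketched as a sum over a discretized density. The example assumes the model prediction is available as a grid of per-pixel probabilities that sum to 1; the helper `region_probability` and the 4 × 4 toy density are illustrative and not part of any published code.

```python
def region_probability(density, in_region):
    """Sum per-pixel probabilities over the pixels where in_region(x, y) holds."""
    return sum(p
               for y, row in enumerate(density)
               for x, p in enumerate(row)
               if in_region(x, y))


# Made-up 4 x 4 density (rows are y, columns are x); entries sum to 1.
toy_density = [
    [0.10, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.10],
    [0.025, 0.025, 0.025, 0.025],
    [0.025, 0.025, 0.025, 0.025],
]
height = len(toy_density)
p_top = region_probability(toy_density, lambda x, y: y < height // 2)
p_bottom = region_probability(toy_density, lambda x, y: y >= height // 2)
```

The same helper works for any region that can be written as a predicate on pixel coordinates, for example the bounding box of an advertisement.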

If we are interested in understanding naturalistic eye movement behavior, free viewing static images is not the most representative condition (14–18). Understanding image-based fixation behavior is not only a question of “where?”, but of “when?” and “in what order?”. It is the spatiotemporal pattern of fixation selection that is increasingly of interest to the field, rather than purely spatial predictions of fixation locations. The probabilistic framework we use in this paper (10, 19) is easily extended to study spatiotemporal effects, by modeling the conditional probability of a fixation given previous fixations (Materials and Methods and ref. 12).

Accounting for the entirety of human eye movement behavior in naturalistic settings will require incorporating information about the task, high-level scene properties, and mechanistic constraints on the eye movement system (12, 15–17, 20–22). Our gold standard contains the influence of high-level (but still purely image-dependent) factors to the extent that they are consistent across observers. Successful image-based fixation prediction models will therefore need to use such higher-level features, combined with task-relevant biases, to explain how image features are associated with the spatial distribution of fixations over scenes.

Materials and Methods

Image Dataset and Fixation Prediction Models.

We use a subset of a popular benchmarking dataset (MIT-1003) (23) to compare and evaluate fixation prediction models. We used only the most common image size (1,024 × 768 px), resulting in 463 images included in the evaluation. We have verified our results in a second dataset of human fixations (24) (SI Text, Kienzle Dataset and Fig. S3).

Average log-likelihoods of all tested models on the Kienzle dataset as in Fig. S2.

We evaluated all models considered in ref. 25 and the top-performing models added to the MIT Saliency Benchmarking website (saliency.mit.edu) up to October 2014. For all models, the original source code and default parameters have been used unless stated otherwise. The included models are Itti et al. (1) [here, two implementations have been used: one from the Saliency Toolbox and the variant specified in the graph-based visual saliency (GBVS) paper], Torralba et al. (26), GBVS (27), saliency using natural statistics (SUN) (28) (for “SUN, original” we used a scale parameter of 0.64, corresponding to the pixel size of 2.3′ of visual angle of the dataset used to learn the filters; for “SUN, optimal” we did a grid search over the scale parameter; this resulted in a scale parameter of 0.15), Kienzle et al. (24, 29) (patch size 195 pixels corresponding to their reported optimal patch size of 5.4°), Hou and Zhang (30), AIM (31), Judd et al. (23), context-aware saliency (32, 33), visual saliency estimation by nonlinearly integrating features using region covariances (CovSal) (34), multiscale rarity-based saliency detection (RARE2012) (35), BMS (5, 36), and finally eDN (37). Table S2 specifies the source code used for each model.

Information Gain and Comparison Models.

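Given fixations $(x_i, y_i)$ on images $I_i$ and predictions of a probabilistic model $\hat{p}(x, y \mid I)$, the average log-likelihood for the data is $\frac{1}{N}\sum_{i=1}^{N}\log \hat{p}(x_i, y_i \mid I_i)$ and the information gain with respect to an image-independent baseline model $p_{bl}(x, y)$ is

$$\mathrm{IG}(\hat{p} \,\|\, p_{bl}) = \frac{1}{N}\sum_{i=1}^{N}\left[\log \hat{p}(x_i, y_i \mid I_i) - \log p_{bl}(x_i, y_i)\right].$$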

In this paper we use the logarithm to base 2, meaning that information gain is in bits. Model comparison within the framework of likelihoods is well defined and the standard of any statistical model comparison enterprise.
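The Bayes-factor reading of information gain mentioned in Results follows from this definition in one step. The sketch below assumes base-2 logarithms and two fixed predictive densities with no free parameters left to integrate over, so that the Bayes factor reduces to a plain likelihood ratio; that simplification is an assumption made here for illustration.

```latex
\mathrm{IG}(\hat{p} \,\|\, p_{bl})
  = \frac{1}{N}\sum_{i=1}^{N}\left[\log_2 \hat{p}(x_i, y_i \mid I_i) - \log_2 p_{bl}(x_i, y_i)\right]
  = \frac{1}{N}\log_2 \frac{\prod_{i=1}^{N}\hat{p}(x_i, y_i \mid I_i)}{\prod_{i=1}^{N} p_{bl}(x_i, y_i)}
```

The ratio inside the logarithm is the likelihood of the N fixations under the model divided by their likelihood under the baseline, so a positive information gain means the data are more probable under the model.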

The baseline model is a 2D histogram model with a uniform regularization (to avoid zero bin counts) cross-validated between images (trained on all fixations for all observers on all other images). That is, reported baseline performance used all fixations from other images to predict the fixations for a specific image: It captures the image-independent spatial information in the fixations. Bin width and regularization parameters were optimized by gridsearch. If a saliency model captured all of the behavioral fixation biases but nothing about what causes parts of an image to attract fixations, it would do as well as the baseline model.
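A leave-one-image-out histogram baseline of this kind might be sketched as follows. The details are assumptions made for illustration: fixation coordinates are taken to be normalized to [0, 1), the uniform regularization is written as a mixture with weight `reg_weight`, and the bin counts are passed in directly rather than found by grid search.

```python
def baseline_density(fixations_by_image, held_out, n_bins_x, n_bins_y,
                     reg_weight):
    """Per-bin fixation probabilities estimated from all images but held_out.

    fixations_by_image maps an image id to a list of (x, y) fixations with
    coordinates normalized to [0, 1). The result is grid[by][bx], summing to 1.
    """
    counts = [[0.0] * n_bins_x for _ in range(n_bins_y)]
    n = 0
    for image_id, fixations in fixations_by_image.items():
        if image_id == held_out:
            continue  # never use the image we want to predict
        for x, y in fixations:
            bx = min(int(x * n_bins_x), n_bins_x - 1)
            by = min(int(y * n_bins_y), n_bins_y - 1)
            counts[by][bx] += 1.0
            n += 1
    if n == 0:
        raise ValueError("no fixations available from other images")
    uniform = 1.0 / (n_bins_x * n_bins_y)
    # Assumed form of the regularization: mix the histogram with a uniform
    # distribution so that no bin has zero probability.
    return [[(1.0 - reg_weight) * c / n + reg_weight * uniform for c in row]
            for row in counts]


# Made-up fixations on three images; predict image "a" from "b" and "c".
toy_fixations = {
    "a": [(0.1, 0.1), (0.9, 0.9)],
    "b": [(0.1, 0.2)],
    "c": [(0.6, 0.6)],
}
baseline_for_a = baseline_density(toy_fixations, "a", 2, 2, reg_weight=0.1)
```

Because the held-out image contributes nothing to its own baseline, the resulting log-likelihood is a cross-validated estimate of the image-independent structure.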

Fixation preferences that are inconsistent between observers are by definition unpredictable from fixations alone. If we have no additional knowledge about interobserver differences, the best predictor of an observer’s fixation pattern on a given image is therefore to average the fixation patterns from all other observers and add regularization. This is our gold standard model. It was created by blurring the fixations with a Gaussian kernel and including a multiplicative center bias (Phrasing Saliency Maps Probabilistically), learned by leave-one-out cross-validation between subjects. That is, the reported gold standard performance (for information gain and AUCs) always used only fixations from other subjects to predict the fixations of a specific subject, therefore giving a conservative estimate of the explainable information. It accounts for the amount of information in the spatial structure of fixations to a given image that can be explained while averaging over the biases of individual observers. This model is the upper bound on prediction in the dataset (see ref. 8 for a thorough comparison of this gold standard and other upper bounds capturing different constraints).
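The leave-one-subject-out construction can be sketched as a small kernel density estimate. This is a simplified illustration, not the model used in the paper: it omits the multiplicative center bias, uses a fixed kernel width `sigma` instead of a cross-validated one, and writes the regularization as an assumed mixture with a uniform distribution.

```python
import math


def gold_standard_density(fixations_by_subject, held_out, width, height,
                          sigma, reg_weight=0.01):
    """Density over pixels built from every subject except held_out."""
    density = [[0.0] * width for _ in range(height)]
    for subject, fixations in fixations_by_subject.items():
        if subject == held_out:
            continue  # predict this subject from the others only
        for fx, fy in fixations:
            for y in range(height):
                for x in range(width):
                    sq_dist = (x - fx) ** 2 + (y - fy) ** 2
                    density[y][x] += math.exp(-sq_dist / (2 * sigma ** 2))
    total = sum(map(sum, density))
    if total == 0:
        raise ValueError("no fixations from other subjects")
    uniform = 1.0 / (width * height)
    return [[(1 - reg_weight) * v / total + reg_weight * uniform for v in row]
            for row in density]


# Made-up fixations (x, y) in pixels for three subjects on a 10 x 10 image.
toy_subject_fixations = {"s1": [(2, 2)], "s2": [(2, 3)], "s3": [(7, 7)]}
gold_for_s3 = gold_standard_density(toy_subject_fixations, "s3", 10, 10,
                                    sigma=1.0)
```

In this toy case the density for subject s3 is concentrated near the fixations of s1 and s2, and s3's own fixation at (7, 7) receives little probability, which is why the estimate is conservative.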

Fixation-based Kullback–Leibler divergence for saliency maps. Upper Left shows a real saliency map (from eDN), Upper Right is inverted, Lower Left is the same map with binned saliency values, and in the Lower Right map, the saliency assigned to each bin is shuffled. These maps have identical fixation-based KL divergence (and very different log-likelihoods).

Phrasing Saliency Maps Probabilistically.

We treat the normalized saliency map [$s(x, y \mid I)$ denotes the saliency at point $(x, y)$ in image $I$] as the predicted gaze density for the fixations: $\hat{p}(x, y \mid I) \propto s(x, y \mid I)$. This definition marginalizes over previous fixation history and fixation timings, which are not included in any evaluated models.

Because many of the models were optimized for AUC, and because AUC is invariant to monotonic transformations whereas information gain is not, we cannot simply compare the models’ raw saliency maps to one another. The saliency map for each model was therefore transformed by a pointwise monotonic nonlinearity that was optimized to give the best log-likelihood for that model. This corresponds to picking the model with the best log-likelihood from all models that are equivalent (under AUC) to the original model.
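The difference in invariance can be checked on a toy example. The sketch below treats six made-up saliency values as a discrete map, computes AUC as the fraction of (fixated, nonfixated) pairs ranked correctly, and compares it with the average log-likelihood before and after squaring the saliency values. All names and numbers are illustrative.

```python
import math


def auc(fixated, nonfixated):
    """Fraction of (fixated, nonfixated) pairs ranked correctly; ties count half."""
    wins = 0.0
    for f in fixated:
        for n in nonfixated:
            if f > n:
                wins += 1.0
            elif f == n:
                wins += 0.5
    return wins / (len(fixated) * len(nonfixated))


def mean_log_likelihood(saliency, fixated_idx):
    """Average log2-probability of the fixated locations after normalization."""
    total = sum(saliency)
    return sum(math.log2(saliency[i] / total)
               for i in fixated_idx) / len(fixated_idx)


def split(values, fixated_idx):
    """Separate saliency values at fixated and nonfixated locations."""
    fixated = [values[i] for i in fixated_idx]
    nonfixated = [v for i, v in enumerate(values) if i not in fixated_idx]
    return fixated, nonfixated


saliency = [0.1, 0.2, 0.3, 0.9, 0.8, 0.05]  # made-up saliency values
fixated_idx = [3, 4, 0]                      # made-up fixated locations
squared = [v ** 2 for v in saliency]         # a monotonic transformation

auc_raw = auc(*split(saliency, fixated_idx))
auc_squared = auc(*split(squared, fixated_idx))
ll_raw = mean_log_likelihood(saliency, fixated_idx)
ll_squared = mean_log_likelihood(squared, fixated_idx)
```

Squaring preserves the ordering of the saliency values, so the two AUC values coincide, whereas the two log-likelihoods differ. This is why a nonlinearity has to be fitted before models tuned for AUC can be compared by information gain.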

Every saliency map was jointly rescaled to range from 0 to 1 (i.e., over all images at once, not per image, keeping contrast changes from image to image intact).

Then a Gaussian blur with radius σ was applied, which allowed us to compensate for models that make overly precise, confident predictions of fixation locations (25).

Next, the pointwise monotonic nonlinearity was applied. This nonlinearity was modeled as a continuous piecewise linear function supported in 20 equidistant points $x_i$ between 0 and 1 with values $y_i$ satisfying $0 \le y_0 \le \dots \le y_{19}$: $p_{\mathrm{nonlin}}(x,y) \propto f_{\mathrm{nonlin}}(s(x,y))$ with $f_{\mathrm{nonlin}}(x) = \frac{y_{i+1}-y_i}{x_{i+1}-x_i}(x - x_i) + y_i$ for $x_i \le x \le x_{i+1}$.

Finally, we included a center bias term (accounting for the fact that human observers tend to look toward the center of the screen) (25).

The center bias was modeled as $p_{\mathrm{cb}}(x,y) \propto f_{\mathrm{cb}}(d(x,y))\, p_{\mathrm{nonlin}}(x,y)$.

Here, $d(x,y) = \sqrt{(x-x_c)^2 + \alpha (y-y_c)^2}/d_{\max}$ is the normalized distance of $(x,y)$ to the center of the image $(x_c,y_c)$ with eccentricity $\alpha$, and $f_{\mathrm{cb}}(d)$ is again a continuous piecewise linear function that was fitted in 12 points.

All parameters were optimized jointly, using the L-BFGS SLSQP algorithm from scipy.optimize (38).
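The chain of transformations described in this section (blur, pointwise nonlinearity, center bias, normalization) can be sketched as below. This is an illustrative sketch under stated assumptions, not the published code: the function names, the clipping after the blur, and taking the image center at `((w-1)/2, (h-1)/2)` are choices made for the example, and the input map is assumed to be already rescaled to [0, 1] jointly over the dataset.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_to_density(s, sigma, y_nonlin, y_cb, alpha=1.0):
    """Turn a saliency map s (values in [0, 1]) into a fixation density."""
    h, w = s.shape
    # Gaussian blur; clip so values stay inside the support of the nonlinearity.
    blurred = np.clip(gaussian_filter(s, sigma), 0.0, 1.0)
    # Piecewise linear nonlinearity on equidistant support points in [0, 1].
    x_nonlin = np.linspace(0.0, 1.0, len(y_nonlin))
    p_nonlin = np.interp(blurred, x_nonlin, y_nonlin)
    # Normalized distance to the image center, with eccentricity alpha.
    ys, xs = np.mgrid[0:h, 0:w]
    xc, yc = (w - 1) / 2.0, (h - 1) / 2.0
    d = np.sqrt((xs - xc) ** 2 + alpha * (ys - yc) ** 2)
    d = d / d.max()
    # Multiplicative center bias, also piecewise linear in d.
    x_cb = np.linspace(0.0, 1.0, len(y_cb))
    p = np.interp(d, x_cb, y_cb) * p_nonlin
    return p / p.sum()

def mean_log_likelihood(density, fix_rows, fix_cols):
    """Average log2 density at the fixations relative to a uniform model (bits/fix)."""
    return float(np.mean(np.log2(density[fix_rows, fix_cols] * density.size)))
```

Fitting would then maximize `mean_log_likelihood` over `sigma`, `y_nonlin`, `y_cb`, and `alpha`, with the values in `y_nonlin` constrained to be nondecreasing so that the nonlinearity stays monotonic; the optimizer itself is left out of this sketch.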

Evaluating the Metrics on Probabilistic Models.

To evaluate metrics described above on the probabilistic models (the results shown in Fig. 1D), we used the log-probability maps as saliency maps. All other computations were as described above. An exception is the image-based KL divergence. Because this metric operates on probability distributions, our model predictions were used directly.

The elements of Fig. 2 are calculated as follows: First, we plot the model density for each model (column “density” in Fig. 2). This is $\hat{p}(x,y \mid I)$. Then we plot the model’s image-based prediction $\hat{p}(x,y \mid I)/p_{\mathrm{bl}}(x,y)$. It tells us where and how much the model believes the fixation distribution in a given image is different from the prior $p_{\mathrm{bl}}(x,y)$ (baseline).

Now we separate the expected information gain (an integral over space) into its constituent pixels, as $p_{\mathrm{gold}}(x,y \mid I)\log(\hat{p}(x,y \mid I)/p_{\mathrm{bl}}(x,y))$ [using the gold standard as an approximation for the real distribution $p(x,y \mid I)$]. Weighting by the gold standard $p_{\mathrm{gold}}(x,y \mid I)$ results in a weaker penalty for incorrect predictions in areas where there are fewer fixations. Finally, the last column in Fig. 2 shows the difference between the model’s information gain and the possible information gain, estimated by the gold standard, resulting in $p(x,y \mid I)\log(\hat{p}(x,y \mid I)/p(x,y \mid I))$.
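The two pixel-level maps can be computed directly from the three densities. The sketch below assumes all inputs are strictly positive arrays of the same shape that each sum to 1, and uses the gold standard in place of the true distribution; the function names are hypothetical.

```python
import numpy as np

def pixelwise_information_gain(p_model, p_baseline, p_gold):
    """Per-pixel term p_gold * log2(p_model / p_baseline).
    Summed over pixels, this is the expected information gain in bits/fix."""
    return p_gold * np.log2(p_model / p_baseline)

def pixelwise_gap_to_gold(p_model, p_gold):
    """Per-pixel term p_gold * log2(p_model / p_gold).
    Summed over pixels, this is minus the KL divergence from gold to model,
    so the total is never positive."""
    return p_gold * np.log2(p_model / p_gold)
```

Summing the second map gives the model's information gain minus the gold standard's information gain, which is why it localizes where explainable information is being missed.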

Note that this detailed evaluation is not possible with existing saliency metrics (Table S1).

Generalization to Spatiotemporal Scanpaths.

The models we consider in this paper are purely spatial: They do not include any temporal dependencies. A complete understanding of human fixation selection would require an understanding of spatiotemporal behavior, that is, scanpaths. The model adaptation and optimization procedure we describe above can be easily generalized to account for temporal effects. For details see SI Text, Generalization to Spatiotemporal Scanpaths.

SI Text

Model Performances as Log-Likelihoods

In Fig. S2, we report the average log-likelihoods of the tested models. All reported log-likelihoods are relative to the maximum entropy model predicting a uniform fixation distribution.

The gold standard model shows that the total mutual information between the image and the spatial structure of the fixations amounts to 2.1 bits/fix. To give another intuition for this number, a model that would for every fixation always correctly predict the quadrant of the image in which it falls would also have a log-likelihood of 2 bits/fix.
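The quadrant figure can be checked directly. As an illustrative assumption, take an image of area $A$ and a model that spreads its density uniformly over the correct quadrant (area $A/4$); relative to the uniform maximum entropy model, its log-likelihood per fixation is:

```latex
\log_2 \frac{\hat{p}(x)}{p_{\mathrm{unif}}(x)}
  = \log_2 \frac{4/A}{1/A}
  = \log_2 4
  = 2 \ \text{bits/fix}.
```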

The lower-bound model is able to explain 0.89 bits/fix of this mutual information. That is, 42% of the information in spatial fixation distributions can be accounted for by behavioral biases (e.g., the bias of human observers to look at the center of the image).

The eDN model performs best of all of the saliency models compared, with 1.29 bits/fix, capturing 62% of the total mutual information. It accounts for 19% more than the lower-bound model or 34% of the possible information gain (1.21 bits/fix) between baseline and gold standard.

Fig. S2 also shows performances where only a subset of our optimization procedure was performed, allowing the contribution of different stages of our optimization to be assessed. Considering only model performance (i.e., without also including center bias and blur factors; the pink sections in Fig. S2) shows that many of the models perform worse than the lower-bound model. This means that the center bias is more important than the portion of image-based saliency that these models do capture (39). Readers will also note that the center bias and blurring factors account for very little of the performance of the Judd model and the eDN model relative to most other models. This is because these models already include a center bias that is optimized for the Judd dataset.

Gold Standard Convergence

The absolute performance level of the gold standard (the estimate of explainable information gain) depends on the size of the dataset. With fewer data points, the true gold standard performance will be underestimated because more regularization is required to generalize across subjects. With enough data, our estimate of the gold standard will converge to the true gold standard performance.

To examine the convergence of our gold standard estimate in the dataset we use, we repeated our cross-validation procedure using, for each subject, only a subset of the other 14 subjects. Fig. S1 shows the average gold standard performance (in bits per fixation) as a function of the number of other subjects used for cross-validation. The curve rapidly increases and then begins to flatten as we reach the full dataset size. This result indicates that more data would be required to gain a precise estimate of the true gold standard performance. Nevertheless, that the curve begins to saturate indicates that more data are unlikely to qualitatively change the results we report here. If anything, the gold standard performance would increase, reducing our estimate of the explainable information gain explained (34%) even further.

Kienzle Dataset

We repeated the full evaluation on the dataset of Kienzle et al. (24). It consists of 200 grayscale images of size 1,024×678 px and 15 subjects. This dataset is of special interest, as the authors removed the photographer bias by using random crops from larger images. The results are shown in Fig. S3.

In this dataset, even less of the possible information gain (22%) is covered by the best model (here, GBVS; note that we were not able to include eDN in this comparison, as the source code was not yet released at the time of the analysis). Removing the photographer bias leads to a smaller contribution (34%) of the nonparametric model compared with the increase in log-likelihood by saliency map-based models. The possible information gain (0.92 bits/fix) is smaller than for the Judd dataset (1.21 bits/fix). There are multiple possible reasons for this. Primarily, this dataset contains no pictures of people but many natural images. In addition, the images are in grayscale.

Pixel-Based Analysis on Entire Dataset

In Fig. S5, we display each image in the dataset according to its possible information gain and the percentage of that information gain explained by the model. In this space, points to the bottom right represent images that contain a lot of explainable information in the fixations that the model fails to capture. Points show all images in the dataset, and for a subset of these we have displayed the image itself (Fig. S5 A and C) and the information gain difference to the gold standard (Fig. S5 B and D). For the eDN model (Fig. S5 A and B), the images in the bottom right of the plot tend to contain human faces. The Judd model contains an explicit face detection module, and as can be seen in Fig. S5 C and D, it tends to perform better on these images. In terms of the whole dataset, however, the eDN model performs better on images with a moderate level of explainable information (around 3 bits/fix).

Distribution of information gains and explained information (both relative to a uniform baseline model) over all images in the dataset. Each black dot represents an image from the dataset. For some images we show the actual image (A and C) and the information gain difference from the gold standard (B and D). These plots allow model performance to be assessed on all images in the dataset. Points in the lower right of the scatterplots are images where a lot of information could be explained but is not; these are where the model could be best improved for a given dataset. The pixel-space information gain scatter plots (B and D) show exactly where in the images the model predictions fail.

Existing Metrics

We evaluate the models on several prominent metrics. The area under the curve (AUC) metrics are the most widely used. They calculate the performance of the model when using the saliency map as classifier score in a two-alternative forced-choice (2AFC) task where the model has to separate fixations from nonfixations. There are several variants of AUC scores, differing by the nonfixation distribution used and in approximations to speed up computation. We use all sample values as thresholds, therefore using no approximation. AUC wrt. uniform uses a uniform nonfixation distribution, i.e., the full saliency map as nonfixations [this corresponds to “AUC-Judd” in the MIT Benchmark (25)]. AUC wrt. center bias uses the fixations from all other images as nonfixations, thus capturing structure unrelated to the image [behavioral biases, primarily center bias (3, 4, 39)]. This corresponds to “sAUC” in the MIT benchmark (“shuffled AUC”).
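As a rough illustration of the two AUC variants, the sketch below computes the exact AUC (every sample value acts as a threshold) as the probability that a fixated location outscores a nonfixation sample, counting ties as one half. The function names and the index-array inputs are assumptions made for the example.

```python
import numpy as np

def auc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = np.asarray(pos, dtype=float).ravel()
    neg = np.sort(np.asarray(neg, dtype=float).ravel())
    less = np.searchsorted(neg, pos, side="left")   # negatives strictly below each positive
    leq = np.searchsorted(neg, pos, side="right")   # negatives below or equal to each positive
    return float(np.mean((less + leq) / 2.0) / neg.size)

def auc_uniform(saliency, fix_rows, fix_cols):
    """AUC wrt. uniform: the full saliency map serves as the nonfixations."""
    return auc(saliency[fix_rows, fix_cols], saliency)

def auc_shuffled(saliency, fix_rows, fix_cols, other_rows, other_cols):
    """AUC wrt. center bias: fixations from other images serve as the nonfixations."""
    return auc(saliency[fix_rows, fix_cols], saliency[other_rows, other_cols])
```

Because only the ordering of saliency values enters, any strictly increasing transformation of the map leaves both scores unchanged.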

Confusingly, there are two completely independent measures referred to as “Kullback–Leibler divergence” used in the saliency literature. We discuss the precise definitions of these metrics and their relationship to information gain as used in this paper in SI Text, KL Divergence. What we refer to as image-based Kullback–Leibler (KL) divergence treats the saliency maps as 2D probability distributions and calculates the KL divergence between the model distribution and an approximated true distribution (8, 39). To compute this metric, the saliency maps were rescaled to have a maximum of 1 and a minimum of at least $10^{-20}$ over all maps. The saliency maps are then divided by the sum of their values to convert them into probability distributions. We use our gold standard as the true distribution.

The other variant of KL divergence, here called fixation-based KL divergence, calculates the KL divergence between the distribution of saliency values at fixations and the distribution of saliency values at some choice of nonfixations (40). We use histograms with 10 bins to calculate the KL divergence. For the nonfixations, we use all saliency values [fixation-based (f.b.) DKL wrt. uniform] or the saliency values at the fixation locations of the fixations from all other images (f.b. DKL wrt. center bias).
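A minimal sketch of the fixation-based KL divergence with 10 histogram bins is given below. It assumes saliency values in [0, 1]; the small constant `eps` that keeps empty bins from producing infinities is an assumption of this example, not something specified in the text.

```python
import numpy as np

def fixation_based_kl(s_fix, s_nonfix, bins=10, eps=1e-12):
    """KL divergence (in bits) between histograms of saliency values
    at fixations and at nonfixations."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(s_fix, bins=edges)
    q, _ = np.histogram(s_nonfix, bins=edges)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log2(p / q)))
```

Replacing every saliency value `s` by `1 - s` only mirrors the order of the histogram bins, so the score is unchanged even though the inverted map makes the opposite prediction everywhere.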

Normalized scanpath saliency (NSS) normalizes each saliency map to have zero mean and unit variance and then takes the mean saliency value over all fixations.

The correlation coefficient (CC) metric normalizes the saliency maps of the model and the saliency maps of the approximated true distribution (gold standard) to have zero mean and unit variance and then calculates the correlation coefficient of these maps over all pixels.
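Both NSS and CC reduce to a z-scoring step followed by an average, as the following sketch shows. The function names and the row/column index inputs are assumptions made for the example.

```python
import numpy as np

def nss(saliency, fix_rows, fix_cols):
    """Normalized scanpath saliency: mean z-scored saliency at the fixations."""
    z = (saliency - saliency.mean()) / saliency.std()
    return float(z[fix_rows, fix_cols].mean())

def cc(saliency, gold):
    """Correlation coefficient between the z-scored model map and gold standard map."""
    zs = (saliency - saliency.mean()) / saliency.std()
    zg = (gold - gold.mean()) / gold.std()
    return float(np.mean(zs * zg))
```

Unlike AUC, both scores change under nonlinear monotonic transformations of the saliency map, because the z-scoring depends on the actual values and not only on their ranks.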

Detailed Comparison of Log-Likelihoods, AUC, and KL Divergence

Here we consider the relationship between log-likelihoods and prominent existing saliency metrics: AUC and KL divergence.

AUC.

The most prominent metric used in the saliency literature is the area under the receiver operating characteristic curve (AUC). The AUC is the area under a curve of model hit rate against false positive rate for each threshold. It is equivalent to the performance in a 2AFC task where the model is “presented” with two image locations: one at which an observer fixated and another from a nonfixation distribution. The thresholded saliency value is the model’s decision, and the percentage correct of the model in this task across all possible thresholds is the AUC score. The different versions of AUC used in saliency research differ primarily in the nonfixation distribution used. This is usually either a uniformly selected distribution of not-fixated points across the image (e.g., in ref. 25) or the distribution of fixations for other images in the database [the shuffled AUC (3, 4, 39)]. The latter provides an effective control against center bias (a tendency for humans to look in the center of the screen, irrespective of the image content), by ensuring that both fixation and nonfixation distributions have the same image-independent bias. It is important to bear in mind that this measure will penalize models that explicitly try to model the center bias. The AUC therefore depends critically on the definition of the nonfixation distribution. In the case of the uniform nonfixation distribution, AUC is tightly related to area counts: Optimizing for AUC with uniform nonfixation distribution is equivalent to finding for each percentage 0≤r≤100 the area consisting of r% of the image that includes most fixations (10).

One characteristic of the AUC that is often considered an advantage is that it is sensitive only to the rank order of saliency values, not their scale (i.e., it is invariant under monotonic pointwise transformations) (39). This allows the modeling process to focus on the shape (i.e., the geometry of iso-saliency points) of the distribution of saliency without worrying about the scale, which is argued to be less important for understanding saliency than the contour lines (39). However, in certain circumstances the insensitivity of AUC to differences in saliency can lead to counterintuitive behavior, if we accept that higher saliency values are intuitively associated with more fixations.

By using the likelihood of points as a classifier score, one can compute the AUC for a probabilistic model just as for saliency maps. This has a principled connection with the probabilistic model itself: If the model performed the 2AFC task outlined above using maximum-likelihood classification, then the model’s performance is exactly the AUC. Given the real fixation distribution, it can also be shown that the best saliency map in terms of AUC with uniform nonfixation distribution is exactly the gaze density of the real fixations. However, this does not imply that a better AUC score will yield a better log-likelihood or vice versa. For more details and a precise derivation of these claims, see ref. 10.

Kullback–Leibler Divergence.

KL divergence is tightly related to log-likelihoods. However, KL divergence as used in practice in the saliency literature is not the same as the approach we advocate.

In general, the KL divergence between two probability distributions $p$ and $q$ is given by $D_{\mathrm{KL}}[p \,\|\, q] = \int \log\left(\frac{p(x)}{q(x)}\right) p(x)\,dx$ and is a popular measure of the difference between two probability distributions. In the saliency literature, there are at least two different model comparison metrics that have been called Kullback–Leibler divergence. Thus, when a study reports a KL metric, one needs to check how this was computed. The first variant treats the saliency map as a 2D probability distribution and computes the KL divergence between this predicted distribution and the empirical density map of fixations (8, 39); we call this image-based KL divergence. The second metric referred to as Kullback–Leibler divergence is the KL divergence between the distribution of saliency values at fixations and the distribution of saliency values at nonfixation locations; we call this fixation-based KL divergence (40). This is calculated by binning the saliency values at fixations and nonfixations into a histogram and then computing the KL divergence of these histograms. Like AUC, it depends critically on the definition of the nonfixation distribution and additionally on the histogram binning. In Table S3 we list a number of papers using one of these two definitions of KL divergence.

We now precisely show the relationship between these measures and our information theoretic approach. Very generally, information theory can be derived from the task of assigning code words to different events that occur with different probabilities such that their average code word length becomes minimal. It turns out that the negative log-probability is a good approximation to the optimal code word length possible, which gives rise to the definition of the log-loss: $l(x) = -\log p(x)$. In the case of a discrete uniform distribution $p(x) = 1/n$ the log-loss for any possible $x$ is simply $\log n$, i.e., the log of the number of possible values of $x$. Accordingly, the more ambiguous the possible values of a variable are, the larger its average log-loss, which is also known as its entropy: $H[X] = E[-\log p(x)]$. If $p(x)$ denotes the true distribution that accurately describes the variable behavior of $x$ and we have a model $q(x)$ of that distribution, then we can think of assigning code words to different values of $x$ that are of length $-\log q(x)$ and compute the average log-loss for the model distribution: $E[-\log q(x)] = -\int p(x)\log q(x)\,dx = H[X] + D_{\mathrm{KL}}[p(x) \,\|\, q(x)]$. That is, the KL divergence measures how much the average log-loss of a model distribution $q(x)$ exceeds the average log-loss of the true distribution. The KL divergence is also used to measure the information gain of an observation if $p(x)$ denotes a posterior distribution that correctly describes the variability of $x$ after the observation has been made whereas $q(x)$ denotes the prior distribution. In a completely analogous fashion we can measure how much more or less information one model distribution $q_1(x)$ provides about $x$ than an alternative model $q_2(x)$ does by computing how much the average log-loss of model 1 is reduced (or increased) relative to the average log-loss of model 2.
This can also be phrased as an expected log-likelihood ratio (ELLR; the concept of log-likelihood ratios is familiar to readers with knowledge of model comparison using, e.g., $\chi^2$ tests): $\mathrm{ELLR} := E[-\log q_2(x)] - E[-\log q_1(x)] = E[\log q_1(x)] - E[\log q_2(x)] = \int p(x)\log\frac{q_1(x)}{q_2(x)}\,dx$. In other words, very generally, the amount of information model 1 provides about a variable relative to model 2 can be measured by asking how much more efficiently the variable can be encoded when assuming the corresponding model distribution $q_1(x)$ instead of $q_2(x)$ for the encoding. Note that this reasoning does not require either of the two model distributions to be correct. For example, in the context of saliency maps we can ask what the best possible model distribution is that does not require any knowledge of the actual image content. This baseline model can capture general biases of the subjects such as the center bias. To evaluate the information provided by a saliency map that can be assigned to the specific content of an image we thus have to ask how much more information the model distribution of that saliency model provides relative to the baseline model.

Our information gain metric reported in the main text is exactly the ELLR, where $q_1$ is the model, $q_2$ is the baseline, and we estimated the expectation value using the sample mean estimator. The ELLR can be rewritten as a difference between KL divergences: $\mathrm{ELLR} = E[\log(q_1(x)/q_2(x))] = E[\log q_1(x)] - E[\log q_2(x)] = D_{\mathrm{KL}}[p(x) \,\|\, q_2(x)] - D_{\mathrm{KL}}[p(x) \,\|\, q_1(x)]$. This naturally raises the question: Is our measure equivalent to the KL divergence that has been used in the saliency literature? The answer is no.
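The step from the ELLR to the difference of KL divergences follows from the decomposition of the average log-loss into entropy plus KL divergence. Spelled out, with all expectations taken under the true distribution $p$ (a routine derivation, included here only as a worked check):

```latex
\begin{aligned}
\mathrm{ELLR} &= E[\log q_1(x)] - E[\log q_2(x)] \\
  &= -\bigl(H[X] + D_{\mathrm{KL}}[p \,\|\, q_1]\bigr)
     + \bigl(H[X] + D_{\mathrm{KL}}[p \,\|\, q_2]\bigr) \\
  &= D_{\mathrm{KL}}[p \,\|\, q_2] - D_{\mathrm{KL}}[p \,\|\, q_1].
\end{aligned}
```

The entropy term cancels, so the ELLR is positive exactly when $q_1$ is closer to $p$ than $q_2$ in the KL sense.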

It is crucial to note that in the past the scale used for saliency maps was only a rank scale. This was the case because AUC was the predominant performance measure and is invariant under monotonic transformations of the saliency values. That is, two saliency maps $S_1(x)$ and $S_2(x)$ were considered equivalent if a strictly monotonic increasing function $g: \mathbb{R} \to \mathbb{R}$ exists such that $S_1(x) = g(S_2(x))$. In contrast, in the equation for ELLR, the two distributions $q_1$ and $q_2$ are directly proportional to the saliency map times the center bias distribution and well defined only if the scale used for saliency maps is meaningful. In other words, if one applies a nonlinear invertible function to a saliency map, the ELLR changes.

Fixation-based KL divergence is the more common variant in the literature: Researchers wanted to apply information theoretic measures to saliency evaluation while remaining consistent with the rank-based scale of AUC (40). Therefore, they did not interpret saliency maps themselves as probability distributions, but applied the KL divergence between the distribution of saliency values obtained at fixations and that obtained at nonfixations. We emphasize that this measure has an important conceptual caveat: Rather than being invariant under only monotonic increasing transformations, KL divergence is invariant under any reparameterization. This implies that the measure cares only about which areas are of equal saliency, but does not care about which of any two areas is actually the more salient one. For illustration, for any saliency map $S(x,y)$, its negative counterpart $\bar{S}(x,y) := \sup(S) - S(x,y)$ is completely equivalent with respect to the fixation-based KL metric, even though for any two image regions $\bar{S}$ would always make the opposite prediction about their salience (see Fig. S4 for this as well as other examples). Furthermore, the measure is sensitive to the histogram binning used, and in the limit of small bin width all models have the same KL divergence: the model-independent KL divergence between $p(x_{\mathrm{fix}})$ and $p(x_{\mathrm{nonfix}})$.

Image-based KL-divergence requires that the saliency maps are interpreted as probability distributions. Previous studies using this method (Table S3) simply divided the saliency values by their sum to obtain such probability distributions. However, they did not consider that this measure is sensitive to the scale used for the saliency maps. Optimization of the pointwise nonlinearity (i.e., the scale) has a huge effect on the performance of the different models. More generally, realizing that image-based KL divergence treats saliency maps as probability distributions means that other aspects of density estimation, like center bias and regularization strategies (blurring), must also be taken into account.

The only conceptual difference between image-based KL divergence and log-likelihoods is that for estimating expected log-likelihood ratios, it is not necessary to have a gold standard. One can simply use the unbiased sample mean estimator (SI Text, Estimation Considerations). Furthermore, by conceptualizing saliency in an information-theoretic way, we can not only assign meaning to expected values (such as ELLR or DKL) but also know how to measure the information content of an individual event (here, a single fixation), using the notion of its log-loss (see our application on the individual pixel level in the main text). Thus, although on a theoretical level log-likelihoods and image-based KL divergence are tightly linked, on a practical level a fundamental reinterpretation of saliency maps as probability distributions is necessary.

Estimation Considerations

One principal advantage of using log-likelihoods instead of image-based KL divergence is that for all model comparisons except comparing against the gold standard we do not have to rely on the assumptions made for the gold standard but can simply use the unbiased sample mean estimator: $\hat{E}[\log(q_1(x)/q_2(x))] = \frac{1}{N}\sum_{k=1}^{N}\log\frac{q_1(x_k)}{q_2(x_k)}$. This is why we used the sample mean estimator for all model comparisons rather than the gold standard to estimate the ELLR.
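The sample mean estimator needs only the two model densities evaluated at the observed fixations, as in this minimal sketch. The function name and the row/column index inputs are assumptions of the example; densities are assumed strictly positive.

```python
import numpy as np

def ellr_sample_mean(q1, q2, fix_rows, fix_cols):
    """Sample mean estimate of E[log2(q1/q2)] in bits/fix from observed fixations."""
    ratio = q1[fix_rows, fix_cols] / q2[fix_rows, fix_cols]
    return float(np.mean(np.log2(ratio)))
```

No estimate of the true fixation density enters the computation: the fixations themselves play the role of samples from it.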

However, estimating the upper limit on information gain still requires a gold standard [an estimate of the true distribution p(x)]. Image-based KL divergence requires this not only for estimating the upper bound, but also for calculating the performance of any model. There, it has usually been done using a 2D histogram or Gaussian kernel density estimate (Table S3), and the hyperparameters (e.g., bin size, kernel size) have commonly been chosen based on fovea size or eye tracker precision. In our framework of interpreting saliency maps as probability distributions, a principled way of choosing these hyperparameters is to cross-validate over them to get the best possible estimate of the true distribution.

For our dataset, the optimal cross-validated kernel size was 27 pixels, which is relatively close to the commonly used kernel size of 1° (37 pixels). However, with more fixations in the dataset the optimal cross-validated kernel sizes will shrink, because the local density can be estimated more precisely. Therefore, choosing these hyperparameters on criteria other than cross-validation will produce inaccurate estimates of the ELLR in the large data limit.

Because we conclude that our understanding of image-based saliency is surprisingly limited, we used a conservative, downward-biased strategy for estimating the information gain of the gold standard, so that we obtain a conservative upper bound on the fraction of image-based saliency that we understand. To this end, we not only used the unbiased sample estimator for averaging over the true distribution but also resorted to a cross-validation strategy for estimating the gold standard that takes into account how well the distributions generalize across subjects, $\hat{E}[p_{\mathrm{gold}}] = \sum_{j=1}^{M}\frac{1}{N_j}\sum_{k=1}^{N_j}\log p_{\mathrm{gold}}(x_{jk} \mid j)$, where the first sum runs over all subjects $j$ and $p_{\mathrm{gold}}(x_{jk} \mid j)$ denotes a kernel density estimator that uses all fixations except those of subject $j$. For comparison, if one simply used the plain sample mean estimator for the gold standard, the fraction explained would drop to an even smaller value of only 22%. Our approach makes it very likely that the true value falls in the range between 22% and 34%.

Generalization to Spatiotemporal Scanpaths

The models we consider in this paper are purely spatial: They do not include any temporal dependencies. A complete understanding of human fixation selection would require an understanding of spatiotemporal behavior, that is, scanpaths. The model adaptation and optimization procedure we describe above can be easily generalized to account for temporal effects, as follows.

A scanpath consists of $N$ fixations with positions $x_i, y_i, t_i$, where $x_i$ and $y_i$ denote the spatial position of the fixation in the image and $t_i$ denotes the time of the fixation. A scanpath can be viewed as a sample of a 3D point process (12). Conceiving of scanpaths as 3D point processes allows us to model the joint probability distribution of all fixations of a subject on an image. In general, a model’s average log-likelihood is $\frac{1}{N}\sum_k \log \hat{p}(x_k)$, where $\hat{p}$ is the probability distribution of the model and $x_k$, $k = 1,\dots,N$ are samples from the probabilistic process that we would like to model. Our likelihoods are therefore of the form $\hat{p}(x_1,y_1,t_1,\dots,x_N,y_N,t_N,N \mid I)$, where $N$ is part of the data distribution (not a fixed parameter) and $I$ denotes the image for which the fixations should be predicted. By the chain rule, this is decomposed into conditional likelihoods: $\hat{p}(x_1,y_1,t_1,\dots,x_N,y_N,t_N,N \mid I) = \hat{p}(N \mid I)\prod_{i=1}^{N}\hat{p}(x_i,y_i,t_i \mid N,x_1,y_1,t_1,\dots,x_{i-1},y_{i-1},t_{i-1},I)$.
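In log space the chain rule decomposition becomes a sum of conditional terms, which the following sketch accumulates for a single scanpath. The two callables `log_p_length` and `log_p_next` are hypothetical placeholders for whatever a concrete spatiotemporal model provides; conditioning on the image is assumed to happen inside them.

```python
def scanpath_log_likelihood(scanpath, log_p_length, log_p_next):
    """Log-likelihood of one scanpath, a list of (x, y, t) fixations.

    log_p_length(n) returns the log-probability of observing n fixations;
    log_p_next(fix, history, n) returns the log-density of the next fixation
    given all previous fixations and the total count n.
    """
    n = len(scanpath)
    total = log_p_length(n)
    for i, fix in enumerate(scanpath):
        total += log_p_next(fix, scanpath[:i], n)
    return total
```

A purely spatial model is the special case in which `log_p_next` ignores both the history and the fixation time.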

The above holds true for any 3D point process. In this way, the model comparison framework we propose in this paper is general in that it can account for spatiotemporal fixation dependencies (see ref. 12 for a recent application of spatiotemporal point processes to the study of scanpaths).

Acknowledgments

We thank Lucas Theis for his suggestions and Eleonora Vig and Benjamin Vincent for helpful comments on an earlier draft of this manuscript. We acknowledge funding from the Deutsche Forschungsgemeinschaft (DFG) through the priority program 1527, research Grant BE 3848/2-1. T.S.A.W. was supported by a Humboldt Postdoctoral Fellowship from the Alexander von Humboldt Foundation. We further acknowledge support from the DFG through the Werner-Reichardt Centre for Integrative Neuroscience (EXC307) and from the BMBF through the Bernstein Center for Computational Neuroscience (FKZ: 01GQ1002).
