Kirsty McNaught, Maneesh Sahani, University College London, United Kingdom

Abstract:

Human vision is foveated, with much higher resolution at the centre of gaze than in the periphery. When viewing a scene, humans move their eyes several times a second to bring different parts of the scene into foveal vision. This necessitates an active sampling mechanism for optimal parsing of scene content. Previous work has shown that fixated locations in natural images have higher contrast than other image locations. We propose a normative explanation for these observations by calculating the expected information gain associated with a particular fixation, based on natural image statistics at different resolutions. We train a model to predict the expected foveal retinal ganglion cell responses for an image patch given an existing observation at peripheral resolution. Our model outputs both a mean prediction and an uncertainty. We predict that image patches for which the prediction has high uncertainty offer the most information gain and should therefore be strong contenders for the next fixation. We analyse human gaze data to show that fixated image patches are associated with a higher conditional entropy than a reference ensemble, and that fixation durations are positively correlated with conditional entropy (expected surprise).
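
The selection rule described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the model's uncertainty for each patch is a diagonal Gaussian over predicted foveal responses, so that the conditional entropy has a closed form; the abstract does not state the form of the predictive distribution, and the function names and toy variances below are hypothetical.

```python
import math


def gaussian_entropy(variances):
    """Differential entropy (in nats) of a diagonal Gaussian.

    `variances` holds one predictive variance per response unit.
    Assumption: Gaussian predictive uncertainty, not stated in the abstract.
    """
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * v) for v in variances)


def rank_patches(patch_variances):
    """Rank candidate patches from highest to lowest conditional entropy.

    Returns (order, entropies), where `order` lists patch indices and the
    first entry is the strongest contender for the next fixation.
    """
    entropies = [gaussian_entropy(v) for v in patch_variances]
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return order, entropies


# Toy example: three candidate patches, two response units each (made-up values).
toy_variances = [
    [0.1, 0.1],  # well predicted from the peripheral view
    [1.0, 2.0],  # poorly predicted, so high expected information gain
    [0.5, 0.5],
]
order, entropies = rank_patches(toy_variances)
```

In this toy case the second patch has the largest predictive variance and therefore ranks first. Under the Gaussian assumption, entropy depends only on the predicted variances and not on the mean prediction.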