Cognitive Sciences Stack Exchange is a question and answer site for practitioners, researchers, and students in cognitive science, psychology, neuroscience, and psychiatry.

I've gathered the standard rationale for a visual system utilizing saccades from perception textbooks: the neural cost of processing an entire scene at a high level of detail would be prohibitive, but low-fidelity images aren't good enough to function in the world. Thus you have a retina with a high-fidelity center that can sample the scene by saccading around, presumably guided by a combination of information from the low-fidelity surround and the beliefs and goals of the perceiver.

However, I've yet to find a detailed theory that answers the question of how the next saccade target is selected.

To clarify, by 'detailed theory', I mean a theory that describes the mechanisms used to accomplish the computational feats described in the standard textbook explanation I mentioned above.

1 Answer

Treisman & Gelade's Feature Integration Theory suggests that we are able to process an entire visual scene in parallel at the level of individual features. For example, in a visual search task, the time required to find a blue circle in a field of red circles is independent of the total number of circles. However, focused attention (typically foveal) is required to integrate independent features into a cohesive object. Thus, if searching for a red circle in a field of blue circles and red squares, search time grows linearly with the total number of objects. This is because the target is made up of two features (circle and red) which need to be integrated in order for it to be identified, requiring saccades around the scene.
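As a rough sketch, FIT's two search-time predictions can be written down as a toy model; the base time and per-item slope below are invented for illustration, not fit to any real reaction-time data:

```python
# Toy illustration of Feature Integration Theory's search-time
# predictions. The constants are made up, not empirical values.

def predicted_search_time(n_items, conjunction, base_ms=400.0, slope_ms=30.0):
    """Predicted time (ms) to find the target among n_items objects.

    Feature search (e.g. blue among red): parallel and pre-attentive,
    so search time is flat in set size.
    Conjunction search (e.g. red circle among blue circles and red
    squares): serial attention shifts, so time grows with set size.
    """
    if conjunction:
        # On average the target is found after inspecting half the items.
        return base_ms + slope_ms * n_items / 2
    return base_ms  # independent of set size

print(predicted_search_time(8, conjunction=False))   # 400.0
print(predicted_search_time(32, conjunction=False))  # 400.0 (still flat)
print(predicted_search_time(8, conjunction=True))    # 520.0
print(predicted_search_time(32, conjunction=True))   # 880.0 (linear growth)
```

The flat curve for feature search versus the linear curve for conjunction search is exactly the empirical signature Treisman & Gelade reported.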

Several theories of visual search use this distinction to model shifts of visual attention, most notably Jeremy Wolfe's Guided Search and Itti & Koch's visual attention model. The basic premise behind both models is similar: low-level feature detectors respond automatically and in parallel to the entire visual field, yielding many individual feature maps that represent the bottom-up saliency of locations in the visual scene. This bottom-up saliency can be sufficient to trigger a saccade; for instance, a feature map that responds to local motion helps an organism detect moving predators. Thus, areas with motion are given high value because they have a history of providing information that benefits the organism.
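The bottom-up pathway can be sketched in a few lines: normalize each feature map, sum them into a single saliency map, and let a winner-take-all step pick the next saccade target. This is a minimal caricature, not the actual Itti & Koch implementation (which adds center-surround differences, multiple scales, and inhibition of return); the maps and feature names are invented:

```python
# Minimal sketch of combining parallel feature maps into a saliency
# map, in the spirit of Itti & Koch. Maps are plain 2D lists.

def normalize(feature_map):
    """Scale a 2D map to [0, 1] so features with different response
    ranges contribute comparably to the combined map."""
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in feature_map]

def saliency(feature_maps):
    """Sum normalized feature maps into one bottom-up saliency map."""
    maps = [normalize(m) for m in feature_maps]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) for c in range(cols)]
            for r in range(rows)]

def next_saccade_target(saliency_map):
    """Winner-take-all: the (row, col) of the saliency peak."""
    best = max((v, r, c)
               for r, row in enumerate(saliency_map)
               for c, v in enumerate(row))
    return best[1], best[2]

# Two toy 3x3 feature maps, e.g. colour contrast and local motion.
color = [[0, 1, 0], [0, 0, 0], [0, 0, 2]]
motion = [[0, 0, 0], [0, 0, 0], [0, 0, 5]]
print(next_saccade_target(saliency([color, motion])))  # (2, 2)
```

The location that is strong on both maps wins, which is the sense in which saliency "guides" the next fixation before any object has been recognized.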

During task conditions (such as visual search), top-down saliency maps may also be created based on knowledge of what things in the environment have value. If I am searching for my umbrella, I know that it is blue and long and straight, and this information can be encoded in the feature maps that drive saccades.

More generally, saccades are directed at targets that have a high expected value. (It has even been shown that saccade velocity is proportional to the expected value of the target: Shadmehr et al.) This value is determined from a weighted evaluation of both top-down and bottom-up feature maps, available pre-attentively.
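One way to picture that weighted evaluation: combine the two kinds of maps with free weighting parameters and take the peak of the result. The weights here are arbitrary, not values from Wolfe's or Itti & Koch's published models:

```python
# Hedged sketch: expected value of each location as a weighted sum of
# a bottom-up saliency map and a top-down (task-driven) map.
# w_bu and w_td are illustrative free parameters.

def expected_value(bottom_up, top_down, w_bu=0.4, w_td=0.6):
    rows, cols = len(bottom_up), len(bottom_up[0])
    return [[w_bu * bottom_up[r][c] + w_td * top_down[r][c]
             for c in range(cols)] for r in range(rows)]

# A physically salient location at (0, 0) vs a task-relevant one at
# (1, 1) -- e.g. something blue, long, and straight when searching
# for an umbrella. With the task weighted more heavily, the
# task-relevant location wins.
bottom_up = [[1.0, 0.0], [0.0, 0.0]]
top_down = [[0.0, 0.0], [0.0, 1.0]]
ev = expected_value(bottom_up, top_down)
peak = max((v, r, c) for r, row in enumerate(ev)
           for c, v in enumerate(row))
print((peak[1], peak[2]))  # (1, 1)
```

Shifting the weights toward `w_bu` would instead send the saccade to the physically salient location, which is one way to model the difference between free viewing and goal-directed search.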

The exact location of a saccade is determined through a process called spatial pooling, which attempts to determine the 'center of gravity' of a target, again using low-level feature maps. While saccades are amazingly quick and accurate, there is of course some error in the final saccadic position, which often requires smaller corrective saccades to reach the target. It has recently been suggested that these sequences of saccadic movements obey Fitts' law with regard to the speed-accuracy tradeoff. A great, thorough review of the current state of saccadic eye movements can be found in Kowler, 2011.
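The 'center of gravity' computation itself is just a saliency-weighted mean position; a minimal sketch, assuming the target region has already been isolated in a feature map:

```python
# Sketch of spatial pooling: the saccade landing point is the
# saliency-weighted centroid of the target region's activity.

def center_of_gravity(saliency_map):
    total = r_sum = c_sum = 0.0
    for r, row in enumerate(saliency_map):
        for c, v in enumerate(row):
            total += v
            r_sum += v * r
            c_sum += v * c
    return (r_sum / total, c_sum / total)

# A uniform 2x2 blob of activity spanning rows/cols 1-2 pools to the
# point midway between them, not to any single active cell.
m = [[0, 0, 0],
     [0, 1, 1],
     [0, 1, 1]]
print(center_of_gravity(m))  # (1.5, 1.5)
```

Because the centroid averages over the whole active region, nearby distractor activity pulls the landing point off the true target center, which is one account of why the initial saccade undershoots and corrective saccades follow.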

There is obviously quite a bit of detailed information that I haven't covered here. Of the references cited, I would start with section 3 ("Saccades") of the Kowler article, then move on to the Itti & Koch article for more concrete details on their specific model.