Multiple object tracking (MOT) is the ability to individuate moving objects based solely on their spatiotemporal histories. We asked whether MOT relies on a scene-based (allocentric) or an image-based (egocentric) representation.

Observers viewed 16 objects moving within a depicted 3D wireframe box. On each trial, 2, 4, or 6 objects were briefly tagged as the ‘target’ class. All objects then underwent 10 s of random motion (at 1 or 6 deg/s) before stopping. A single object was then tagged, and the observer judged whether it was a target or a non-target.

Preliminary experiments established that MOT accuracy was impaired by increases both in the size of the target class and in the speed of object motion. Next, we manipulated the motion pattern of the 3D box itself: in addition to varying the speed of the objects relative to the center of the box (object motion), we varied the motion of the box as a whole (scene motion). Whereas variations in object motion strongly influenced accuracy, variations in scene motion had no measurable effect. This held whether the scene underwent translation, zoom, rotation, or even a combination of all three (‘combined motion’).

To tax the ability to use a scene-based representation, we projected the ‘combined motion’ condition onto an obliquely viewed surface. The affine stretching of the projected image produced large changes in the retinal motions of the objects and box, yet their apparent motions remained consistent with an orthogonal view. Nonetheless, MOT accuracy was unaffected. Accuracy was reduced only when we projected the ‘combined motion’ condition onto a convex corner formed by the junction of two surfaces, the same condition under which pictorial shape constancy is no longer possible.

These results imply that MOT relies on a scene-based representation: performance is determined by the motion of objects relative to the larger scene, not by their motion relative to egocentric landmarks such as retinal location.