Research: Motion History Images

A new view-based template approach to the representation of action is presented. The work is motivated by the observation that a human observer can easily and instantly recognize action in extremely low-resolution imagery with no strong features or information about the three-dimensional structure of the scene. Our underlying representations for action are view-based template descriptions of the coarse image motion. Using these descriptions, we propose an appearance-based recognition strategy embedded within a hypothesis-and-test paradigm.

A binary motion energy image (MEI) is first computed to act as an index into the action library. The MEI coarsely describes the spatial distribution of motion energy for a given view of a given action. Any stored MEIs that plausibly match the unknown input MEI are then tested for coarse motion-history agreement with a known motion model of the action.

A motion history image (MHI) is the basis of that representation. The MHI is a static image template where pixel intensity is a function of the recency of motion in a sequence. Recognition is accomplished in a feature-based statistical framework.

The motion template technology has been used to recognize human movements within interactive environments such as Virtual PAT and The KidsRoom.

Motivation

The motivation for the approach presented in this research can be demonstrated in a single video sequence (see the blurred action sequence to the left). The video is a tremendously blurred sequence (in this case an up-sampling from images of resolution 15x20 pixels) of a human performing a simple, yet readily recognizable, activity. When shown this video, the vast majority of a room full of spectators could identify the action in less than one second from the start of the sequence. What should be quite apparent is that most of the individual frames contain no discernible image of a human being. Even if a system knew that the images were of a person, no particular pose could be reasonably assigned due to the lack of features present in the imagery.

When viewing the motion in a blurred sequence, two distinct patterns are apparent. The first is the spatial region in which the motion is occurring. This pattern is defined by the area of pixels where something is changing, largely independent of how it is moving. The second pattern is how the motion itself is behaving within these regions (e.g. an expanding or rotating field in a particular location). We developed our methods to exploit these notions of where and how, believing that these observations capture significant motion properties of actions that can be used for recognition.

Motion Energy Image (MEI)

Given a rich vocabulary of motions that are recognizable, an exhaustive matching search is not feasible, especially if real-time performance is desired. In keeping with the hypothesis-and-test paradigm, the first step is to construct an initial index into the known motion library. Calculating the index requires a data-driven, bottom-up computation that can suggest a small number of plausible motions to test further. We develop a representation of the spatial distribution of motion (the where), which is independent of the form of motion (the how), to serve as our initial index.
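The indexing step can be illustrated with a minimal sketch, assuming each library entry stores one binary MEI per action and view. The overlap score (intersection over union), the threshold `min_overlap`, and the function names are placeholders chosen for illustration; they are not the matching measure used in this work.

```python
import numpy as np

def mei_overlap(a, b):
    """Intersection over union of two binary masks (an assumed similarity score)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as identical
    return float(np.logical_and(a, b).sum() / union)

def shortlist(input_mei, library, min_overlap=0.5):
    """Return the labels of stored MEIs that plausibly match the input MEI.

    `library` is a hypothetical dict mapping an action/view label to a binary MEI.
    Only the shortlisted entries would go on to the more detailed test.
    """
    return [label for label, stored in library.items()
            if mei_overlap(input_mei, stored) >= min_overlap]
```

The point of the sketch is the two-stage structure: a cheap, data-driven comparison prunes the library so that the more detailed motion-history test only runs on a few candidates.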

Consider the example of someone sitting, as shown in the figure below. The top row contains key frames from a sitting sequence. The bottom row displays a cumulative binary motion energy image (MEI) sequence corresponding to the frames above. The MEIs highlight regions in the image where any form of motion was present. The summation of the square of consecutive image differences often provides a robust spatial motion-distribution signal. Image differencing also permits real-time acquisition of the MEIs. As expected, the MEI sequence sweeps out a particular (and perhaps distinctive) region of the image. Our claim is that the shape of the region can be used to suggest both the action occurring and the viewing condition (angle).
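A cumulative MEI sequence of the kind described above can be sketched as follows, assuming grayscale frames stacked in a NumPy array. The function name and the `energy_threshold` parameter are illustrative assumptions; the source does not specify how the summed energy is binarized.

```python
import numpy as np

def cumulative_meis(frames, energy_threshold):
    """Return one binary MEI per frame difference, accumulated over time.

    frames: array-like of shape (T, H, W) holding grayscale images.
    energy_threshold: assumed cutoff on the summed squared differences.
    """
    frames = np.asarray(frames, dtype=np.float64)  # avoid uint8 wrap-around
    diffs = np.diff(frames, axis=0)                # consecutive image differences
    energy = np.cumsum(diffs ** 2, axis=0)         # running sum of squared differences
    return energy > energy_threshold               # shape (T - 1, H, W), boolean
```

Because the energy is a running sum, each MEI in the sequence contains the previous one, which matches the way the region is swept out as the action proceeds.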

Motion History Image (MHI)

Consider the picture shown below left. This image captures the essence of the underlying motion pattern of someone sitting (sweeping down and back) superimposed on the corresponding MEI silhouette. Here, both where the motion is happening (the MEI silhouette) and how it is occurring (the arrows) are present in one compact template representation. This single image appears to contain the necessary information for determining how a person has moved during the action. In our approach, we collapse the temporal motion information into a single image template where intensity is a function of recency of motion. The resultant image yields a description similar to the "arrow" picture.

To represent how motion is moving, we developed a motion history image (MHI). In an MHI, pixel intensity is a function of the motion history at that location, where brighter values correspond to more recent motion. We currently use a simple replacement and linear decay operator using the binary image difference frames. Examples of MHIs for three actions (sit-down, arms-raise, crouch-down) are presented in the figure below right. Notice that the final motion locations appear brighter in the MHIs.
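One common form of a replacement and linear decay operator can be sketched as follows: a pixel is set to a maximum value `tau` wherever the current binary difference frame shows motion, and otherwise decays by one step per frame down to zero. The decay step of one and the names `update_mhi`, `motion_history_image`, and `tau` are assumptions made for this sketch.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau):
    """Apply one replacement-and-decay step to an MHI."""
    motion_mask = np.asarray(motion_mask, dtype=bool)
    decayed = np.maximum(mhi - 1, 0)            # linear decay, clamped at zero
    return np.where(motion_mask, tau, decayed)  # replacement where motion occurs

def motion_history_image(motion_masks, tau):
    """Fold a sequence of binary difference frames into a single MHI."""
    mhi = np.zeros(np.asarray(motion_masks[0]).shape, dtype=np.int32)
    for mask in motion_masks:
        mhi = update_mhi(mhi, mask, tau)
    return mhi
```

For example, motion that moves one pixel to the right per frame across three pixels with `tau = 3` leaves intensities 1, 2, 3 from left to right, so the most recent location is the brightest.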

Recognition

Results show reasonable recognition with an MHI verification method that automatically performs temporal segmentation, is invariant to linear changes in speed, and runs in real time on a standard platform.

Here is a short demo of the current system using two cameras. The top two images show the camera input with motion bounding regions. Bounding boxes are used to account for the possibility of multiple (separate) people/objects. White boxes identify valid motion regions. The middle two images show the corresponding MHIs for the above frames. The "virtual" room at the bottom shows an avatar of me in particular poses when the system identifies any of the recognizable actions (sitting, waving, crouching).

Recent Work

To address previous problems related to global analysis and limited recognition with MHIs, we have developed a hierarchical extension to the original MHI framework to compute dense (local) motion flow directly from the MHI. A hierarchical partitioning of motions by speed in an MHI pyramid enables efficient calculation of image motions using fixed-size gradient operators. To characterize the resulting motion field, a polar histogram of motion orientations is described. The hierarchical MHI approach remains a computationally inexpensive method for analysis of human motions.
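The idea of reading motion direction from an MHI and summarizing it as a polar histogram can be sketched for a single pyramid level as follows. Since brighter pixels mark more recent motion, the intensity gradient points roughly in the direction of movement. This is a simplified illustration, not the hierarchical method itself: it omits the pyramid, and a fuller treatment would also discard the spurious gradients that arise at the boundary of the motion region. The function name and the bin count are assumptions.

```python
import numpy as np

def mhi_orientation_histogram(mhi, bins=8):
    """Normalized histogram of gradient orientations inside an MHI.

    Angles are in degrees in [0, 360), measured in image coordinates
    (x to the right, y downward).
    """
    mhi = np.asarray(mhi, dtype=np.float64)
    gy, gx = np.gradient(mhi)              # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    valid = (mhi > 0) & (magnitude > 0)    # keep pixels with motion and a defined direction
    angles = np.degrees(np.arctan2(gy[valid], gx[valid])) % 360.0
    hist, _ = np.histogram(angles, bins=bins, range=(0.0, 360.0))
    total = hist.sum()
    if total == 0:
        return np.zeros(bins, dtype=np.float64)
    return hist / total
```

An MHI that brightens steadily from left to right, as produced by rightward motion, places all of its weight in the first bin.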

Related Papers

Hierarchical Motion History Images for Recognizing Human Motion
J. Davis
To appear in IEEE Workshop on Detection and Recognition of Events in Video, Vancouver, Canada, July 8, 2001.