This paper presents a method to extract a part-based model of an observed scene from a video sequence. Independent motion is a strong cue that two points belong to different "rigid" entities. Conversely, things that move together throughout the whole video belong together and define a "rigid" object or part. Successfully tracked features give the trajectories of salient points in the scene. A triangulated graph connects the salient points and encodes their local neighborhood in the first frame. The length variation of each triangle edge over the sequence is used to label the edge as relevant (lying on an object) or separating (connecting different objects). A subsequent grouping process uses the motion of the triangles marked as relevant as a cue to identify the "rigid" parts of the foreground or the background. The choice of the motion-based grouping criterion depends on the type of motion: in the image plane or out of the image plane. The result is a hierarchical description (graph pyramid) of the scene, where each vertex in the top level of the pyramid represents a "rigid" part of the foreground or the background and encloses the salient features used to describe it. Promising experimental results show the potential of the approach.
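
The edge-labelling idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the max-minus-min definition of length variation, the threshold value, and the edge list are all assumptions made for illustration, and the grouping step is simplified to connected components over relevant edges rather than the triangle-motion criterion and graph pyramid described above.

```python
from math import hypot


def edge_length_variation(traj_a, traj_b):
    """Max minus min distance between two tracked points over all frames.

    Assumed definition of 'length variation'; the source does not specify one.
    """
    lengths = [hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in zip(traj_a, traj_b)]
    return max(lengths) - min(lengths)


def label_edges(trajectories, edges, threshold=1.0):
    """Label each edge 'relevant' (nearly constant length) or 'separating'.

    `threshold` is a hypothetical tolerance in pixels.
    """
    labels = {}
    for i, j in edges:
        variation = edge_length_variation(trajectories[i], trajectories[j])
        labels[(i, j)] = "relevant" if variation <= threshold else "separating"
    return labels


def group_points(num_points, labels):
    """Group points joined by relevant edges (union-find).

    Simplified stand-in for the motion-based grouping of triangles.
    """
    parent = list(range(num_points))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), label in labels.items():
        if label == "relevant":
            parent[find(i)] = find(j)

    groups = {}
    for p in range(num_points):
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())


# Toy data: points 0 and 1 stay still, points 2 and 3 translate together.
trajectories = [
    [(0, 0), (0, 0), (0, 0)],
    [(1, 0), (1, 0), (1, 0)],
    [(0, 5), (5, 5), (10, 5)],
    [(1, 5), (6, 5), (11, 5)],
]
# Hypothetical edge list standing in for the first-frame triangulation.
edges = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

labels = label_edges(trajectories, edges)
groups = group_points(len(trajectories), labels)
```

In this toy example the edges within each pair of points keep a constant length and are labelled relevant, while the edges between the static and the moving pair stretch over time and are labelled separating, so the points fall into two groups. A constant edge length in the image is only a reasonable rigidity cue for motion in the image plane, which is consistent with the abstract's remark that the grouping criterion depends on the type of motion.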