Grounding the Lexical Semantics of Verbs in Visual Perception using Force Dynamics and Event Logic

In this talk, I will present an implemented system, called Leonard, that classifies simple spatial motion events, such as "pick up" and "put down," from video input. Unlike previous systems that classify events based on their motion profile, Leonard uses changes in the state of force-dynamic relations, such as support, contact, and attachment, to distinguish between event types. Since force-dynamic relations are not visible, Leonard must construct interpretations of its visual input that are consistent with a physical theory of the world. Leonard models the physics of the world via kinematic stability analysis and performs model reconstruction via prioritized circumscription over this analysis. In this talk, I will present an overview of the entire system, along with the details of both the model reconstruction process and the subsequent event-logic inference algorithm that can infer occurrences of compound events from occurrences of primitive events. This inference algorithm uses a novel representation, called spanning intervals, to give a concise representation of the large interval sets that occur when representing liquid and semi-liquid events. I will illustrate how Leonard handles a variety of complex visual-input scenarios that cannot be handled by approaches based on motion profile, including extraneous objects in the field of view, sequential and simultaneous event occurrences, and non-occurrence of events. I will also present a live example illustrating the end-to-end performance of Leonard classifying an event from video input.
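To give a rough intuition for the spanning-interval idea, the following is a minimal Python sketch, not Leonard's actual representation (which also tracks open/closed endpoint bounds): a spanning interval denotes the whole set of intervals whose start and end points each fall within a given range, so that a liquid event holding over every subinterval of a span can be recorded as one object instead of a quadratically large explicit set. All names below are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpanningInterval:
    """The set of closed intervals [i, j] with lo_start <= i <= hi_start,
    lo_end <= j <= hi_end, and i <= j — a compact stand-in for
    enumerating every such interval explicitly."""
    lo_start: int
    hi_start: int
    lo_end: int
    hi_end: int

    def contains(self, i: int, j: int) -> bool:
        # [i, j] belongs to this set iff both endpoints fall inside
        # their ranges and the interval is well-formed (i <= j).
        return (self.lo_start <= i <= self.hi_start
                and self.lo_end <= j <= self.hi_end
                and i <= j)

    def enumerate(self):
        # Expand into the explicit (possibly large) set of intervals.
        return [(i, j)
                for i in range(self.lo_start, self.hi_start + 1)
                for j in range(self.lo_end, self.hi_end + 1)
                if i <= j]

# A liquid event holding throughout frames 3..7 also holds over every
# subinterval; a single spanning interval covers all 15 of them.
s = SpanningInterval(3, 7, 3, 7)
print(len(s.enumerate()))  # → 15
```

The payoff is that event-logic inference can combine such sets (intersection, Allen-style relations) by manipulating the endpoint ranges directly rather than the enumerated intervals.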