Visual cues for view-invariant human action recognition

Human action is a visually complex phenomenon. Visual representation, analysis and
recognition of human actions have become a key focus of research in computer vision, artificial
intelligence, robotics and other related scientific disciplines. Applications of
automated action recognition include, but are not limited to, intelligent health care monitoring,
smart homes, content-based video search, animation and entertainment, human-computer
interaction and intelligent video surveillance. All of these application areas
revolve around a fundamental question: Given a human subject doing something in the field
of sensory input, what is the person doing? If a machine can answer this question
correctly, it can greatly benefit the development and practical use of computer vision systems.
However, machine recognition of human action is a daunting task due to complex
motion dynamics, anthropometric variations, occlusion and strong dependence on camera
viewpoint. In this thesis, we exploit rich visual cues from human actions
and use them to propose solutions to human action recognition. The important
problem of invariance under viewpoint variations is taken as a case study. We collect
and explore these visual cues from geometrical relationships, spatio-temporal patterns and
features, frequency domain signal analysis and contextual associations of actions, and derive
action representations for machine recognition.
Actions are spatio-temporal patterns, and temporal order plays an important
role in their interpretation. We therefore explore the invariance of the temporal
order of actions during action execution and use it to devise a new view-invariant
action recognition approach. We apply an order constraint and feature fusion to local spatio-temporal
features. These features are the representation of choice for action recognition due to
their computational simplicity and their robustness to occlusion and minor viewpoint changes. We
introduce STOPs (spatio-temporal ordered packets), which combine the discriminative characteristics
of multiple features for better recognition performance. In addition, we introduce
a spatio-temporal ordering constraint that addresses the orderless nature of the
bag-of-features framework for action recognition.
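Why ordering matters for a bag-of-features representation can be illustrated with a small sketch. This is not the STOPs formulation itself, which is not specified here; it only assumes that local features have already been quantized into visual-word indices with timestamps, and the function names, vocabulary size and segment count are hypothetical choices made for illustration.

```python
import numpy as np

def bag_of_features(words, vocab_size):
    """Orderless histogram of visual-word indices (all timing is discarded)."""
    return np.bincount(np.asarray(words), minlength=vocab_size)

def temporally_ordered_bags(words, times, vocab_size, n_segments):
    """One histogram per temporal segment, concatenated in time order.

    Illustrative only: a simple way to retain coarse temporal order.
    """
    words = np.asarray(words)
    times = np.asarray(times, dtype=float)
    # Equal-width segment boundaries over the observed time span
    edges = np.linspace(times.min(), times.max(), n_segments + 1)
    # Map each feature to a segment index in [0, n_segments - 1]
    seg = np.clip(np.searchsorted(edges, times, side="right") - 1,
                  0, n_segments - 1)
    hists = [np.bincount(words[seg == s], minlength=vocab_size)
             for s in range(n_segments)]
    return np.concatenate(hists)

# Two toy sequences containing the same visual words in reversed order
times = np.array([0.0, 1.0, 2.0, 3.0])
words_a = np.array([0, 0, 1, 1])
words_b = np.array([1, 1, 0, 0])

orderless_a = bag_of_features(words_a, vocab_size=2)
orderless_b = bag_of_features(words_b, vocab_size=2)
ordered_a = temporally_ordered_bags(words_a, times, vocab_size=2, n_segments=2)
ordered_b = temporally_ordered_bags(words_b, times, vocab_size=2, n_segments=2)
```

In this toy case the two orderless histograms are identical, whereas the segment-wise histograms differ, which is the kind of ambiguity an ordering constraint is meant to remove.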
Furthermore, to deal with the limitations of feature-based approaches, we explore multiple
view geometry, which has helped solve various complex problems in computer vision.
We thoroughly study applications of the static and multi-body flow fundamental matrices in
the context of relating information across views. We introduce spatio-temporally consistent
dense optical flow to avoid explicit manual detection of human body landmark points and
explicit point correspondences. We employ a rank constraint to derive novel tracking-free and
training-free action similarity measures across viewpoint variations.
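As background for the geometric cues above, the following sketch shows the standard property of the static fundamental matrix F: corresponding points x1 and x2 in two views satisfy x2^T F x1 = 0, and F has rank 2. It is only an illustration under stated assumptions (synthetic noise-free points, identity camera intrinsics, a plain unnormalized 8-point estimate); it is not the multi-body flow formulation or the rank-based similarity measure of the thesis, and all names in it are hypothetical.

```python
import numpy as np

def estimate_fundamental(x1, x2):
    """Plain 8-point estimate of F from N >= 8 correspondences (N x 2 arrays)."""
    # Each correspondence yields one linear equation in the 9 entries of F
    A = np.column_stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0], x1[:, 1], np.ones(len(x1)),
    ])
    # The right singular vector of the smallest singular value spans the null space
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint by zeroing the smallest singular value
    u, s, vt = np.linalg.svd(F)
    s[2] = 0.0
    return u @ np.diag(s) @ vt

def epipolar_residuals(F, x1, x2):
    """Absolute values of x2_h^T F x1_h for each correspondence."""
    x1h = np.column_stack([x1, np.ones(len(x1))])
    x2h = np.column_stack([x2, np.ones(len(x2))])
    return np.abs(np.einsum("ni,ij,nj->n", x2h, F, x1h))

# Synthetic scene: 20 random 3D points in front of both cameras
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3)) + np.array([0.0, 0.0, 5.0])

# Second camera: rotation about the y-axis plus a translation (assumed values)
angle = 0.3
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])
X2 = X @ R.T + t

# Perspective projection with identity intrinsics
x1 = X[:, :2] / X[:, 2:]
x2 = X2[:, :2] / X2[:, 2:]

F = estimate_fundamental(x1, x2)
residuals = epipolar_residuals(F, x1, x2)
```

With real image coordinates the point sets would normally be normalized before the linear solve, and noise would make the residuals small rather than numerically zero.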
Next, we observe that, despite the considerable success of geometrical techniques,
the computational cost of dense optical flow calculation is a hindrance.
Therefore, we study frequency domain analysis of action sequences. It leads
to the derivation of spatio-temporal correlation filters that use frequency domain
filtering to give fast and efficient solutions to action recognition. However, these filters are
inherently view-dependent. To overcome this limitation, we explore view clustering,
which extends frequency domain techniques to achieve view-invariance.
Contextual information is another important cue for interpreting human actions, especially
when actions exhibit interactive relationships with their context. These contextual cues
become even more crucial when videos are captured in unfavorable conditions such as extreme
low-light nighttime scenarios. We therefore take night vision as a case study and present
contextual action recognition at nighttime. We find that context enhancement is imperative
for reliable action recognition in such a challenging multi-sensor environment, which leads
us to develop novel context enhancement techniques for night vision using multi-sensor
image fusion.
Extensive experimentation on well-known action datasets is performed and the results
are compared with existing action recognition approaches in the literature. The research
findings in this thesis encourage the exploitation of spatio-temporal visual cues for
deriving novel action recognition approaches and increasing their performance.