Action Recognition in Video

Description: A trainable system was built to recognize actions in videos.
The first layer is a Convolutional Gated Restricted Boltzmann Machine (CGRBM), trained
without supervision. It automatically learns features that primarily encode motion.
The second layer uses sparse coding to learn mid-level features in an unsupervised manner.
The resulting feature vectors are pooled over time with a max-pooling operation
and fed to a support vector machine (SVM). Excellent performance was obtained on the Hollywood-2
dataset. A similar system was built to recognize actions on the KTH dataset. It also
uses a CGRBM as the first layer, but the following layers form a 3D (spatio-temporal)
convolutional network.
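
The temporal max-pooling step described above can be illustrated with a minimal sketch. It is not the original implementation: the function name max_pool_over_time, the variable names, and the toy feature values are all assumptions made for illustration. The sketch only shows how per-time-step feature vectors of variable number could be collapsed into one fixed-length descriptor suitable as classifier input.

```python
from typing import List, Sequence


def max_pool_over_time(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (time x feature) sequence into a single vector.

    For each feature dimension, keep the maximum value observed across
    all time steps. The output length equals the feature dimension, so
    clips of different durations yield descriptors of the same size.
    """
    if not frame_features:
        raise ValueError("need at least one time step")
    dim = len(frame_features[0])
    if any(len(step) != dim for step in frame_features):
        raise ValueError("all time steps must share the same feature dimension")
    # zip(*rows) transposes time-major rows into per-feature columns.
    return [max(column) for column in zip(*frame_features)]


# Hypothetical mid-level codes for a clip with 3 time steps and 4 features.
clip_codes = [
    [0.0, 0.2, 0.0, 0.9],
    [0.5, 0.0, 0.0, 0.1],
    [0.1, 0.7, 0.0, 0.0],
]

video_descriptor = max_pool_over_time(clip_codes)
print(video_descriptor)  # [0.5, 0.7, 0.0, 0.9]
```

Because the maximum is taken independently per feature, the descriptor records whether each feature fired strongly at any point in the clip and discards when it fired. One such descriptor per video would then serve as a training example for the classifier.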