READING

Ji et al. propose 3D convolutional neural networks for action recognition. While the approach – applying regular 3D convolutions, in the architecture depicted in Figure 1, to 7 consecutive frames of size $60 \times 40$ – is quite simple, it is interesting that the paper was first published in 2010, two years before Krizhevsky's ground-breaking work on the ImageNet [] challenge and the revival of deep learning in computer vision. Still, the approach is quite limited: trained on inputs of size $7 \times 60 \times 40$ with 5 feature channels per frame, it is questionable whether they would have been able to scale their system to higher spatial or temporal resolution. As feature channels for each frame, they use gray scale, gradients in $x$ and $y$ direction, as well as optical flow in $x$ and $y$ direction. The number of parameters, $295{,}458$, is also considerably low compared to AlexNet [1] with roughly $60$ million parameters.
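To make the input geometry concrete, here is a minimal sketch of a single-channel 3D convolution over the $7 \times 60 \times 40$ input, assuming a depth-3, $7 \times 7$ spatial kernel (as in the paper's first 3D convolutional layer); the loop-based implementation and shapes are illustrative only:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution (valid padding, stride 1)."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out

# One of the 5 feature channels: 7 consecutive frames of 60 x 40 pixels.
frames = np.random.rand(7, 60, 40)
# 3D kernel: temporal depth 3, spatial size 7 x 7 (illustrative choice).
kernel = np.random.rand(3, 7, 7)
out = conv3d_valid(frames, kernel)
print(out.shape)  # (5, 54, 34)
```

The temporal dimension shrinks from 7 to 5 after one such convolution, which illustrates why the short 7-frame input leaves little room for stacking many 3D layers.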

Figure 1: The proposed architecture, where the input consists of 7 frames, each providing 5 different feature channels.