In this study, the authors investigate whether action recognition performance can be improved by exploiting the associated scene context. To this end, the authors model the scene as a mid-level layer that bridges action descriptors and action categories. This is achieved via a scene topic model, for which hybrid visual descriptors, including spatial–temporal action features and scene descriptors, are first extracted from a video sequence. The authors then learn a joint probability distribution between scene and action using a naive Bayes nearest neighbour algorithm, which is used to jointly infer the action categories online in combination with off-the-shelf action recognition algorithms. The authors demonstrate the advantages of their approach by comparing it with state-of-the-art approaches on several action recognition benchmarks.
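
As a rough illustration of the kind of inference described above, the following sketch shows a generic naive Bayes nearest neighbour (NBNN) scoring step and one possible way of weighting the scores of an existing action classifier by scene context. It is not the authors' scene topic model: the function names, the softmax conversion of distances to probabilities, and the product-style fusion rule are all assumptions made for illustration only.

```python
import numpy as np


def nbnn_distances(query, class_descriptors):
    """Generic NBNN score: for each class, sum over the query descriptors
    the squared distance to the nearest descriptor of that class.

    query: array of shape (n_query, dim)
    class_descriptors: list of arrays, one per class, each (n_c, dim)
    Returns an array of total distances, one per class (lower is better).
    """
    totals = []
    for descs in class_descriptors:
        diff = query[:, None, :] - descs[None, :, :]
        d2 = np.sum(diff ** 2, axis=2)        # (n_query, n_c)
        totals.append(d2.min(axis=1).sum())   # nearest neighbour per descriptor
    return np.array(totals)


def distances_to_posterior(totals):
    """Assumption: turn NBNN distances into a pseudo-posterior with a softmax."""
    logits = -(totals - totals.min())
    p = np.exp(logits)
    return p / p.sum()


def fuse_with_scene(action_scores, scene_posterior, action_given_scene):
    """Assumption: weight the scores of an off-the-shelf action classifier
    by the scene context, sum_s P(action | s) * P(s | video), then renormalise.

    action_scores: (n_actions,) scores from an existing action recogniser
    scene_posterior: (n_scenes,) P(scene | video)
    action_given_scene: (n_scenes, n_actions), rows are P(action | scene)
    """
    context = scene_posterior @ action_given_scene
    fused = action_scores * context
    return fused / fused.sum()
```

In this sketch, the scene posterior could come from `distances_to_posterior(nbnn_distances(scene_descriptors, scene_classes))`, while `action_given_scene` would be estimated from co-occurrence counts in the training data; both choices are hypothetical and stand in for whatever the full method specifies.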