Understanding and Describing Tennis Videos

Abstract

Our most advanced machines are like toddlers when it comes to sight. When shown a tennis video, a kid would most probably blabber words like 'tennis', 'racquet', 'ball', etc. Similar is the case with present-day state-of-the-art video understanding algorithms. In this work, we try to solve one such multimedia content analysis problem: how can machines go beyond object and action recognition and understand lawn tennis video content in a holistic manner? We propose a multi-faceted approach to understand the video content as a whole: (a) Low-level analysis: identify and isolate court regions and players; (b) Mid-level understanding: recognize players' actions and activities; (c) High-level annotations: generate a detailed summary of the event comprising information from the full game play.

Annotating visual content with text has attracted significant attention in recent years. While the focus has mostly been on images, of late a few methods have also been proposed for describing videos. The descriptions produced by such methods capture the video content at a certain level of semantics. However, richer and more meaningful descriptions may be required for such techniques to be useful in real-life applications. We make an attempt towards this goal by focusing on a domain-specific setting: lawn tennis videos. Given a video shot from a tennis match, we intend to predict detailed (commentary-like) descriptions rather than short captions. Rich descriptions are generated by leveraging a large corpus of human-created descriptions harvested from the Internet. We evaluate our method on a newly created tennis video data set comprising broadcast video recordings of matches from the London 2012 Olympics. Extensive analysis demonstrates that our approach addresses both the semantic correctness and the readability aspects involved in the task.

Given a test video, we predict a set of action/verb phrases individually for each frame using features computed from its neighborhood. The identified phrases, along with additional meta-data, are used to find the best matching description from the commentary corpus. We begin by identifying the two players on the tennis court. Regions obtained after isolating the playing court assist us in segmenting out candidate player regions through background subtraction using thresholding and connected component analysis. Each candidate foreground region thus obtained is represented using HOG descriptors, over which an SVM classifier is trained to discard non-player foreground regions. The candidate player regions are then used to recognize the players using CEDD descriptors and the Tanimoto distance. Verb phrases are recognized by extracting features from each frame of the input video using a sliding window. Since this typically results in multiple firings, non-maximal suppression (NMS) is applied.
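The suppression step can be sketched as a simple 1-D NMS over per-frame confidence scores. This is a minimal illustration, not the thesis's actual implementation; the `window` radius and the score values below are assumptions made for the example:

```python
import numpy as np

def nms_1d(scores, window=2):
    """1-D non-maximal suppression over per-frame confidence scores.

    A frame's score survives only if it is the maximum within a
    symmetric neighborhood of `window` frames on each side; all
    other (low-scored, nearby) responses are zeroed out.
    """
    scores = np.asarray(scores, dtype=float)
    kept = np.zeros_like(scores)
    for i in range(len(scores)):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        if scores[i] >= scores[lo:hi].max():
            kept[i] = scores[i]
    return kept

# Two nearby firings for the same phrase: only the locally
# maximal response (0.9) survives; the weaker 0.8 is suppressed.
responses = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1, 0.1]
print(nms_1d(responses, window=2))
```

The same idea generalizes to any per-phrase detector score track; here a single pass suffices because the window defines the locality directly in frame indices.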

This removes low-scored responses that lie in the neighborhood of responses with locally maximal confidence scores. Once we obtain potential phrases for all windows along with their scores, we remove the independence assumption and smooth the predictions using an energy minimization framework. For this, a Markov Random Field (MRF) based model is used, which captures dependencies among nearby phrases. We formulate the task of predicting the final description as an optimization problem: selecting, from the set of commentary sentences in the corpus, the sentence that covers the largest number of unique words in the obtained phrase set. We also employ the Latent Semantic Indexing (LSI) technique while matching predicted phrases with descriptions and demonstrate its effectiveness over naive lexical matching. The proposed pipeline is benchmarked against recent state-of-the-art methods. Caption generation based approaches achieve significantly lower scores owing to their generic nature. Compared to all the competing methods, our approach consistently provides better performance. We validate that in domain-specific settings, rich descriptions can be produced even with a small corpus.
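As a minimal illustration of the sentence-selection objective, the sketch below picks the corpus sentence with the largest unique-word overlap with the predicted phrase set. The phrases and commentary lines are made-up examples, and this shows only the naive lexical baseline; an LSI-based matcher would instead compare query and sentences in a low-rank latent space rather than on exact word identity:

```python
def select_description(phrases, corpus):
    """Return the corpus sentence covering the most unique query words.

    `phrases` is the predicted set of verb phrases; `corpus` is a list
    of commentary sentences (both hypothetical inputs, for illustration).
    """
    query_words = {w.lower() for p in phrases for w in p.split()}
    best, best_cover = None, -1
    for sentence in corpus:
        # Strip trailing punctuation so "net." still matches "net".
        words = {w.lower().strip('.,;!?') for w in sentence.split()}
        cover = len(query_words & words)
        if cover > best_cover:
            best, best_cover = sentence, cover
    return best

phrases = ["serves wide", "rushes to the net"]
corpus = [
    "Federer serves wide and rushes to the net.",
    "A long rally ends with a backhand error.",
]
print(select_description(phrases, corpus))
# -> "Federer serves wide and rushes to the net."
```

Treating selection (rather than free-form generation) as the final step is what lets a comparatively small commentary corpus still yield fluent, human-written output.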

The thesis introduces a method to understand and describe the contents of lawn tennis videos. Our approach illustrates the utility of simultaneously using vision, language and machine learning techniques in a domain-specific environment to produce human-like descriptions. The method extends directly to other sports and various other domain-specific scenarios. With deep learning based approaches becoming a de facto standard for modern machine learning tasks, we wish to explore them for the present task in future work. The flexibility and power of such architectures have made them outperform other methods on some very complex vision problems. Large-scale deployments combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have already surpassed comparable methods for real-time image summarization. As a proposed future extension, we intend to exploit the power of such combined architectures in the video-to-text regime and generate real-time commentaries for game videos.