
Wednesday, September 7, 2016

One thousand video clips of one hundred games. The games are clustered according to the t-SNE components computed on the output of a CNN trained to classify RTS games. You can interact with the interface here.

The figure above, generated in fast motion.

The figure above is from an interactive demo for this article that you can find here. I recommend that you have a look at it before you continue reading.

Introduction

A while ago, I started working on convolutional neural networks within the computer game domain. I was particularly interested in extending their success to video games and investigating whether they can be used to learn features about games, similar to what they do with images and videos in many other areas. In this post, I will explain what I have done so far and show some of the recent results.

Goal

There has been a lot of work recently on video classification, tagging and labelling. My interest lies in bringing these ideas to games. My hypothesis is that video game trailers and gameplay videos provide rich information about the games in terms of visual appearance and game mechanics that would allow CNNs to detect similarities along a number of dimensions by "watching" short video clips.

Gameplay 2M dataset

As you already know, CNNs are data hungry, so I started by collecting the data I need. I was looking for videos of gameplay classified according to a number of categories. The easiest way I found to collect the data was to prepare a list of game titles, download YouTube videos of gameplay from different channels and associate each game with a set of categories that I eventually got from Steam.

So, I started the process and began running experiments once I had data for 200 games ready. For each game, I downloaded 10 gameplay videos. Since those vary in length, I cropped a 5-minute segment from each of them. Then, from each segment, I randomly sampled 10 half-second clips. Finally, from each of these short clips I extracted 100 frames. If you do the calculation, you will see that I ended up with 100*10*10 = 10000 gameplay images per game, so the dataset I will be using for this post contains 2M gameplay images.
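
The sampling arithmetic above can be sketched as follows. This is only an illustration of the counts, not the script used to build the dataset: the helper name `sample_clip_starts`, the fixed seed and the choice of uniformly random start times are assumptions.

```python
import random

GAMES = 200              # games collected so far
VIDEOS_PER_GAME = 10     # gameplay videos downloaded per game
CLIPS_PER_SEGMENT = 10   # half-second clips sampled from each 5-minute segment
FRAMES_PER_CLIP = 100    # frames extracted per clip, as stated in the text

SEGMENT_SECONDS = 5 * 60
CLIP_SECONDS = 0.5


def sample_clip_starts(rng, n_clips=CLIPS_PER_SEGMENT):
    """Draw random start times (in seconds) for clips inside one segment.

    Uniform sampling is an assumption; the post only says "randomly sampled".
    """
    latest_start = SEGMENT_SECONDS - CLIP_SECONDS
    return sorted(rng.uniform(0.0, latest_start) for _ in range(n_clips))


images_per_game = VIDEOS_PER_GAME * CLIPS_PER_SEGMENT * FRAMES_PER_CLIP
total_images = GAMES * images_per_game

starts = sample_clip_starts(random.Random(0))
print(images_per_game, total_images)  # 10000 2000000
```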

As for the game classes, I queried Steam for the categories assigned to each game by the users. I ended up with a 24-D vector of categories including whether the game is action, single-player, real-time strategy, platformer, indie, first-person shooter, etc. Each game is assigned to one or more of these categories. To create one category vector for each game, I averaged the user-assigned values per category, applied a simple step function with a threshold of 0.5 and assigned the final vector to each game (more specifically, to each image).
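
The averaging and thresholding step could look roughly like the sketch below. It assumes each row is one user's 0/1 assignment over the categories and that a mean exactly at the threshold counts as positive; neither detail is stated above, and the votes are made up for illustration.

```python
def category_vector(votes, threshold=0.5):
    """Average 0/1 votes per category, then binarise with a step function."""
    n_categories = len(votes[0])
    means = [sum(v[i] for v in votes) / len(votes) for i in range(n_categories)]
    # Assumption: a mean equal to the threshold is treated as positive.
    return [1 if m >= threshold else 0 for m in means]


# Hypothetical votes from four users over three categories
# (RTS, action, single-player); the real vectors are 24-D.
votes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
label = category_vector(votes)
print(label)  # [1, 1, 0]
```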

Here are some short clips from some of the games I used for training and the categories they belong to according to Steam users:

Full Spectrum Warrior: RTS = 1, Action = 1, Single-player = 0

Empire Total War: RTS = 1, Action = 1, Single-player = 1

Team Fortress: RTS = 0, Action = 0, Single-player = 1

Method

Most recent work in deep learning relies on established state-of-the-art models and fine-tunes them on a new dataset. I follow this line of work, as training from scratch is very time- and resource-consuming. Some state-of-the-art CNNs are very good at extracting visual feature representations from raw pixel data. In my work, I use the convolutional layers of the VGG-16 model to extract generic descriptors from the gameplay images.

I train on static images of gameplay extracted from the videos (I believe adding temporal information will improve the results, but I wanted to start simple and build from there). I built classifiers for only three categories: RTS games, action games and single-player games, as those provided the most balanced data in terms of positive and negative examples. I will be running more experiments once I have more data.

To build the classifiers, I first pass all images through the convolutional layers of the popular VGG-16 model to extract the visual feature descriptors that I later use to train NN classifiers. Each classifier consists of the convolutional layers from VGG-16 followed by two dense layers of 512 nodes each. Finally, I use a sigmoid function that outputs the probability of an image belonging to a class.
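
To make the shape of the classifier head concrete, here is a minimal forward-pass sketch in plain Python. It covers only the part after feature extraction: a descriptor goes through two dense layers of 512 nodes and a sigmoid output. The ReLU activations, the random untrained weights and the 64-value descriptor are assumptions for illustration; the VGG-16 convolutional layers and the training loop are left out.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    # Assumption: the post does not name the hidden-layer activation.
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, illustrative only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_head(rng, n_features, hidden=512):
    """Two dense layers of `hidden` nodes followed by a single output unit."""
    return [
        init_layer(rng, n_features, hidden),
        init_layer(rng, hidden, hidden),
        init_layer(rng, hidden, 1),
    ]


def predict(head, descriptor):
    """Return the probability that the image belongs to the class."""
    (w1, b1), (w2, b2), (w3, b3) = head
    h = relu(dense(descriptor, w1, b1))
    h = relu(dense(h, w2, b2))
    return sigmoid(dense(h, w3, b3)[0])


rng = random.Random(0)
head = make_head(rng, n_features=64)             # 64 is a stand-in descriptor size
descriptor = [rng.random() for _ in range(64)]   # stand-in for VGG-16 features
p = predict(head, descriptor)
```

One such head would be built per category, since each category has its own binary classifier.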

I trained three binary classifiers to learn each category independently (I could as well have used other multilabel learning methods but this is what I use for now). I split the data into three sets for training (70%), validating (20%) and testing (10%).
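
A 70/20/10 split can be sketched as below. The function name `split_indices` and the fixed seed are hypothetical, and the sketch shuffles item indices without saying what an item is, since the post does not state whether the split was made per image, per clip or per game.

```python
import random


def split_indices(n_items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle item indices and cut them into train/validation/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_items)
    n_val = round(fractions[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 700 200 100
```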

VGG-16 architecture with two dense layers of 512 nodes each.

Analysis, how good are the classifiers?

The three classifiers performed remarkably well in terms of classification accuracy. I got accuracy of up to 85% when classifying action games at the image level, and the results for RTS and single-player games were lower, reaching 76% and 72%. I also calculated the accuracies in other settings where I average the performance per 0.5-sec clip, per 5-min clip and per game. In some cases, it seems that looking at multiple images does indeed increase the accuracy, while in others (when classifying action games) the model was just as accurate on individual images as it was on the whole game.
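
One way to compute accuracy at a coarser level is sketched below: average the per-image probabilities within each group (clip, segment or game), threshold the mean and compare it with the group's label. This aggregation rule is an assumption, since the post only says the performance was averaged, and the numbers are invented for illustration.

```python
from collections import defaultdict


def grouped_accuracy(probabilities, labels, groups, threshold=0.5):
    """Average probabilities per group, threshold, and score each group once."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    truth = {}
    for p, y, g in zip(probabilities, labels, groups):
        sums[g] += p
        counts[g] += 1
        truth[g] = y  # all images in a group share the game's label
    correct = sum(
        1 for g in sums
        if (sums[g] / counts[g] >= threshold) == bool(truth[g])
    )
    return correct / len(sums)


# Hypothetical outputs: two clips of an action game, one clip of a non-action game.
probs = [0.9, 0.4, 0.8, 0.3, 0.55, 0.7, 0.2, 0.1, 0.3]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0]
clips = ["a1", "a1", "a1", "a2", "a2", "a2", "b1", "b1", "b1"]

# Using each image as its own group gives plain image-level accuracy.
image_accuracy = grouped_accuracy(probs, labels, range(len(probs)))
clip_accuracy = grouped_accuracy(probs, labels, clips)
```

In this toy example two images are misclassified on their own, but both clips they belong to are classified correctly once their images are averaged.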

Following some inspiring work (here and here), I further looked at the distribution of the classes according to the first two t-SNE components (performed on the PCA results of the output of the first dense layer of the classifiers). I did this for a sample of the dataset (neither my machine nor t-SNE has enough power to process the whole dataset) and you can clearly see the classification boundary between positive and negative samples on the 5-min clips.

t-SNE visualisation of the distribution of 15000 half-second clips classified by the RTS classifier.

t-SNE visualisation of the distribution of 15000 half-second clips classified by the single-player classifier.

I also looked at the distribution of games, which I thought was particularly interesting because the network has no explicit information during training about which game the images come from (it only knows whether an image is from a particular class or not). If my generic image descriptors are powerful enough, I would expect images/clips of the same game to cluster together. So I regenerated the same figures as above, but this time the colour code I used was game titles, so that images or clips belonging to the same game are given the same colour.

Same figure as above but points are coloured by game title (RTS classifier).

Same figure as above but points are coloured by game title (Single-player classifier).

You can clearly see some clusters of clips belonging to the same game preserved quite well. This is a really interesting finding, as it seems that the models somehow learned an implicit representation of the games although they were not trained to recognise them.

This last finding meant that games with similar visual features according to a given category should also be projected close to each other. So this time, I visualised the distribution of 5-min clips from the RTS classifier while showing the titles of the games. Here is what the figure looks like with some zoom-ins.

Some zoom-ins from the t-SNE distribution of the output of the RTS classifiers.

Analysis, how different is the data?

Of course, some videos are more representative of a game than others, and therefore I expect to get variations in accuracy at the image and video levels. To give you an idea of how the accuracy changes per image, here are some of the results from the action-games classifier for seven games. The performance clearly differs among games, but there are also clear fluctuations within the same game. For some games, such as Hexen II and Team Fortress (numbers one and five in the figure), you can confidently tell by looking at the graph that they have a strong action element.

Accuracy per image by the action game classifier for seven games.

So, why do some images give high accuracies while others don't? What is it that the network is interested in? Since I'm using a pre-trained model for visual feature extraction, visualising the convolutional layers won't really help. I instead looked at the individual images with high and low accuracies for some games. Here is an example from the game Hexen II when the classifier is trained to see it as an action game.

Accuracy per image for the game Hexen II by the classifier of action games.

What I can tell for now (from these snapshots and many others I visualised) is that the amount of lighting matters quite a lot: the more light there is, the more action the classifier sees. A similar analysis of RTS games showed that panels such as those below, even when only partially shown, are what contribute the most to recognising games as RTS.

For some videos, the models are more confused. This happens a lot when the category being classified is a minor feature of the game and not one of its main characteristics. This is in fact the main reason I prefer to use a sigmoid function as the output of the classifiers. I can then interpret the output as a probability and say that a low probability translates to showing a small amount of a specific feature. This allows me to better understand the games and means I can define a similarity function on these vectors to find out which games are similar to each other and in what aspects, but more on that in the future.
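
As a hint of what such a similarity function might look like, the sketch below uses cosine similarity over per-category probability vectors. Cosine similarity is just one possible choice and is not something the post commits to; the three probability vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two probability vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical (RTS, action, single-player) probabilities for three games.
game_a = [0.9, 0.7, 0.2]
game_b = [0.8, 0.6, 0.3]
game_c = [0.1, 0.9, 0.8]

sim_ab = cosine_similarity(game_a, game_b)
sim_ac = cosine_similarity(game_a, game_c)
```

Here `game_a` comes out closer to `game_b` than to `game_c`, because the first two share a high RTS probability.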

Finally, some snapshots from the demo you saw at the top of the page. Here, I tried to visualise the five-minute clips according to their t-SNE dimensions. Since I only care about their clusters, and not their exact positions in the space, I calculated the distances between all of them and connected each node to its 10 nearest neighbours. To make the graph easier to understand, I also gave the nodes belonging to the same game the same colour. If you zoom in, you can see the titles of the games and which games are connected to each other. The figures below are from the results of the RTS classifier.
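
The neighbour graph described above can be sketched with a brute-force search over all pairs. Euclidean distance in the 2-D t-SNE space is an assumption, the coordinates are invented, and the example uses k=2 instead of 10 to keep it small.

```python
import math


def knn_edges(points, k=10):
    """Return directed edges (i, j) linking each point to its k nearest neighbours."""
    edges = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        distances = [
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        ]
        distances.sort()
        edges.extend((i, j) for _, j in distances[:k])
    return edges


# Hypothetical 2-D t-SNE coordinates: two well-separated groups of three clips.
coords = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
edges = knn_edges(coords, k=2)
```

With these coordinates every edge stays inside its own group, which is the kind of cluster structure the graph is meant to expose. The all-pairs search is quadratic in the number of nodes, so larger sets of clips would call for a spatial index.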

Now, this certainly doesn't yet allow me to draw conclusions on which games are similar and how, but I believe that with more data and classification along more dimensions, we can build a powerful tool for automatic content-based classification of games.