Training a video annotation system with Grand Theft Auto

One of the more interesting research projects underway at CCRi uses machine learning to automatically generate text descriptions of action happening in videos, complete with a color-coded playback slider showing where in the video the action takes place.

As with so many machine learning projects, the system starts by using neural networks to analyze input for which the work has already been done: in this case, videos that already have textual narration about their action. The system then stores information about the relationships between the videos and their accompanying descriptions so that it can use this relationship data (and recurrent neural networks) to generate descriptions of new videos.

That first part is not as easy as it sounds: where do you get videos that are annotated with textual descriptions of their action? YouTube descriptions are short and summarize entire videos; they rarely tell you much about events happening at specific points within a video. So, in order to create this training data, our researchers watched a lot of videos and manually typed out descriptions. This was laborious, time-consuming work, and they would rather have been playing video games.

So they did. For 20 years, successive releases of the popular video game Grand Theft Auto (GTA) have let its players interactively explore action-packed worlds of violence and crime. More recent releases, particularly GTA V, let people load community-developed libraries such as Script Hook V and Script Hook V .NET to call their own code and create scripts that can communicate with the RAGE game engine that provides GTA’s foundation.

This communication lets the scripts control the action and read information about what’s happening in the game. For example, a script can add new pedestrians (or, in GTA lingo, “peds”) to walk around or add cars to drive around, and it can load up a queue of actions for these peds and cars to execute.
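The idea of queuing actions for an entity can be modeled in a few lines of plain Python. This is only a sketch of the concept: the ScriptedEntity class, its method names, and the action strings are hypothetical and are not part of Script Hook V, Script Hook V .NET, or the game itself.

```python
from collections import deque


class ScriptedEntity:
    """Stand-in for a ped or car (hypothetical; not a Script Hook V .NET class)."""

    def __init__(self, name):
        self.name = name
        self.actions = deque()  # pending actions, oldest first
        self.history = []       # actions that have already been executed

    def enqueue(self, action):
        """Add an action to the end of this entity's queue."""
        self.actions.append(action)

    def tick(self):
        """Execute the next queued action, if any, and return it."""
        if not self.actions:
            return None
        action = self.actions.popleft()
        self.history.append(action)
        return action


ped = ScriptedEntity("ped_1")
for action in ["walk to corner", "wait to cross", "cross street"]:
    ped.enqueue(action)

first = ped.tick()  # the ped performs the oldest queued action first
```

A first-in, first-out queue keeps each entity's behavior in the order the script requested it, which makes the resulting action easy to describe step by step.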

The script can also read information about the people and vehicles in the game world. For instance, your script can tell GTA to create or identify a random entity and then read information about it, such as “average male with toe shoes” or “Mexican female with sports bag.”

In CCRi’s project, after creating people, the scripts then go on to find out the color of their clothes and other information about what they’re doing, as well as relevant information for vehicles. They then generate descriptions such as “a Female wearing a white shirt and dark pants is waiting to cross the street” or “a black sedan is parked.” These scripts also identify bounding boxes around vehicles and pedestrians so that data about their locations in each frame can be recorded.
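As a rough illustration of how attributes like these could become a sentence and a bounding box, here is a minimal Python sketch. It assumes the attributes have already been read from the game and that each entity's corner points have already been projected to screen pixels; the PedObservation class, its field names, and the sentence template are hypothetical, not CCRi's actual scripts.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PedObservation:
    """Hypothetical per-frame record for one pedestrian (field names are assumptions)."""
    gender: str
    shirt_color: str
    pants_color: str
    activity: str
    # Corner points of the entity in screen pixels, assumed already projected.
    screen_points: List[Tuple[float, float]]


def describe(ped: PedObservation) -> str:
    """Fill a fixed sentence template from the observed attributes."""
    return (
        f"a {ped.gender} wearing a {ped.shirt_color} shirt "
        f"and {ped.pants_color} pants is {ped.activity}"
    )


def bounding_box(points: List[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) enclosing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


ped = PedObservation(
    gender="Female",
    shirt_color="white",
    pants_color="dark",
    activity="waiting to cross the street",
    screen_points=[(412.0, 220.5), (455.0, 220.5), (412.0, 340.0), (455.0, 340.0)],
)
caption = describe(ped)
box = bounding_box(ped.screen_points)
```

A fixed template like this yields consistent, machine-generated captions, which is the kind of regularity that makes the output usable as training data.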

While it gathers this information, the program can display the extracted information alongside the entities as they move around.

While that’s fun to watch, the timestamped log of the action that also gets created is more useful. We can use this log to train the neural network, which learns correlations between text and images and then uses those correlations to generate descriptions of new videos.
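To make the idea of a timestamped log concrete, the sketch below groups log entries by frame so that each frame is paired with its captions. The log format is an assumption for illustration: one JSON object per line, with hypothetical field names "t", "frame", "text", and "box". The actual format used at CCRi is not described here.

```python
import json
from collections import defaultdict

# Hypothetical log lines; the format and field names are assumptions.
log_lines = [
    '{"t": 0.00, "frame": 0, "text": "a black sedan is parked", "box": [100, 200, 300, 320]}',
    '{"t": 0.00, "frame": 0, "text": "a Female wearing a white shirt and dark pants '
    'is waiting to cross the street", "box": [412, 220, 455, 340]}',
    '{"t": 0.04, "frame": 1, "text": "a black sedan is parked", "box": [100, 200, 300, 320]}',
]


def captions_by_frame(lines):
    """Group log entries by frame number so each frame maps to its list of captions."""
    grouped = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        grouped[entry["frame"]].append(entry["text"])
    return dict(grouped)


pairs = captions_by_frame(log_lines)
for frame, captions in sorted(pairs.items()):
    print(frame, captions)
```

Each (frame, captions) pair could then be matched with the corresponding video frame to form a training example.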

We use a lot of software tools for development and research at CCRi. An interactive video game full of shootouts, fistfights, and car accidents may seem like an unlikely tool for neural net-based video annotation, but it’s turned out to be great.