SkyFall

Project description

INTRODUCTION

SkyFall is a physics-based game in which users control an on-screen paddle simply by moving their hands in front of a 2D camera. Lightweight convolutional neural networks (CNNs) detect the users' hands, and the detected positions are then mapped to the controls of the game. The structure of the interaction supports multiple players (provided they fit within the camera's field of view). The system demonstrates how a well-trained, lightweight hand detection model (treated as a first-class object in the interaction design) can be integrated to robustly track player hands and enable “body as an input” interaction in real time (~30 fps).

This approach can be expanded beyond the capabilities shown in SkyFall (which is a very early prototype). First, it can be extended to integrate other types of models, such as lightweight speech, language generation, and generative models. Second, the approach benefits from continuous improvements (better data, better optimizations) in each of these models. Finally, it can be expanded to allow simultaneous integration of multiple pre-trained community models, e.g. a model trained to detect hands alongside a model trained to detect raccoons to implement a “raccoon catcher game on live video”.

SYSTEM ARCHITECTURE

This section describes the architecture and components that make up the SkyFall game: a model curation system and a consumer application.

Model Curation System

The model curation system is built in Python and relies on the TensorFlow and OpenCV libraries. It has the following modules.

· Video Stream Processor Module [Python, OpenCV]

A threaded process that captures video frames from the NVIDIA TX2 camera and deposits them in an image input queue.

· Model Loader and Broadcast Module [Python, TensorFlow]

Spins up multiple threads that load an interaction model into memory. Each thread fetches an image from the image input queue, runs the object detector model against the image and broadcasts the results (class tags and bounding boxes) over a websocket to all connected clients.

· Model Output Manager [Python]

Because each frame is processed independently by a different thread, the model output manager computes a history of objects seen across frames (e.g. whether the hand seen in frame t is the same as the hand seen in frame t-1). Currently this computation uses a naïve Euclidean distance approach.
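
The naïve Euclidean distance matching described above could look roughly like the following. This is a minimal sketch rather than the project's actual code: the class name NaiveTracker, the max_distance threshold, the (x_min, y_min, x_max, y_max) box layout and the use of normalized coordinates are all assumptions made for illustration.

```python
import math
from itertools import count


def box_center(box):
    """Return the (x, y) center of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


class NaiveTracker:
    """Hypothetical tracker: links each detection to the nearest previous one."""

    def __init__(self, max_distance=0.2):
        # Assumed threshold (in normalized image units) beyond which a
        # detection is treated as a new object rather than a known one.
        self.max_distance = max_distance
        self.tracks = {}        # track id -> center seen in the previous frame
        self._ids = count(1)    # source of fresh track ids

    def update(self, boxes):
        """Return one track id per box, reusing ids from the previous frame."""
        unmatched = dict(self.tracks)
        current = {}
        assigned = []
        for box in boxes:
            cx, cy = box_center(box)
            best_id, best_dist = None, self.max_distance
            for track_id, (px, py) in unmatched.items():
                dist = math.hypot(cx - px, cy - py)
                if dist <= best_dist:
                    best_id, best_dist = track_id, dist
            if best_id is None:
                best_id = next(self._ids)   # nothing close enough: new object
            else:
                del unmatched[best_id]      # each old track is matched once
            current[best_id] = (cx, cy)
            assigned.append(best_id)
        # Tracks that were not matched in this frame are dropped.
        self.tracks = current
        return assigned
```

With this sketch, a hand that moves slightly between two frames keeps its id, while a detection far from every previous one receives a new id. The greedy, per-frame matching is deliberately simple and shares the limitations of any naïve approach, such as losing an id when a hand is missed for a single frame.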

Trained AI Interaction Model

In this submission's use case (SkyFall), the interaction model is a hand detection model: a CNN trained on a dataset [3] of hands (4800 images). The model is trained using transfer learning via the TensorFlow Object Detection API and is initialized from a pre-trained model with the MobileNet architecture [6]. I have published a longer description [2] (and code) of how this model was trained.

SkyFall Game

System Architecture

SkyFall is a browser-based game served from a web application. The web application connects to a socket server on the Model Curation System. To launch the game, open the web application at the web server URL.

· The play mechanism for SkyFall is simple. Three types of balls fall from the top of the screen in random order: white balls (worth 2 points), green balls (worth 5 points) and red balls (worth -5 points). Players earn points by moving a paddle such that each ball bounces on the paddle.

· WebSocket Module [JavaScript]

· Connects to the Model Curation System websocket and listens for output from the trained AI model.

· Model to Controls Map [JavaScript]

· Model output is mapped to game controls or used to generate new content within the game. In the SkyFall example, the position of the players' hands (x-axis) is used to position the play paddle.
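
As an illustration of the hand-to-paddle mapping, the sketch below converts a hand bounding box into a paddle position. The game itself does this in JavaScript; Python is used here only to stay consistent with the model curation system. The function name hand_to_paddle_x, its parameters, the pixel-based box layout and the optional mirroring are assumptions, not details taken from the SkyFall code.

```python
def hand_to_paddle_x(box, frame_width, game_width, paddle_width, mirror=True):
    """Map a hand box (x_min, y_min, x_max, y_max) in camera pixels to the
    left edge of the paddle in game pixels.

    mirror=True flips the x-axis so that moving a hand to the player's right
    moves the paddle right on screen (an assumed convention for a camera
    that faces the player).
    """
    x_min, _, x_max, _ = box
    center = (x_min + x_max) / 2.0
    ratio = center / frame_width          # 0.0 (left edge) .. 1.0 (right edge)
    if mirror:
        ratio = 1.0 - ratio
    left = ratio * game_width - paddle_width / 2.0
    # Clamp so the paddle never leaves the playing field.
    return max(0.0, min(left, game_width - paddle_width))


# Example: a hand centered in a 640-pixel-wide frame puts a 100-pixel paddle
# in the middle of a 1280-pixel-wide game area.
paddle_left = hand_to_paddle_x((300, 100, 340, 200), 640, 1280, 100)
```

Only the x-coordinate of the box is used, matching the description above; clamping keeps the paddle on screen when a hand is detected near the edge of the frame.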

NVIDIA TX2 Setup

The TX2 used to test the application was flashed with JetPack 3.1, the latest version at the time. Next, TensorFlow was built from source on the TX2 and installed. To simplify implementation and testing, most development was done with TensorFlow on a Mac, with intermittent testing on the TX2.

Clocking the NVIDIA TX2 for maximum performance (jetson_clocks.sh) provided a significant speedup: hand detection throughput increased from 4 FPS to 21 FPS.
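
The write-up does not say how frames per second were measured. One simple way to obtain such a figure is to time a fixed number of detection calls, as in the sketch below. The helper name measure_fps and its parameters are hypothetical, and process_frame stands in for whatever function runs detection on a single frame.

```python
import time


def measure_fps(process_frame, num_frames=100, clock=time.perf_counter):
    """Call process_frame num_frames times and return the average FPS.

    clock is injectable so the helper can be checked without real timing.
    """
    start = clock()
    for _ in range(num_frames):
        process_frame()
    elapsed = clock() - start
    return num_frames / elapsed if elapsed > 0 else float("inf")


# The reported change from 4 FPS to 21 FPS corresponds to this speedup factor.
speedup = 21 / 4
```

Averaging over many frames smooths out per-frame variation; running the same measurement before and after clocking the board gives directly comparable numbers.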

IMPACT OF THIS WORK

Improved Interaction Design

By advancing the concept of lightweight AI models as first-class citizens in interaction design, this work demonstrates avenues for creating user experiences with improved immersion, engagement and enjoyment on resource-constrained devices.

A community of continuously optimized AI models

This work also represents a first step towards curating a community of lightweight, optimized AI models that specialize in particular interaction tasks. These may include models for tracking body parts, language processing, speech processing, etc. As a starting point, I have made the trained hand tracking model available as an open source TensorFlow checkpoint that other users can integrate into their own applications.

Demonstrating Capabilities on Edge Devices like NVIDIA TX2

This project gives a sense of the extensive processing capabilities of the TX2. Surprisingly, my tests showed a higher FPS on the TX2 than on a MacBook Pro (i7, 2.4 GHz, quad-core). This work can also serve as a starting point for deploying meaningful compact systems that leverage computer vision for multiple use cases. For example, by tracking hands in real time we can develop a compact system that tracks gestures and uses them to predict how expressive a presenter is as they navigate the stage during a presentation. Given that hands (and joints) are a critical part of dance expression, similar models could be trained to score a dance performance (rhythm) against a given musical piece.

NEXT STEPS/FUTURE WORK

Future work for this project includes improvements on software, interaction model accuracy and some implementation optimizations.

Software

Improved object identification across frames

Currently the algorithm used to identify objects across frames is a naïve algorithm based on Euclidean distance. Further work needs to be done to improve this algorithm and possibly integrate it with other available tracking algorithms.

Improved translation of hand positions to game controls

More work will be done to improve the mechanisms for mapping hand location to game controls. This will ensure smoother controls, better multiplayer experience (current multiplayer is slightly glitchy), etc.

Improved game mechanics, New Games/Interaction Use cases

More work will be done on the gameplay mechanics to improve engagement. Work will also be done to apply this approach to new games (e.g. platformers where the primary interactions can be hand movements) and new interaction use cases.

Model Accuracy

Better Training Dataset for hand tracking model

Qualitative evaluations suggest the hand tracking model performs well at tracking hands under multiple lighting and orientation conditions. However, it is still limited, and there is an opportunity to improve its accuracy by assembling a larger and more diverse dataset. This will be a critical part of future work, and the gains will be shared by others who use the hand tracking model.

Implementation Optimizations

Speed Optimization - Integrating TensorRT

Further work is needed to ensure this application takes advantage of optimizations within TensorRT to improve performance.

CONCLUSION

In this work, I have presented some ideas around integrating AI as a first-class citizen in designing interactions, along with a concrete example in the form of a simple physics-based game, SkyFall. This approach can be expanded beyond the capabilities shown in SkyFall. First, it can be extended to integrate other types of models, such as lightweight speech, language generation, and generative models. Second, the approach benefits from continuous improvements (better data, better optimizations) in each of these models. Finally, it can be expanded to allow simultaneous integration of multiple pre-trained community models, e.g. a model trained to detect hands alongside a model trained to detect raccoons to implement a “raccoon catcher game on live video”, a dance tracking game, etc.

[1] Note: Frames per second is low when the TX2 is not clocked to maximum speed.