Carnegie Mellon Invention Captures Social Motion

By Dian Schaffhauser

07/14/2017

A research breakthrough at Carnegie Mellon University's Artificial Intelligence initiative enables a computer to understand body poses and movements of multiple people in video in real time — including the signals indicated by their fingers. The result could be a giant leap forward in how computers capture even the subtlest social interactions for behavioral analysis, even when bodies block the full view.

The project, named OpenPose, is a code library that allows for "real-time, multi-person keypoint detection." As explained on the GitHub site where the code for the project lives, "OpenPose represents the first real-time system to jointly detect human body, hand and facial keypoints (in total 130 keypoints) on single images."

It was made possible with the use of the university's Panoptic Studio, which can view and record synchronized video streams of several people engaged in physical activities. None of them wears any special markers or trackers. Instead, the many 2D images are combined into a 3D visualization in which anatomical "landmarks" act as virtual tracking points on each individual in space. The studio, built a decade ago, is a geodesic sphere with a radius of almost 5.5 meters, large enough to hold a group of people who can interact with one another. The dome is outfitted with 480 cameras mounted on the inside surface, generating a data stream of about 29 Gbps, and five Microsoft Kinect IIs calibrated with the cameras.
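To give a rough sense of how several 2D views can pin down a single 3D landmark, here is a minimal Python sketch of two-view "midpoint" triangulation. It is an illustration only, not the Panoptic Studio's actual pipeline: the function name `triangulate_midpoint`, the camera positions and the ray directions are all hypothetical, and a real system would combine many calibrated cameras and account for detection noise.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _along(origin: Vec3, direction: Vec3, t: float) -> Vec3:
    """Point reached by travelling t units of `direction` from `origin`."""
    return (
        origin[0] + t * direction[0],
        origin[1] + t * direction[1],
        origin[2] + t * direction[2],
    )


def triangulate_midpoint(c1: Vec3, d1: Vec3, c2: Vec3, d2: Vec3) -> Vec3:
    """Estimate a 3D point from two viewing rays (hypothetical helper).

    c1, c2 are camera centers; d1, d2 are the directions in which each
    camera sees the same landmark. The rays rarely intersect exactly, so
    the midpoint of their closest approach is returned.
    """
    w = (c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2])
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    t1 = (b * e - c * d) / denom  # distance parameter along ray 1
    t2 = (a * e - b * d) / denom  # distance parameter along ray 2
    p1 = _along(c1, d1, t1)
    p2 = _along(c2, d2, t2)
    return ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2, (p1[2] + p2[2]) / 2)


# Two made-up cameras, 4 units apart, both looking at the same landmark.
landmark = triangulate_midpoint(
    (-2.0, 0.0, 0.0), (2.0, 1.0, 3.0),
    (2.0, 0.0, 0.0), (-2.0, 1.0, 3.0),
)
print(landmark)
```

In this toy setup the two rays meet at (0, 1, 3), so the sketch should recover that point. With hundreds of cameras, the same idea extends to finding the point that best agrees with all views at once.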

The team's approach was to "localize" body parts in a scene (arms, legs, faces, hands) and then associate those parts with individuals. Doing that for hands is hard because a camera won't see all parts of the hand in a single shot. And unlike faces and other body parts, which have been amply captured and tagged by part and position, hands have no large datasets of annotated images.
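The "localize, then associate" idea can be illustrated with a toy Python sketch. This is a simple nearest-neighbor stand-in under stated assumptions, not the method OpenPose itself uses: the function name `associate_parts`, the choice of necks as per-person anchors and the pixel coordinates are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def associate_parts(anchors: List[Point], parts: List[Point]) -> Dict[int, int]:
    """Greedily pair detected parts with per-person anchor points.

    Every (part, anchor) pair is ranked by 2D distance, and the closest
    pairs are accepted first. Each part and each anchor is used at most
    once. Returns a mapping {part_index: anchor_index}.
    """
    candidates = sorted(
        (math.dist(part, anchor), part_idx, anchor_idx)
        for part_idx, part in enumerate(parts)
        for anchor_idx, anchor in enumerate(anchors)
    )
    used_parts, used_anchors = set(), set()
    assignment: Dict[int, int] = {}
    for _, part_idx, anchor_idx in candidates:
        if part_idx in used_parts or anchor_idx in used_anchors:
            continue  # this part or person has already been matched
        assignment[part_idx] = anchor_idx
        used_parts.add(part_idx)
        used_anchors.add(anchor_idx)
    return assignment


# Made-up pixel coordinates: two people (neck positions) and two wrists.
necks = [(100.0, 50.0), (300.0, 60.0)]
wrists = [(310.0, 150.0), (95.0, 140.0)]
print(associate_parts(necks, wrists))
```

Here the first wrist lies closer to the second person's neck and the second wrist closer to the first person's, so each wrist is assigned accordingly. Plain distance breaks down when people overlap or reach across one another, which is part of what makes the association step difficult in crowded scenes.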

Using the studio with its multiple cameras, the researchers could have recorded 500 simultaneous views of a person's hand. However, because hands are small, "too small to be annotated by most of our cameras," according to Hanbyul Joo, a Ph.D. student in robotics, the project relied on just 31 high-definition cameras. Joo and another Ph.D. student then used their own hands to generate the thousands of views that fed the latest research.

According to Yaser Sheikh, associate professor of robotics and a member of the research team, the technology could open new methods for people and machines to interact with each other. For example, the ability to recognize hand poses offers the possibility of people interacting with computers in new and more natural ways, such as communicating with computers just by pointing at things. Robots could also be wired to "perceive" what the people around them are doing or are about to do, what kinds of moods they're in and whether they can be interrupted.

"We communicate almost as much with the movement of our bodies as we do with our voice," Sheikh said in an article about the project. "But computers are more or less blind to it."

Sheikh suggested several use cases, such as a self-driving car that could be "warned" about a pedestrian showing signs that he or she is about to step into the street, or a robot that could detect conditions such as depression.

The researchers are making their computer code and dataset for both multiperson and hand-pose estimation openly available to push others to come up with their own applications. Already 20 commercial groups have expressed interest in licensing the technology.

Sheikh and his fellow researchers will soon present their latest work at CVPR 2017, the Computer Vision and Pattern Recognition Conference, in Honolulu.

About the Author

Dian Schaffhauser is a senior contributing editor for 1105 Media's education publications THE Journal and Campus Technology. She can be reached at dian@dischaffhauser.com or on Twitter @schaffhauser.