Computer systems that predict actions would open up new possibilities ranging from robots that can better navigate human environments, to emergency response systems that predict falls, to virtual reality headsets that feed you suggestions for what to do in different situations.

Scientists at Massachusetts Institute of Technology (MIT) in the US have made an important new breakthrough in predictive vision, developing an algorithm that can anticipate interactions more accurately than ever before.

Trained on YouTube videos and popular TV shows, the system can predict whether two individuals will hug, kiss, shake hands or slap five.

In a second scenario, it could also anticipate what object is likely to appear in a video five seconds later.

While human greetings may seem like arbitrary actions to predict, the task served as a more easily controllable test case for the researchers to study.

"Humans automatically learn to anticipate actions through experience, which is what made us interested in trying to imbue computers with the same sort of common sense," said Carl Vondrick, PhD student at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL).

"We wanted to show that just by watching large amounts of video, computers can gain enough knowledge to consistently make predictions about their surroundings," said Vondrick.

Researchers created an algorithm that can predict "visual representations," which are basically freeze-frames showing different versions of what the scene might look like.

The algorithm employs techniques from deep learning, a field of artificial intelligence that uses systems called "neural networks" to teach computers to pore over massive amounts of data and find patterns on their own.

Each of the algorithm's networks predicts a representation, which is automatically classified as one of the four actions - in this case, a hug, handshake, high-five or kiss.

The system then merges those predicted actions into a single action that it uses as its final prediction.

For example, three networks might predict a kiss, while another might use the fact that another person has entered the frame as a rationale for predicting a hug instead.
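The merging step described above resembles a simple ensemble vote. As a minimal sketch, and assuming each network outputs a probability distribution over the four actions (the numbers below are illustrative, not from MIT's actual model), the idea can be expressed as:

```python
# Illustrative sketch of ensemble merging: average each network's
# probability distribution over the four actions, then pick the action
# with the highest mean probability. Values are made up for the example.

ACTIONS = ["hug", "handshake", "high-five", "kiss"]

def merge_predictions(per_network_probs):
    """Average the networks' distributions and return the winning action
    along with the averaged probabilities."""
    n = len(per_network_probs)
    mean_probs = [
        sum(probs[i] for probs in per_network_probs) / n
        for i in range(len(ACTIONS))
    ]
    best = max(range(len(ACTIONS)), key=lambda i: mean_probs[i])
    return ACTIONS[best], mean_probs

# Three networks lean towards "kiss"; a fourth, having noticed a new
# person entering the frame, favours "hug".
outputs = [
    [0.1, 0.1, 0.1, 0.7],   # network 1: kiss
    [0.2, 0.1, 0.1, 0.6],   # network 2: kiss
    [0.1, 0.2, 0.1, 0.6],   # network 3: kiss
    [0.7, 0.1, 0.1, 0.1],   # network 4: hug
]

action, probs = merge_predictions(outputs)
print(action)  # prints "kiss": the averaged vote outweighs the dissenter
```

Averaging rather than taking a strict majority lets a confident minority network shift the outcome when the others are uncertain, which is one common design choice for combining classifiers.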

After training the algorithm on 600 hours of unlabelled video, the team tested it on new videos showing both actions and objects.

When shown a video of people who are one second away from performing one of the four actions, the algorithm correctly predicted the action more than 43 per cent of the time, compared with existing algorithms, which managed only 36 per cent.

It is worth noting that even humans make mistakes on these tasks. For example, human subjects were only able to correctly predict the action 71 per cent of the time, researchers said.