Humans can easily observe events and anticipate what is likely to happen next, but this kind of predictive behavior has long been difficult for AI. Now, researchers at Google have proposed VideoBERT, a self-supervised system that learns to make such predictions from unlabeled videos.
"Speech tends to be temporally aligned with the visual signals, and can be extracted by using off-the-shelf automatic speech recognition (ASR) systems, and thus provides a natural source of self-supervision," wrote the Google researchers in a blog post.
VideoBERT builds on Google's BERT to learn the details of a video. Notably, BERT (Bidirectional Encoder Representations from Transformers) is the cutting-edge model Google uses for natural language understanding (NLU) applications.
Google combined image frames from the videos with sentence outputs from automatic speech recognition, converting the frames into visual tokens of 1.5-second duration. These visual tokens were then concatenated with the word tokens, and the VideoBERT model was trained to fill in tokens that had been masked out.
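The token setup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's actual pipeline: the special tokens, the visual cluster IDs, and the masking rate below are all hypothetical stand-ins.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[>]", "[SEP]")  # illustrative special tokens, not the real vocabulary

def build_sequence(word_tokens, visual_tokens):
    # VideoBERT-style input: word tokens from ASR followed by visual tokens,
    # joined into one sequence with special separators.
    return ["[CLS]"] + word_tokens + ["[>]"] + visual_tokens + ["[SEP]"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Cloze-style pretraining: randomly replace tokens with [MASK];
    # the model would be trained to predict the originals at those positions.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in SPECIAL and rng.random() < mask_prob:
            targets[i] = tok      # remember the original token as the label
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

words = ["cut", "the", "steak", "into", "pieces"]   # ASR word tokens
visual = ["v231", "v88", "v912"]  # hypothetical IDs for 1.5-second video clips
seq = build_sequence(words, visual)
masked_seq, targets = mask_tokens(seq, mask_prob=0.3)
```

Because word and visual tokens share one sequence, the same masked-token objective teaches the model both linguistic and visual-linguistic structure, which is the core of the self-supervised trick.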
The blog explains how the researchers trained VideoBERT on over one million instructional videos covering cooking, gardening, and vehicle repair. The researchers also verified VideoBERT's outputs to evaluate the model's accuracy.
"Our results demonstrate the power of the BERT model for learning visual-linguistic and visual representations from unlabeled videos. We find that our models are not only useful for zero-shot action classification and recipe generation, but the learned temporal representations also transfer well to various downstream tasks, such as action anticipation," concluded the researchers.
follow us for more updates @technbs
#google #ai #artificialintelligence #videobert #tech #project #training #testing #recognition #speech #validation #action #predict #predictiveprogramming #programming #coding #encoder #neuralnetworks #blog #speechrecognition