Artist and programmer Gene Kogan ran the Boston Dynamics video demonstrating their upgraded Atlas robot through the DenseCap captioning system, which tries to identify and label objects in each frame. The results are both impressive and at times wildly inaccurate: in the resulting video the system labels the robot as a variety of incorrect things, including a person skiing, a motorcycle, and a fire hydrant.

Captions are generated by DenseCap on individual video frames. The video is then assembled by a Python script that merges matching captions along sequences of consecutive frames using a set of (mostly greedy) heuristics. Presumably it would be possible to caption sequences of regions directly rather than rely on a naive merging algorithm, but I'm not sure how :)
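The greedy merging idea can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script: it assumes each frame's captions arrive as `(caption, box)` pairs with boxes in `(x, y, w, h)` form, and it extends a caption "track" across consecutive frames whenever the same caption text reappears with a sufficiently overlapping box.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2 = min(ax + aw, bx + bw)
    y2 = min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def merge_tracks(frames, min_iou=0.5):
    """Greedily merge per-frame captions into tracks.

    frames is a list (one entry per frame) of [(caption, box), ...].
    A detection extends an open track when its caption text matches
    and its box overlaps the track's last box by at least min_iou;
    otherwise it starts a new track. Returns (caption, last_box,
    start_frame, end_frame) tuples.
    """
    open_tracks = []  # tracks still alive in the previous frame
    finished = []
    for t, detections in enumerate(frames):
        next_open = []
        for caption, box in detections:
            for i, (cap, last_box, start, _) in enumerate(open_tracks):
                if cap == caption and iou(box, last_box) >= min_iou:
                    next_open.append((cap, box, start, t))  # extend track
                    open_tracks.pop(i)
                    break
            else:
                next_open.append((caption, box, t, t))  # start new track
        finished.extend(open_tracks)  # unmatched tracks end here
        open_tracks = next_open
    return finished + open_tracks
```

For example, a "robot" caption that persists with overlapping boxes for two frames becomes one track spanning frames 0–1, while a stray "a person skiing" detection in frame 2 becomes its own one-frame track:

```python
frames = [
    [("a robot", (0, 0, 10, 10))],
    [("a robot", (1, 0, 10, 10))],
    [("a person skiing", (50, 50, 10, 10))],
]
tracks = merge_tracks(frames)  # two tracks: one robot, one skiing
```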