Microsoft Research

Dept. of Computer Science, UC Irvine

Dept. of Computer Science, Dartmouth University

Abstract

AutoCaption is a system that helps smartphone users
generate captions for their photos. It operates by uploading the photo to
a cloud service where a number of parallel modules are applied to
recognize a variety of entities and relations. The outputs of the modules
are combined to generate a large set of candidate captions, which are
returned to the phone. The phone client includes a convenient user
interface that allows users to select their favorite caption and to reorder,
add, or delete words to obtain the grammatical style they prefer. The user
can also select from multiple candidates returned by the recognition
modules.

System architecture. (Left) The smartphone client captures
the photo and uploads it along with associated metadata to the cloud
service. (Middle) The cloud service runs a number of processing modules in
parallel. The outputs of the processing modules are passed through a
fusion step and then to the text generator. The generated captions are
then personalized. (Right) The client receives the captions and allows the
user to pick and edit them before sharing.
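
The cloud-side flow described in the caption (parallel modules, then fusion, then text generation) can be illustrated with a minimal sketch. This is not the AutoCaption implementation: the module names, their hard-coded outputs, the confidence threshold, and the caption templates are all hypothetical placeholders chosen for illustration, and the personalization step is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for recognition modules. Each returns a dict
# mapping a slot name to (label, confidence) candidates.
def detect_faces(photo, metadata):
    return {"person": [("Alice", 0.9), ("Bob", 0.4)]}

def detect_scene(photo, metadata):
    return {"scene": [("beach", 0.8), ("lake", 0.3)]}

def detect_place(photo, metadata):
    # Assumes the client uploads a "city" field with the photo metadata.
    return {"place": [(metadata.get("city", "somewhere"), 0.7)]}

MODULES = [detect_faces, detect_scene, detect_place]

def run_modules(photo, metadata):
    """Run every module in parallel and collect the outputs in order."""
    with ThreadPoolExecutor(max_workers=len(MODULES)) as pool:
        futures = [pool.submit(m, photo, metadata) for m in MODULES]
        return [f.result() for f in futures]

def fuse(outputs, min_conf=0.5):
    """Merge module outputs per slot, dropping low-confidence candidates."""
    merged = {}
    for out in outputs:
        for slot, candidates in out.items():
            merged.setdefault(slot, []).extend(candidates)
    fused = {}
    for slot, candidates in merged.items():
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        fused[slot] = [label for label, conf in ranked if conf >= min_conf]
    return fused

def generate_captions(fused):
    """Fill simple templates with the top candidate for each slot."""
    templates = [
        "{person} at the {scene} in {place}",
        "{person} in {place}",
    ]
    top = {slot: labels[0] for slot, labels in fused.items() if labels}
    captions = []
    for template in templates:
        try:
            captions.append(template.format(**top))
        except KeyError:
            continue  # skip templates whose slots were not recognized
    return captions

def caption_photo(photo, metadata):
    return generate_captions(fuse(run_modules(photo, metadata)))

if __name__ == "__main__":
    for caption in caption_photo(b"", {"city": "Seattle"}):
        print(caption)
```

In this sketch the fused result keeps the lower-ranked labels for each slot, which is one way the client could offer alternative candidates from the recognition modules for the user to pick.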