Automatic caption generation is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of an image. Practical applications include image indexing and retrieval, as well as assisting people with visual impairments by converting visual content into information that can be communicated via text-to-speech technology. One of the most critical limitations of current work, however, is our limited understanding of the evaluation problem: datasets and ground-truth labels are often biased and subjective, and automatic evaluation metrics provide only a shallow view of the similarity between ground-truth and generated captions.

The goal of this workshop is two-fold: (a) coalescing community effort around a new, challenging web-scale image-captioning dataset, and (b) formalizing a human evaluation protocol, which is expected to improve both evaluation reliability and progress on automatic quality estimation of generated captions (i.e., algorithmic caption quality assessment in the absence of ground-truth). The dataset, consisting of ~3.3 million image/caption pairs, is already publicly available. We will employ a protocol to accurately estimate the quality of the image captions generated by the challenge participants, using both automatic metrics and human evaluators. We will compose a short-list of the top-performing image-captioning submissions to our challenge (based on existing automatic evaluation metrics), and then perform additional human evaluations to identify the best performers among them.
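The released dataset is distributed as tab-separated (caption, image URL) pairs. A minimal sketch of parsing such a file is shown below; the exact column layout and file naming are assumptions, so check the official release notes before relying on them:

```python
import csv

def parse_pairs(lines):
    """Parse tab-separated (caption, image URL) rows.

    Assumes each line holds a caption in the first column and an
    image URL in the second, separated by a tab.
    """
    pairs = []
    for row in csv.reader(lines, delimiter="\t"):
        if row:
            pairs.append((row[0], row[1]))
    return pairs

def load_pairs(tsv_path):
    # Stream the dataset file line by line; at ~3.3M rows, avoid
    # reading the whole file into memory at once if possible.
    with open(tsv_path, newline="", encoding="utf-8") as f:
        return parse_pairs(f)
```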

Beyond a better understanding of the current state of the art, our evaluation will allow us to observe correlations and discrepancies between automatic and human evaluation metrics. This will provide additional support for the creation of new automatic evaluation metrics that better reflect human judgments. In addition, we plan to release the resulting human judgments on caption quality (for a subset of the test set containing images with appropriate license rights), with the goal of providing additional data for improving algorithmic methods for caption quality assessment in the absence of ground-truth captions (i.e., for arbitrary, non-caption-annotated images).

Conceptual Captions Challenge

Submissions will be in the form of a Docker image; precise instructions on how to create one can be found at this link. Each submission will be executed against the secret test set for this challenge, and is expected to interact with the data through a predefined API. We will evaluate submissions in two stages. In the first stage, standard automatic captioning metrics will be used to measure caption quality: CIDEr (Vedantam et al., 2015), SPICE (Anderson et al., 2016), and ROUGE-L (Lin and Och, 2004). Based on the first-stage ranking, the top 10 submissions will enter the second evaluation stage (using a public test set), which will be conducted by human evaluators. The final official ranking for this task will be based on the human evaluation results.
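To illustrate one of the first-stage metrics, the following is a minimal sketch of the ROUGE-L F-measure (Lin and Och, 2004), which scores a candidate caption against a reference via their longest common subsequence. This is an illustrative reimplementation with simple whitespace tokenization, not the official challenge scoring code; the `beta` default is an assumption:

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure between two captions.

    Precision and recall are the LCS length normalized by the
    candidate and reference lengths, respectively; beta > 1
    weights recall more heavily.
    """
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    prec = lcs / len(cand)
    rec = lcs / len(ref)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

Note that the official evaluation averages such scores over multiple reference captions per image; the sketch above shows only the single-reference case.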

The submission platform accepts two official submissions per participant (unlimited test submissions may be attempted to verify that the submission mechanics work as intended).