Project page

Prerequisites

The framework does not include a region proposal network (RPN) implementation. An RoI proposal database pre-extracted with the py-faster-rcnn framework is available for download.
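Conceptually, such a proposal database maps each image to a set of scored candidate boxes. The sketch below illustrates that kind of lookup; the `proposals` dict and its `[x1, y1, x2, y2, score]` row layout are assumptions for illustration, not the actual on-disk format of the downloadable database.

```python
# Hypothetical in-memory layout for pre-extracted RoI proposals:
# image_id -> list of [x1, y1, x2, y2, objectness_score] rows.
# The real database ships in a binary format; this only illustrates
# the kind of per-image lookup a consumer of the database performs.

def top_proposals(proposals, image_id, k=2):
    """Return the k highest-scoring proposals for an image."""
    boxes = proposals.get(image_id, [])
    return sorted(boxes, key=lambda b: b[4], reverse=True)[:k]

proposals = {
    1: [[0, 0, 50, 50, 0.9], [10, 10, 60, 60, 0.4], [5, 5, 40, 40, 0.7]],
}
```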

You need a CUDA-compatible GPU to run the framework. A CPU-compatible version will be released soon.

You need at least 320 GB of free space to store the processed VisualGenome image dataset. A training script that reads image files directly will be released in the future.
However, if you just want to test or visualize some sample predictions, you can download a subset of the processed dataset (mini-vg) by following the instructions or the "Quick Start" section. The subset takes about 4 GB of space.
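Given the large gap between the two downloads (~320 GB for the full processed dataset vs. ~4 GB for mini-vg), it can be worth checking free disk space before fetching. A minimal sketch using only Python's standard library (the helper name is illustrative, not part of the framework):

```python
import shutil

def has_free_space(path, required_gb):
    """Check whether path's filesystem has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3

# Full processed VisualGenome needs ~320 GB; mini-vg needs ~4 GB,
# e.g. has_free_space(".", 320) before fetching the full dataset.
```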

Dependencies

To get started with the framework, install the following dependencies:

Dataset

The scene graph dataset used in the paper is the VisualGenome dataset, although the framework can work with any scene graph dataset converted to the expected format. Please refer to the dataset README for further instructions on converting the VG dataset into that format or downloading pre-processed datasets.

The program saves a checkpoint to checkpoints/CHECKPOINT_DIRECTORY/ every 50,000 iterations. Training a full model on a desktop with an Intel i7 CPU, 64 GB of memory, and a TitanX graphics card takes around 20 hours. You may use TensorBoard to visualize the training process. By default, the TensorFlow log directory is set to checkpoints/CHECKPOINT_DIRECTORY/tf_logs/.
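The 50,000-iteration checkpoint cadence amounts to a simple modulus check inside the training loop. A sketch of that logic (function names are illustrative and not the framework's actual API):

```python
CHECKPOINT_EVERY = 50000  # interval described above

def should_checkpoint(iteration):
    """True on iterations where a checkpoint should be written."""
    return iteration > 0 and iteration % CHECKPOINT_EVERY == 0

def checkpoint_iterations(total_iters):
    """All iterations in [1, total_iters] that trigger a save."""
    return [i for i in range(1, total_iters + 1) if should_checkpoint(i)]
```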

Note that to reproduce the results presented in the paper, you have to evaluate the entire test set by setting the number of images to -1.
The evaluation process takes around 10 hours. Setting the evaluation mode to all evaluates the model on all three tasks, i.e., pred_cls, sg_cls, and sg_det.
You can also set the evaluation mode to an individual task.
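The two conventions above (mode all expanding to the three tasks, and -1 meaning the full test set) can be sketched in plain Python. The helper names are hypothetical; the real framework reads these settings as command-line options:

```python
ALL_TASKS = ["pred_cls", "sg_cls", "sg_det"]

def expand_eval_mode(mode):
    """'all' expands to every task; otherwise mode must name one task."""
    if mode == "all":
        return list(ALL_TASKS)
    if mode not in ALL_TASKS:
        raise ValueError("unknown evaluation mode: %s" % mode)
    return [mode]

def select_images(image_ids, num_images):
    """num_images == -1 means evaluate the entire test set."""
    ids = list(image_ids)
    return ids if num_images == -1 else ids[:num_images]
```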

Visualize a predicted scene graph

Follow these steps to visualize a scene graph predicted by the model:

The viz_cls mode assumes ground-truth bounding boxes and predicts object and relationship labels, which matches the setting of the sg_cls task.
The viz_det mode uses bounding boxes proposed by the region proposal network as object proposals, which matches the setting of the sg_det task.
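The only difference between the two visualization modes is where the boxes come from. A minimal sketch of that dispatch (names are illustrative, not the framework's actual code):

```python
def boxes_for_mode(mode, gt_boxes, rpn_proposals):
    """viz_cls uses ground-truth boxes (as in sg_cls);
    viz_det uses RPN proposals (as in sg_det)."""
    if mode == "viz_cls":
        return gt_boxes
    if mode == "viz_det":
        return rpn_proposals
    raise ValueError("unknown visualization mode: %s" % mode)
```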

Checkpoints

A TensorFlow checkpoint of the final model trained with 2 inference iterations: