Warehouse automation has attracted significant interest in recent years, perhaps most visibly through the Amazon Picking Challenge (APC). A fully autonomous pick-and-place system requires a robust vision system that reliably recognizes objects and their 6D poses. However, robust solutions remain elusive in warehouse settings due to cluttered environments, self-occlusion, sensor noise, and a large variety of objects. In this paper, we present a vision system that took 3rd place in the stowing task and 4th place in the picking task at APC 2016. Our approach leverages multi-view RGB-D data and data-driven, self-supervised learning to overcome these difficulties. More specifically, we first segment and label multiple views of a scene with a fully convolutional neural network, and then fit pre-scanned 3D object models to the resulting segmentation to obtain the 6D object pose. Training a deep neural network for segmentation typically requires a large amount of manually labeled training data. We propose a self-supervised method that generates a large labeled dataset without tedious manual segmentation and that can easily be scaled up to more object categories. We demonstrate that our system can reliably estimate the 6D pose of objects under a variety of scenarios.

Supplementary Video

Code

Datasets

Our paper references two datasets (both available for download):
• "Shelf & Tote" Benchmark Dataset for 6D Object Pose Estimation • Automatically Labeled Object Segmentation Training Dataset
C++/Matlab code used to load the data can be found in our Github repository here (see rgbd-utils).
Both datasets share the same file structure, and contain APC-flavored scenes of shelf bins and totes, captured using an Intel® RealSense™ F200 RGB-D camera. Each scene includes an entire tote or a single shelf bin, which can hold one or more APC objects in various orientations. Each scene is captured from multiple camera viewpoints: 18 for the tote and 15 for each shelf bin.
In terms of file structure, the RGB-D camera sequence for each scene is saved into a corresponding folder by its name (‘scene-0000’, ‘scene-0001’, etc.). The folder contents are as follows:
scene-XXXX

• camera.info.txt - a text file that holds information about the scene and RGB-D camera. This information includes the environment (‘shelf’ or ‘tote’, bin ID if applicable), a list of objects in the scene (labeled by APC object ID), 3x3 camera intrinsics (for both the color and depth sensors), 4x4 camera extrinsics (to align the depth sensor to the color sensor), and 4x4 camera poses (camera-to-world coordinates) for each viewpoint. All matrices are saved in homogeneous coordinates and in meters.

• frame-XXXXXX.depth.png - a 16-bit PNG depth image captured from the RealSense camera, aligned to its corresponding color image. Depth is saved in deci-millimeters (10⁻⁴ m). Invalid depth is set to 0. For visualization, the bits per pixel have been circularly shifted to the right by 3 bits. This frame is derived from its raw counterpart (saved in folder ‘raw’) and typically contains less information.

• raw/frame-XXXXXX.depth.png - a raw 16-bit PNG depth image captured from the RealSense camera, NOT aligned to its corresponding color image. Depth is saved in deci-millimeters (10⁻⁴ m). Invalid depth is set to 0. For visualization, the bits per pixel have been circularly shifted to the right by 3 bits.
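As a sketch of how the stored depth values might be decoded (assuming the decoding simply inverts the 3-bit circular shift described above; see rgbd-utils in our Github repository for the authoritative C++/Matlab loaders):

```python
import numpy as np

def decode_depth(stored):
    """Decode a uint16 array read from frame-XXXXXX.depth.png.

    Inverts the 3-bit circular right shift applied for visualization,
    then converts deci-millimeters (10^-4 m) to meters. A value of 0
    marks invalid depth.
    """
    stored = stored.astype(np.uint16)
    # A circular left shift by 3 bits undoes the circular right shift;
    # on uint16, the bits shifted out of the top wrap around via the OR.
    raw = (stored << np.uint16(3)) | (stored >> np.uint16(13))
    return raw.astype(np.float32) * 1e-4  # deci-millimeters -> meters
```

The PNG itself can be read with any 16-bit-preserving reader, e.g. `cv2.imread(path, cv2.IMREAD_UNCHANGED)`.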

office environment

calibration - a set of calibration scenes and pre-calibrated relative camera poses for the setup in the office environment. A calibration scene consists of an empty tote or shelf bin, overlaid with textured images rich with 2D features for Structure-from-Motion.

empty - a set of scenes for an empty (no object) setup in the office environment.

warehouse environment

competition - 25 scenes captured in the warehouse environment at the APC location in Germany, recorded during our final competition runs.

practice - 161 scenes captured in the warehouse environment at the APC location in Germany, recorded during our practice runs.

calibration - a set of calibration scenes and pre-calibrated relative camera poses for the setup in the warehouse environment. A calibration scene consists of an empty tote or shelf bin, overlaid with textured images rich with 2D features for Structure-from-Motion.

empty - a set of scenes for an empty (no object) setup in the warehouse environment.


Calibration Data

The camera poses of the RGB-D sequences in the dataset are retrieved from the robot’s millimeter-accurate localization software. However, small errors in camera-to-robot calibration and RGB-D camera intrinsics can accumulate into larger errors in the camera poses, which in turn degrade the quality of point clouds created with multi-view reconstruction. To minimize the impact of these errors, we employ a calibration procedure that re-estimates camera poses using Structure-from-Motion. Each subset (office, warehouse) of the benchmark dataset contains the scenes we used for calibration (saved into a folder called 'calibration'), as well as a pre-computed set of calibrated relative camera poses for the tote (cam.pose.txt) and for each bin (cam.poses.X.txt, by bin ID). See rgbd-utils/demo.m and rgbd-utils/loadCalib.m from our Github repository for more information on how to use the calibration data.
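To illustrate how the camera intrinsics and (calibrated) camera-to-world poses combine into a multi-view point cloud, here is a minimal back-projection sketch (function and variable names are our own, not part of the toolbox; see rgbd-utils/demo.m for the actual pipeline):

```python
import numpy as np

def depth_to_world(depth_m, K, cam2world):
    """Back-project a depth image into a world-frame point cloud.

    depth_m   : HxW float array of depths in meters (0 = invalid)
    K         : 3x3 color-camera intrinsics from camera.info.txt
    cam2world : 4x4 camera-to-world pose from camera.info.txt
    Returns an Nx3 array of world-frame points for the valid pixels.
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth_m > 0
    z = depth_m[valid]
    # Pinhole model: lift pixel (u, v) at depth z into camera coordinates.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # 4xN homogeneous
    pts_world = cam2world @ pts_cam                         # apply camera pose
    return pts_world[:3].T
```

Concatenating the outputs over all viewpoints of a scene (18 for the tote, 15 for a shelf bin) yields the fused multi-view point cloud.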

Ground Truth 6D Object Pose Labels

Information about the dataset and its ground truth labels is provided as Matlab .mat files with the following variables:
scenes.mat - (scenes) 1x452 cell array of file paths to all scene directories in the dataset.
labels.mat - (labels) 2087x1 cell array of ground truth object labels, each saved as a struct with the following properties:

Notes on object symmetry labels: each value describes the object model's geometric symmetry when rotated around one of its axes. A value of 0 indicates no geometric symmetry around that axis. A value of 90 indicates that the object model is geometrically similar when rotated 90 degrees around that axis; likewise, a value of 180 indicates that it is geometrically similar when rotated 180 degrees. A value of 360 indicates that the object model is radially symmetric around that axis. The pre-scanned object models for all 39 APC objects can be downloaded from our Github repository here (see ros-packages/catkin_ws/src/pose_estimation/src/models/objects).
Code used to evaluate against the ground truth labels can be found in our Github repository here (see evaluation).
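To make the symmetry convention concrete, here is a hypothetical sketch of a symmetry-aware rotation error. For illustration it assumes the object is symmetric about its local z-axis; the actual evaluation code in the repository handles the general per-axis labels and may differ in detail:

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix about the z-axis by `deg` degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def geodesic_deg(Ra, Rb):
    """Angular distance between two rotation matrices, in degrees."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def symmetry_aware_error(R_pred, R_gt, sym_deg):
    """Smallest rotation error over all symmetry-equivalent ground truth
    poses, for an object symmetric about its local z-axis.

    sym_deg: 0 (no symmetry), 90, 180, or 360 (radially symmetric),
    following the label convention above.
    """
    if sym_deg == 0:
        return geodesic_deg(R_pred, R_gt)
    if sym_deg == 360:
        # Radially symmetric: only the symmetry-axis direction matters.
        cos = np.clip(np.dot(R_pred[:, 2], R_gt[:, 2]), -1.0, 1.0)
        return np.degrees(np.arccos(cos))
    # Discrete symmetry: try every equivalent rotation of the ground truth.
    angles = np.arange(0, 360, sym_deg)
    return min(geodesic_deg(R_pred, R_gt @ rot_z(a)) for a in angles)
```

For example, a prediction rotated 180 degrees about the symmetry axis of a 180-degree-symmetric object incurs zero error, as intended.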

Object Segmentation Training Dataset

Full Training Dataset Download: training.zip (131.4 GB)
If you just want to try out a small portion of the data, you can download the sampler below:
Sampler Download: training-sample.zip (93.5 MB)
The object segmentation training dataset contains 136,575 RGB-D images of single objects (from the APC) in the shelf and tote. There are a total of 8,181 unique poses of 39 objects seen from various camera viewpoints. All images are labeled with binary foreground object masks, which were automatically generated to train the self-supervised deep models for 2D object segmentation. Details of the automatic labeling algorithm can be found in the paper. The training dataset also contains HHA maps (Gupta et al.), pre-computed from the depth images.
Each scene contains, in addition to the files described above:
scene-XXXX

• HHA/frame-XXXXXX.HHA.png - a 24-bit PNG of HHA maps, which encode each aligned depth image into three channels per pixel: horizontal disparity, height above ground, and the angle between the surface normal and the inferred gravity direction (Gupta et al.). All channels are linearly scaled to the 0–255 range.