READING

Li discusses a simple 3D fully convolutional network for 3D object detections on KITTI []. The approach is considerably simple and the used architecture is summarized in Figure 1. A fully convolutional network naturally extends to the 3D domain, operating on an occupancy grid where each voxel is either occupied (equals $1$) or not. The output of the network is an objectness score per pixel (although the output is subsampled by $\frac{1}{4}^3$ compared to the input) and a bounding box per pixel. At testing time, the bounding boxes of positive object pixels are clustered to get a prediction (which corresponds to implicit non-maximum suppression). The bounding boxes are encoded as the corresponding corners in 3D. The model is evaluated on the KITTI dataset; without discussing the numbers, Figure 2 shows qualitative results.

Figure 1 (click to enlarge): Illustration of the simple network used for object detection. Unfortunately, details on the number of channels and filter sizes are missing.