Convolutional Neural Pyramid for Image Processing

Abstract

We propose a principled convolutional neural pyramid (CNP) framework for general low-level vision and image processing tasks. It is based on the essential finding that many applications require large receptive fields for structure understanding. However, corresponding neural networks for regression either stack many layers or apply large kernels to achieve such fields, which is computationally very costly. Our pyramid structure can greatly enlarge the receptive field without sacrificing computational efficiency. Extra benefits include adaptive network depth and progressive upsampling for quasi-realtime testing on VGA-size input. Our method benefits a broad set of applications, such as depth/RGB image restoration, completion, noise/artifact removal, edge refinement, image filtering, image enhancement and colorization.

Compared to very successful classification and detection, low-level-vision neural networks encounter quite different difficulties when they are adopted as regression tools.

Figure 1: Receptive field size vs. running time of current low-level-vision CNNs, including SRCNN [6], Filter-CNN [46], VDSR [18], Shepard-CNN [33] and Deconv-CNN [45]. Our network achieves a very large receptive field efficiently, which enables many general learning applications. Running time is obtained during testing on an Nvidia Titan X graphics card with VGA-size color-image input.

Figure panels: (a) Depth/RGB Restoration, (b) Image Completion, (c) Artifact/Noise Removal, (d) Learning Image Filter.

Size of Receptive Field for Image Processing The first problem concerns the receptive field, which is the region in the input layer connected to an output neuron. A reasonably large receptive field can capture global information for inference. In VGG [40] and ResNet [14], a large receptive field is achieved mainly by stacking convolution and pooling layers. For low-level-vision CNNs [6], the pooling layers are commonly removed in order to regress a same-resolution output and preserve image details. To still obtain acceptable receptive fields with convolution layers alone, the two solutions are to use large kernels (7×7 or 9×9) [6] or to stack many layers [18].
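To make the trade-off concrete, the following minimal sketch computes the receptive field of a chain of layers with the standard recurrence (each layer adds its kernel extent minus one, scaled by the accumulated stride). The function name and the example layer lists are illustrative assumptions, not configurations taken from the cited networks.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a chain of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf = 1      # before any layer, one output pixel sees one input pixel
    jump = 1    # spacing, in input pixels, between adjacent feature positions
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Without pooling (stride 1), each 3x3 layer adds only 2 pixels.
print(receptive_field([(3, 1)] * 10))                              # 21
# A single 9x9 kernel adds 8 pixels at once, at a higher cost per layer.
print(receptive_field([(9, 1)]))                                   # 9
# One stride-2 step doubles the growth contributed by every later layer.
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]))   # 14
```

The last call hints at why downsampling is attractive: the same two 3×3 layers contribute twice as many pixels once they operate on a half-resolution map.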

We found that, without exception, both schemes make the system run slowly due to heavy computation and consume a lot of memory. To illustrate this, we show the relation between the size of the receptive field and the running time of current low-level-vision CNNs in Figure 1. The plot shows that most existing low-level-vision CNNs, such as SRCNN [6], Filter-CNN [46], VDSR [18], and Shepard-CNN [33], achieve receptive fields smaller than 41×41 (pixels). Their applications are accordingly limited to local-information regression tasks such as super-resolution, local filtering, and inpainting. On the other hand, Deconv-CNN [45] uses a 143×143 field size, but its computation cost is very high, as shown in Figure 1.

In contrast to these common small receptive fields, our important finding is that large or even whole-image fields are essential for many low-level vision tasks, because most of them, including image completion, restoration, colorization, matting, and global filtering, are based on global-information optimization.

In this paper, we address this critical and general issue, and design a new network structure to achieve very-large receptive fields without sacrificing much computation efficiency, as illustrated in Figure 1.

Information Fusion The second difficulty lies in multiscale information fusion. It is well known that most image content exhibits multiscale patterns, as shown in Figure 2. As discussed in [50], small-scale information such as edges, textures and corners is learned in early layers, and object-level knowledge comes from late ones in a deep neural network. This analysis also reveals that color and edge information vanishes in late hidden layers. For pixel labeling work, such as semantic segmentation [28] and edge detection [27], early feature maps were extracted and passed to late layers to improve pixel inference accuracy.

We address this issue differently for low-level-vision tasks that do not involve pooling. Although the goal of retaining early-stage low-level edge information seems to contradict large receptive field generation in a deep network, our solution shows that they can be accomplished simultaneously with a general network structure.

Our Approach and Contribution To address the aforementioned difficulties, we propose the convolutional neural pyramid (CNP), which can achieve quite large receptive fields without losing efficiency, as shown in Figure 1. It also enables multiscale information fusion for general image tasks. The convolutional neural pyramid structure is illustrated in Figure 3, where a cascade of features is learned in two streams. The first stream, across different pyramid levels, plays an important role in enlarging the receptive field. The second stream learns information in each pyramid level and finally merges it to produce the final result.

Our framework is very efficient in computation. On an Nvidia Titan X graphics card, 28 frames are processed per second for a QVGA-size input and 9 frames per second for a VGA-size input, for all image tasks.

Figure 3: Illustration of our convolutional neural pyramid. (a) shows the convolutional pyramid structure. (b) and (c) are the feature extraction and mapping components respectively. Conv(x,y) denotes the convolution operation, where x is the kernel size and y is the number of outputs.

We review recent progress of using CNNs for computational photography and low-level computer vision.

Inverse Problems CNNs were used to solve inverse problems such as image super-resolution/upsampling and deconvolution. Representative methods include SRCNN [6], very-deep CNNs [18], deeply-recursive CNNs [19], and FSRCNN [7]. The major scheme is to regress the inverse degradation function by stacking many CNN layers, such as convolution and ReLU. They are mostly end-to-end frameworks. Image super-resolution frameworks were extended to video/multi-frame super-resolution [25] and depth upsampling [16]. All these frameworks have relatively small receptive fields and are hard to extend to general image processing tasks.

Another important topic is image deconvolution, which needs image priors, such as gradient distribution, to constrain the solution. Xu [45] demonstrated that simple CNNs can well approximate the inverse kernel and achieved decent results. As shown in Figure 1, large kernels for convolution require heavy computation.

Representative CNNs for High-level Tasks Many CNNs have been proposed for high-level recognition tasks. Famous ones include AlexNet [23], VGG [40], GoogLeNet [41] and ResNet [15]. Based on these frameworks, many networks have been proposed to solve semantic image segmentation as pixel classification. They include FCN [28], DeepLab [3], SegNet [1], U-Net [34] and those of [49]. Compared with these image-to-label or image-to-label-map frameworks, our CNP aims at image processing directly by image-to-image regression.

To achieve large receptive fields, we design the convolutional neural pyramid network illustrated in Figure 3. It includes levels from 0 to N−1 as shown in Figure 3(a), where N is the number of levels. We denote these levels as L_i, where i ∈ {0, …, N−1}. Different-scale content is encoded in them. The feature extraction, mapping and reconstruction operations in each pyramid level are denoted as F_i, M_i and R_i respectively. The input to L_i is the feature extracted from L_{i−1} after downsampling, which is essential for enlarging the receptive field.

Specifically, the input to L_0 is the raw data, such as RGB channels. After feature extraction and mapping in each level, information is reconstructed from L_{N−1} to L_0, where the output of each L_i is upsampled and fused with the output from L_{i−1}. Thus scale-sensitive information is used together to reconstruct the final output.

Before going into the details of each component, we explain that our CNP is fundamentally different from, and superior to, the intuitive multiscale fusion strategy in its overall structure. The latter scheme processes the input at each scale by feature extraction, mapping and reconstruction separately, and sums these outputs into the final result. There are two major differences in our network structure.

Adaptive Network Depth in Levels First, in the original resolution, i.e., level 0 in Figure 3(a), only two convolution layers are performed for feature extraction. It is a shallow network that can well capture the edge and color information as previously discussed – its complexity is low.

Then in level 1, after downsampling to reduce feature map size, another feature extraction module is performed, which involves more convolution layers for deeper processing of the input image. It begins to extract more complicated patterns beyond edges due to its higher depth and accordingly larger receptive fields. The complexity is not much increased with the reduced feature map size.

When the network contains more levels, they have even smaller feature maps that are processed more deeply to form semantically meaningful information regarding objects and shapes. The receptive fields are further increased while the computation complexity is capped at a reasonably small value. This adaptive depth strategy is remarkably effective in many tasks. We will show experimental evaluation later.

Progressive Upsampling During the final upsampling phase (after mapping) in Figure 3(a), directly upsampling the output from level i to level 0 needs a large kernel since the upsampling ratio is 2^i, which makes learning difficult and possibly inaccurate. We address this issue by a progressive scheme from level N−1 to 0. In each phase, the output in level i is upsampled and then fused with the output in level i−1. Thus, the ratio is only 2, for which an efficient 3×3 upsampling kernel suffices. Further, information in level i−1 is upsampled with the guidance from level i since the upsampling kernel is learned between the two neighboring levels. This avoids possible edge and feature distortion when reconstructing the final result.
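The progressive scheme can be sketched on 1-D signals as follows. This is only an illustration of the order of operations: nearest-neighbour upsampling stands in for the learned upsampling kernel, fusion is taken to be an element-wise sum, and the function names and toy values are assumptions made for the example.

```python
def upsample2(signal):
    """Nearest-neighbour 2x upsampling of a 1-D list (stand-in for a learned kernel)."""
    out = []
    for value in signal:
        out.extend([value, value])
    return out


def progressive_collapse(level_outputs):
    """Fuse per-level outputs from the coarsest level (last) down to level 0 (first).

    level_outputs[i] must have half the length of level_outputs[i - 1].
    """
    fused = level_outputs[-1]
    for finer in reversed(level_outputs[:-1]):
        up = upsample2(fused)              # the ratio is always 2, never 2^i
        assert len(up) == len(finer)
        fused = [a + b for a, b in zip(finer, up)]   # element-wise sum
    return fused


levels = [
    [1, 1, 1, 1, 1, 1, 1, 1],   # level 0, full resolution
    [10, 10, 20, 20],           # level 1, 1/2 resolution
    [100, 200],                 # level 2, 1/4 resolution
]
print(progressive_collapse(levels))
# [111, 111, 111, 111, 221, 221, 221, 221]
```

The coarsest level is never upsampled by 4 in one step; it reaches full resolution through two ratio-2 steps, each followed by fusion with the next finer level.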

The empirical comparison with the simple multiscale fusion strategy will be provided after explaining all following components in our network.

Feature Extraction We apply convolution layers to extract image features. As shown in Figure 3(b), two convolution layers with PReLU rectification are employed for each feature extraction module. The low-level image features such as edges and corners can be extracted by one module [50] in level 0. Semantically more meaningful information is constructed with more extraction modules regarding our adaptive depth in other levels.

Each convolution layer outputs 56 features with a 3×3 kernel. For level 0, the feature output is a 56-channel map. Each other level takes as input the features extracted from the previous level after downsampling. Thus, 2∗(i+1) convolution layers are applied to feature extraction in level L_i, where i is the level index, to learn complicated features. This pyramid design effectively enlarges the receptive field without quickly increasing computation. It also nicely encodes different-scale image information.

Mapping The extracted features in each level L_i are transformed by the mapping M_i. As demonstrated in Figure 3(c), our mapping includes shrinking, nonlinear transform and expanding blocks, motivated by the work of [7]. The shrinking layer embeds the 56-channel feature map into 12 channels by 1×1 convolution. It reduces the feature dimension and benefits noise reduction.

The network is then stacked with nonlinear transform layers. Instead of applying large-kernel convolution [45] to achieve large receptive fields, we stack S convolution layers with kernel size 3×3 to reduce parameters, and empirically achieve comparable performance. Since the nonlinear transform is conducted in a low-dimensional feature space, computation efficiency is further improved. Here S ranges from 1 to 3. We will discuss its influence later.

Finally, an expanding layer is added to remap the 12-channel feature map into 56 channels by 1×1 convolution, reversing the shrinking operation. We use this layer to gather more information to reconstruct high-quality results.
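A quick parameter count illustrates why running the nonlinear transform in the 12-channel space is cheap. The sketch below uses the channel widths and kernel sizes stated above; it assumes every convolution has a bias term and ignores PReLU parameters, neither of which is specified in the text.

```python
def conv_params(kernel, in_ch, out_ch, bias=True):
    """Number of learnable weights in a kernel x kernel convolution layer."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


FEAT, SHRUNK = 56, 12   # channel widths stated in the text

shrink = conv_params(1, FEAT, SHRUNK)        # 1x1, 56 -> 12
transform = conv_params(3, SHRUNK, SHRUNK)   # 3x3, 12 -> 12 (one of the S layers)
expand = conv_params(1, SHRUNK, FEAT)        # 1x1, 12 -> 56
full_width = conv_params(3, FEAT, FEAT)      # 3x3, 56 -> 56, for comparison


def mapping_params(s):
    """Total parameters of shrink + S transform layers + expand."""
    return shrink + s * transform + expand


print(shrink, transform, expand)   # 684 1308 728
print(mapping_params(1))           # 2720
print(full_width)                  # 28280
```

Under these assumptions, a whole mapping block with S = 1 has roughly a tenth of the parameters of a single 3×3 convolution at the full 56-channel width.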

Reconstruction The reconstruction operation fuses information from two neighboring levels. For L_i and L_{i+1}, the output of L_{i+1} is upsampled and then fused with the output from L_i. The goal is to merge different-scale information progressively. We test two fusion operations: concatenating the two outputs and taking their element-wise sum. They achieve comparable performance in our experiments. We thus use the element-wise sum for slightly higher efficiency.

Down- and Up-sampling Down- and up-sampling are key components in pyramid construction and collapse. The downsampling operation shown in Figure 3(a) resizes extracted feature maps. For simplicity, we only consider resizing ratio 0.5 in our network. Two downsampling schemes are tested – max pooling and convolution with a 3×3 kernel with stride 2. The simple max pooling works better in experiments due to preservation of the max response. Note that in level 0, our network does not include any pooling layer to keep image details. We simply implement the upsampling operation as a deconvolution layer in Caffe [17].
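For reference, max pooling with resizing ratio 0.5 can be sketched as below. The text only states the ratio, so the 2×2 window with stride 2 is an assumption, as are the function name and the plain list-of-rows representation.

```python
def max_pool_2x2(feature_map):
    """Halve a 2-D map (list of equal-length rows) with 2x2 max pooling, stride 2.

    Assumes even height and width.
    """
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


fm = [
    [1, 5, 2, 0],
    [3, 4, 9, 1],
    [0, 0, 7, 8],
    [6, 2, 3, 3],
]
print(max_pool_2x2(fm))   # [[5, 9], [6, 8]]
```

Each output value is the strongest response in its window, which is the property the text credits for max pooling working better than strided convolution.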

Adjustment Since the reconstructed feature maps in L_0 have 56 channels, to adjust the difference between the feature map and the output, we use two convolution layers with kernel size 3×3 to generate the same number of channels as the final output, as shown in Figure 3(a). Before each convolution layer, a PReLU layer is applied.

Algorithm ? gives the overall procedure, where F_i(X) and M_i(X) are the outputs of X after operations F_i and M_i respectively, and R_i(X,X1) is the reconstruction result of X and X1 with operation R_i. The configuration of our network is provided in the supplementary file.
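Based on the component descriptions above, the data flow can be sketched as follows. This is a reading of the text rather than the authors' algorithm listing: every operation is passed in as a callable, and the names `cnp_forward`, `extract`, `mapping`, `fuse` and `adjust` are illustrative stand-ins for F_i, M_i, R_i and the adjustment layers.

```python
def cnp_forward(x, n_levels, extract, mapping, downsample, upsample, fuse, adjust):
    """Data flow of an N-level pyramid with all operations supplied as callables.

    extract[i] and mapping[i] play the roles of F_i and M_i; fuse plays R_i.
    """
    # Stream 1: walk down the pyramid, extracting features at each level.
    features = []
    current = x
    for i in range(n_levels):
        if i > 0:
            current = downsample(current)   # level 0 is never downsampled
        current = extract[i](current)
        features.append(current)

    # Stream 2: map the features of each level independently.
    mapped = [mapping[i](features[i]) for i in range(n_levels)]

    # Collapse: upsample the coarser result and fuse it with the next finer level.
    result = mapped[-1]
    for i in range(n_levels - 2, -1, -1):
        result = fuse(mapped[i], upsample(result))

    return adjust(result)


# Toy stand-ins that only track (spatial size, conv layers applied so far).
N = 3
extract = [lambda f: (f[0], f[1] + 2)] * N   # two conv layers per extraction module
mapping = [lambda f: f] * N
down = lambda f: (f[0] // 2, f[1])
up = lambda f: (f[0] * 2, f[1])


def fuse(fine, coarse):
    assert fine[0] == coarse[0]               # sizes must match before fusion
    return (fine[0], max(fine[1], coarse[1]))


print(cnp_forward((64, 0), N, extract, mapping, down, up, fuse, lambda f: f))
# (64, 6)
```

With the toy stand-ins, the output is back at the input size, and the deepest path has passed through 6 extraction layers, matching 2∗(i+1) for i = 2.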

The quite large receptive field, adaptive depth, and progressive fusion achieved by our CNP greatly improve performance for many tasks. We perform extensive evaluation and give the following analysis.

Receptive Field vs. Running Time In our N-level network, the information passing through level L_{N−1} has the largest receptive field. For this level, the images are processed by feature extraction and downsampling for N passes, which introduce a total of 2∗N feature extraction operations. The receptive field size increases exponentially with the number of levels, making the largest receptive field 2^N times the original one. Note that adding pyramid levels introduces limited extra computation since the size of the feature map decreases quickly across levels. The total extra computation for level N−1 is only 1/2^N of that in level 0 for feature extraction.

The receptive field and running time statistics are reported in Table 1. For 5 levels, our testing time is only about 3 times that spent in level 0, including overhead in up- and down-sampling. Note that there is only one feature extraction module in level 0 while level 4 has 5 of them. Our effective receptive field is as large as 511×511 pixels.

We compare our CNP model with a baseline “Single-Level CNN” in Table 1 that does not contain any pyramid levels, so its structure is similar to level 0 of our model. To obtain a 95×95 receptive field, 48 3×3 convolution layers are needed, which consume nearly 9GB of GPU memory. Our 3-level system yields similar performance.

Table 1: Time and memory consumption analysis in terms of different sizes of receptive fields. “Single-Level CNN” denotes the baseline model without pyramid levels. VGA-size input images are used in experiments. Our GPU memory does not allow 112 or 256 layers.

Comparison with Simple Multiscale Fusion We compare our framework with simple multiscale fusion described in Section 3.1 under the same parameter setting and report the results in Table 2 for depth/RGB image restoration on NYU Depth V2 benchmark [39]. Our framework outperforms this alternative under all levels. More pyramid levels yield larger improvement because we adapt network depth in levels while simple multiscale fusion does not.

Effectiveness of the Pyramid We use the same depth/RGB restoration benchmark to verify the effectiveness of our network because this task needs large-region information at different scales to fill in holes and remove artifacts. More details of the task are included in Section 4. We evaluate our network under different settings and report the results in Table 3. They indicate that involving several pyramid levels can greatly improve depth/RGB restoration quality, since it benefits from the large receptive field sizes discussed above. We also test the influence of S, which is the number of nonlinear transform layers. Setting it between 1 and 3 does not significantly change the result since the receptive field is not much influenced.

Visualizing What CNP Learns We visualize the information CNP learns at its different pyramid levels for depth restoration. We block the information of some levels to understand how the network proceeds. The model has 5 levels with 1 transform layer each. We first block levels L_1 to L_4 and only show the result inferred from L_0 in Figure 4(d). The small receptive field of L_0 is sufficient to remove small holes; for large ones, artifacts are generated.

Then we gradually include more levels and show results in Figure 4(e-h). The capacity for removing holes clearly increases as the number of pyramid levels grows. The biggest region on the chair is completely restored when all 5 levels are used. This shows that the pyramid levels capture different degrees of semantic information, each working best on particular structures. Our fusion process makes good use of all of them.

Relation to Previous Methods Traditional convolution pyramids [9] only approximate large-kernel convolution by a series of small ones. They are limited to gradient interpolation and do not extend easily to learning. Our network is clearly different due to its new structure and its ability to learn general low-level vision operations.

When the number of our pyramid level is one, our network is similar to the network for image super-resolution [7]. We note our contribution is exactly on the pyramid structure and the effective adaptive depth for large receptive field generation. It can handle a large variety of complex tasks, which the method of [7] does not. We show these applications in what follows.

Difference from Detection and Segmentation CNNs Compared with popular CNNs such as VGG [40], ResNet [14] and FCN [28], the key difference is that our framework directly handles image-to-image regression for image processing in a very efficient way. Although some semantic segmentation frameworks, such as RefineNet [26], U-Net [34], and FRRN [31], also consider multi-scale information, they are fundamentally different due to the diverse nature of classification and regression. None of these methods can be directly applied (or through slight modification with regression loss) to our low-level vision tasks.

We implement our network in Caffe [17]; training and testing are performed on a single NVIDIA Titan X graphics card. For most applications, we set N=5 and S=1 by default. Following [18], we perform residual learning for all tasks. During training, a patch size of 81×81 and a batch size of 32 are used. We use a learning rate of 10^{−5} for all model training. In general, training for 2 to 8 epochs is enough.

We set the training loss as the combination of L2 losses in intensity and gradient. The reason is that the intensity loss retains appearance while the gradient loss has the ability to keep results sharp. The loss is expressed as $L(X,\hat{X}) = \|X-\hat{X}\| + \lambda\|\nabla X-\nabla\hat{X}\|$, where $X$ is the prediction and $\hat{X}$ is the ground truth. $\nabla$ is the gradient operator and $\|\cdot\|$ computes the L2 norm distance. $\lambda$ is the parameter balancing the two losses, set to 1 in all experiments. Our framework can also use other loss functions, such as the generative loss. Exploring them will be our future work.
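The loss above can be sketched on small 2-D arrays as follows. The text does not specify how the gradient operator is discretized or whether the norm is squared, so this example assumes forward differences in both directions and an unsquared L2 norm; the function names are illustrative.

```python
import math


def gradients(img):
    """Forward differences along x and y, concatenated into one flat list."""
    height, width = len(img), len(img[0])
    gx = [img[r][c + 1] - img[r][c] for r in range(height) for c in range(width - 1)]
    gy = [img[r + 1][c] - img[r][c] for r in range(height - 1) for c in range(width)]
    return gx + gy


def l2_distance(a, b):
    """L2 norm of the difference between two flat lists."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cnp_loss(pred, target, lam=1.0):
    """Intensity L2 term plus lam times the gradient L2 term."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    intensity = l2_distance(flat_pred, flat_target)
    gradient = l2_distance(gradients(pred), gradients(target))
    return intensity + lam * gradient


flat = [[0.0, 0.0], [0.0, 0.0]]
offset = [[1.0, 1.0], [1.0, 1.0]]   # uniform brightness shift, no edge change
spike = [[0.0, 0.0], [0.0, 2.0]]    # same intensity error norm, but adds edges

print(cnp_loss(offset, flat))   # 2.0
print(cnp_loss(spike, flat))    # about 4.83
```

Both predictions have the same intensity error, but the second one also changes the image gradients and is therefore penalized more, which is the sharpening effect the gradient term is meant to provide.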

Depth/RGB Image Restoration Depth map restoration with the guidance of RGB images is an important task since depth captured by cameras such as Microsoft Kinect often contains holes and severe noise. Depth estimated by stereo matching also contains errors in textureless and occluded regions. Refining the produced depth is therefore needed for 3D reconstruction and geometry estimation. Our framework can address this problem.

We evaluate our method on NYU Depth V2 dataset [39], where depth is captured by Microsoft Kinect. It includes 1,449 aligned depth/RGB pairs. In our experiment, we treat the provided processed depth maps as ground truth and randomly select 10% image pairs for testing and the remaining 90% for training. We simply stack the gray-scaled RGB image, depth map, and the mask that indicates location of missing values as the input to the network. The output is the refined depth map.
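The input construction described above can be sketched as below. The text does not say how the RGB image is converted to gray scale, how holes are encoded in the depth map, or the polarity of the mask, so the luma weights, the `missing_value` convention and the "1 marks a hole" choice are all assumptions made for illustration.

```python
def build_input(rgb, depth, missing_value=0.0):
    """Stack gray image, depth map and hole mask into a 3-channel input.

    rgb:   H x W list of (r, g, b) tuples
    depth: H x W list of floats, holes marked by `missing_value` (assumed)
    Returns a list of three H x W channels: [gray, depth, mask].
    """
    # Assumed gray conversion: common luma weights, not taken from the paper.
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]
    # Assumed mask polarity: 1.0 where the depth value is missing.
    mask = [[1.0 if d == missing_value else 0.0 for d in row] for row in depth]
    return [gray, depth, mask]


rgb = [[(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]]
depth = [[2.5, 0.0]]
gray_ch, depth_ch, mask_ch = build_input(rgb, depth)
print(mask_ch)   # [[0.0, 1.0]]
```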

Comparison of different methods is reported in Table 4. For the bilateral filter [42], guided image filtering [13], and weighted median filter [52], we first fill in the holes using joint-bilateral-filter-based normalized convolution [20] and then apply the filters to refine depth. We implement the method of [29] ourselves since no public code is available. The lower PSNRs of the filter-based methods in Table 4 are due to big holes. The methods of Lu [29] and mutual-structure filtering [38] achieve better results, but they face similar difficulties in restoring large regions. Our method produces high-quality results by properly handling different-scale holes.

We also evaluate our framework on the synthetic dataset [29]. Since no training data is provided, we collect it from the Middlebury stereo 2014 dataset [36] and the MPI Sintel stereo dataset [2]. We degrade depth maps by adding holes and noise according to the method of [29]. Our model setting and training are the same as those on the NYU Depth V2 dataset. The results are listed in Table 4, where all methods achieve good performance. This is because the synthetic data is simple: the reference image is of high quality, and the holes are relatively small.

In the visual comparison in Figure 5, our result has sharp boundaries and many details. More results are provided in our supplementary file.

Table 4: Comparisons of different depth/RGB restoration methods on the NYU Depth V2 and Lu datasets. We report average PSNRs of each method on the testing dataset.

Methods             NYU Depth V2 Data    Lu Data
Bilateral Filter    26.19                34.33
Guided Filter       27.02                34.86
Weighted Median     31.19                38.17
Lu                  34.53                39.20
Mutual-Structure    33.97                39.13
Ours                39.42                39.21
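For readers unfamiliar with the metric in Table 4, PSNR can be computed as in the following sketch. The peak value used by the authors for depth maps is not stated, so `peak=255.0` is an assumption, as is the list-of-rows image representation.

```python
import math


def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    mse = sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * math.log10(peak * peak / mse)


pred = [[100.0, 100.0]]
target = [[110.0, 90.0]]
print(round(psnr(pred, target), 2))   # 28.13
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to halving the mean squared error.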

Natural Image Completion We apply our method to image completion. For evaluation, we use the portrait dataset [37] since the portraits have rich structure and details. For each portrait, we randomly set 5% of the pixels visible and dilate these visible regions by 4 pixels. A degraded image is shown as Figure ?(a). The input to our framework is the degraded color image. We train our model with 1,800 portraits from [37] with our degradation. One example is shown in Figure ? where (a) is the input. In Figure ?(b), the result of normalized convolution with bilateral filter [20] is a bit blurry. The result in (c) is from the model of [33] trained by us on our data. This method has difficulty completing large holes because of its limited-size receptive field. The PatchMatch-based method [22] produces result (d). Our result in (e) is a well restored image. The quantitative evaluation in Table 5, using the testing dataset from [37] with our degradation, also confirms our superior performance.
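The degradation described above can be sketched as follows. The text does not specify the shape of the dilation or the value written into hidden pixels, so this example assumes a square (Chebyshev-distance) dilation and a fill value of 0; the function names and the `seed` parameter are illustrative.

```python
import random


def visibility_mask(height, width, visible_ratio=0.05, dilate=4, seed=None):
    """Boolean mask: True where a pixel stays visible after degradation."""
    rng = random.Random(seed)
    n_seeds = round(visible_ratio * height * width)
    seeds = rng.sample(range(height * width), n_seeds)   # distinct pixel indices
    mask = [[False] * width for _ in range(height)]
    for idx in seeds:
        r0, c0 = divmod(idx, width)
        # Assumed square dilation: mark every pixel within `dilate` of the seed.
        for r in range(max(0, r0 - dilate), min(height, r0 + dilate + 1)):
            for c in range(max(0, c0 - dilate), min(width, c0 + dilate + 1)):
                mask[r][c] = True
    return mask


def degrade(image, mask, fill=0):
    """Keep visible pixels and replace hidden ones with `fill` (assumed value)."""
    return [
        [px if keep else fill for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


mask = visibility_mask(40, 40, seed=1)
image = [[7] * 40 for _ in range(40)]
degraded = degrade(image, mask)
visible = sum(v for row in mask for v in row)
print(visible >= 80)   # True: at least the 5% seed pixels remain visible
```

After dilation the visible fraction is larger than 5%, since each seed pixel grows into a patch; overlapping patches make the exact fraction depend on the random draw.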

Learning Image Filter Our framework also benefits image filter learning. Similar to [46] and [27], we use our deep convolution pyramid to learn common edge-preserving image filters, including the weighted least square filter (WLS) [10], L0 [44], RTV [47], the weighted median filter (WMF) [52], and the rolling guidance filter (RGF) [51]. These filters are representative ones with either global optimization or local computation. We employ images from ImageNet [35] to train our models. The input to our network is the color channels and the output is the filtered color image. We quantitatively evaluate our method and the previous approach [46] on the testing dataset of [46] by measuring PSNRs. As reported in Table 6, our framework outperforms that of [46] and achieves performance comparable to the state-of-the-art method [27]. It is noteworthy that our testing is 31% faster than that of [27] on the same hardware.

Table 6: Performance of our framework for learning image filters. PSNRs of and are cited from paper .

Filter Type    PSNRs of    PSNRs of    Our PSNRs
WLS            36.2        39.4        39.6
L0             32.8        30.9        32.6
RTV            32.1        37.1        38.0
WMF            31.6        34.0        39.3
RGF            35.9        42.2        42.6

Image Noise Removal Our framework can also be applied to image noise removal. We train our model on ImageNet data [35]. To make the noisy input close to real scenes, we add Poisson noise and Gaussian white noise with deviation 0.01 to each image. Our model is directly trained on color images. One example is shown in Figure 6. Compared with the previous BM3D [4] and CNN denoising [32] approaches shown in (b) and (c), our result contains less noise and has well preserved details.

Figure 6: Comparison of image noise removal. (a) is the noisy input. (b) and (c) are the results of BM3D and CNN based denoising respectively. (d) is ours.

We have proposed a general and efficient convolutional neural network for low-level vision and image processing tasks. It can produce very large receptive fields, essential for many applications, without a corresponding increase in computation complexity. Our method is based on pyramid-style learning and introduces a new adaptive depth mechanism. We have provided many examples to verify its effectiveness.