Image denoising: Can plain Neural Networks compete with BM3D?

Harold C. Burger, Christian J. Schuler, and Stefan Harmeling
Max Planck Institute for Intelligent Systems, Tübingen, Germany

Abstract

Image denoising can be described as the problem of mapping from a noisy image to a noise-free image. The best currently available denoising methods approximate this mapping with cleverly engineered algorithms. In this work we attempt to learn this mapping directly with a plain multi layer perceptron (MLP) applied to image patches. While this has been done before, we will show that by training on large image databases we are able to compete with the current state-of-the-art image denoising methods. Furthermore, our approach is easily adapted to less extensively studied types of noise (by merely exchanging the training data), for which we achieve excellent results as well.

1. Introduction

An image denoising procedure takes a noisy image as input and outputs an image where the noise has been reduced. Numerous and diverse approaches exist: Some selectively smooth parts of a noisy image [25, 26]. Other methods rely on the careful shrinkage of wavelet coefficients [24, 18]. A conceptually similar approach is to denoise image patches by approximating noisy patches with a sparse linear combination of elements of a learned dictionary [1, 4]. The dictionary is sometimes learned on a noise-free dataset. Other methods also learn a global image prior on a noise-free dataset, for instance [20, 27, 9]. More recent approaches exploit the non-local statistics of images: Different patches in the same image are often similar in appearance [3, 13, 2]. This last class of algorithms, and in particular BM3D [3], represents the current state-of-the-art in natural image denoising.
While BM3D is a well-engineered algorithm, could we also automatically learn an image denoising procedure purely from training examples consisting of pairs of noisy and noise-free patches? This paper will show that it is indeed possible to achieve state-of-the-art denoising performance with a plain multi layer perceptron (MLP) that maps noisy patches onto noise-free ones. This is possible because the following factors are combined:
(i) The capacity of the MLP is chosen large enough, i.e. it consists of enough hidden layers with sufficiently many hidden units.
(ii) The patch size is chosen large enough, i.e. a patch contains enough information to recover a noise-free version. This is in agreement with previous findings [12].
(iii) The chosen training set is large enough. Training examples are generated on the fly by corrupting noise-free patches with noise.

Training high capacity MLPs with large training sets is feasible using modern Graphics Processing Units (GPUs).

Contributions: We present a patch-based denoising algorithm that is learned on a large dataset with a plain neural network. Results on additive white Gaussian (AWG) noise are competitive with the current state of the art. The approach is equally valid for other types of noise that have not been as extensively studied as AWG noise.

2. Related work

Neural networks have already been used to denoise images [9]. The networks commonly used are of a special type, known as convolutional neural networks (CNNs) [10], which have been shown to be effective for various tasks such as hand-written digit and traffic sign recognition [23]. CNNs exhibit a structure (local receptive fields) specifically designed for image data. This reduces the number of parameters compared to plain multi layer perceptrons while still providing good results, which is useful when the amount of training data is small.
On the other hand, multi layer perceptrons are potentially more powerful than CNNs: MLPs can be thought of as universal function approximators [8], whereas CNNs restrict the class of possible learned functions. A different kind of neural network with a special architecture (i.e. containing a sparsifying logistic) is used in [19] to denoise image patches. A small training set is used, and results are reported for strong levels of noise. It has also been attempted to denoise images by applying multi layer perceptrons on wavelet coefficients [28]. The use of wavelet bases can be seen as an attempt to incorporate prior knowledge about images.

Differences to this work: Most methods we have described make assumptions about natural images. We do not explicitly impose such assumptions, but rather propose a pure learning approach.

3. Multi layer perceptrons (MLPs)

A multi layer perceptron (MLP) is a nonlinear function that maps vector-valued input via several hidden layers to vector-valued output. For instance, an MLP with two hidden layers can be written as

$$f(x) = b_3 + W_3 \tanh(b_2 + W_2 \tanh(b_1 + W_1 x)). \tag{1}$$

The weight matrices $W_1, W_2, W_3$ and vector-valued biases $b_1, b_2, b_3$ parameterize the MLP; the function $\tanh$ operates component-wise. The architecture of an MLP is defined by the number of hidden layers and by the layer sizes. For instance, a (256,2000,1000,10)-MLP has two hidden layers. The input layer is 256-dimensional, i.e. $x \in \mathbb{R}^{256}$. The vector $v_1 = \tanh(b_1 + W_1 x)$ of the first hidden layer is 2000-dimensional, the vector $v_2 = \tanh(b_2 + W_2 v_1)$ of the second hidden layer is 1000-dimensional, and the vector $f(x)$ of the output layer is 10-dimensional. An MLP is also commonly called a feed-forward neural network.

3.1. Training MLPs for image denoising

The idea is to learn an MLP that maps noisy image patches onto clean image patches where the noise is reduced or even removed. The parameters of the MLP are estimated by training on pairs of noisy and clean image patches using stochastic gradient descent [11]. More precisely, we randomly pick a clean patch $y$ from an image dataset and generate a corresponding noisy patch $x$ by corrupting $y$ with noise, for instance with additive white Gaussian (AWG) noise. The MLP parameters are then updated by the backpropagation algorithm [21], minimizing the quadratic error between the mapped noisy patch $f(x)$ and the clean patch $y$, i.e. minimizing $(f(x) - y)^2$.
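The forward pass of Eq. (1) and one stochastic gradient step on the quadratic error can be sketched as follows. This is a minimal NumPy illustration, not the authors' GPU implementation; the function names, the toy layer sizes, the learning rate, and the weight initialization (Gaussian, scaled by the inverse square root of the number of inputs) are assumptions chosen for the example.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (W, b) pairs for layer sizes such as (256, 2000, 1000, 10).

    Assumption: Gaussian weights scaled by 1/sqrt(fan_in), zero biases.
    """
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in)),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Evaluate f(x); also return the input and the activations of every layer."""
    activations = [x]
    for i, (W, b) in enumerate(params):
        z = b + W @ activations[-1]
        # tanh on hidden layers, linear output layer, as in Eq. (1)
        activations.append(z if i == len(params) - 1 else np.tanh(z))
    return activations[-1], activations

def sgd_step(params, x, y, lr):
    """One gradient step on sum((f(x) - y)^2); returns new params and the old loss."""
    out, acts = forward(params, x)
    delta = 2.0 * (out - y)                  # gradient w.r.t. the linear output
    new_params = []
    for i in reversed(range(len(params))):
        W, b = params[i]
        grad_W = np.outer(delta, acts[i])    # acts[i] is the input of layer i
        grad_b = delta
        if i > 0:
            # backpropagate through tanh: d tanh(z)/dz = 1 - tanh(z)^2
            delta = (W.T @ delta) * (1.0 - acts[i] ** 2)
        new_params.append((W - lr * grad_W, b - lr * grad_b))
    return new_params[::-1], float(np.sum((out - y) ** 2))

# Toy usage: repeatedly fit one (noisy, clean) pair of 16-dimensional "patches".
rng = np.random.default_rng(0)
params = init_mlp((16, 32, 32, 16), rng)
y = rng.uniform(-1.0, 1.0, size=16)          # clean patch (hypothetical data)
x = y + rng.normal(0.0, 0.1, size=16)        # noisy patch
losses = []
for _ in range(200):
    params, loss = sgd_step(params, x, y, lr=0.005)
    losses.append(loss)
```

In actual training a new pair (x, y) would be drawn at every step; the single fixed pair here only serves to show that the update reduces the error.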
To make backpropagation more efficient, we apply various common neural network tricks [11]:
1. Data normalization: The pixel values are transformed to have approximately mean zero and variance close to one. More precisely, assuming pixel values between 0 and 1, we subtract 0.5 and divide by […].
2. Weight initialization: The weights are sampled from a normal distribution with mean 0 and standard deviation $\sigma = N^{-1/2}$, where $N$ is the number of input units of the corresponding layer. Combined with the first trick, this ensures that both the linear and the non-linear parts of the sigmoid function are reached.
3. Learning rate division: In each layer, we divide the learning rate by $N$, the number of input units of that layer. This allows us to change the number of hidden units without modifying the learning rate.

The basic learning rate was set to 0.1 for all experiments. No regularization was applied on the weights.

3.2. Applying MLPs for image denoising

To denoise images, we decompose a given noisy image into overlapping patches and denoise each patch $x$ separately. The denoised image is obtained by placing the denoised patches $f(x)$ at the locations of their noisy counterparts, then averaging on the overlapping regions. We found that we could improve results slightly by weighting the denoised patches with a Gaussian window. Also, instead of using all possible overlapping patches (stride size 1, i.e. patch offset 1), we found that results were almost equally good when using every third sliding-window patch in each direction (stride size 3), while decreasing computation time by a factor of 9. Using a stride size of 3, we were able to denoise images of size […] pixels in approximately one minute on a modern CPU, which is slower than BM3D [3], but much faster than KSVD [1].

3.3. Implementation

The computationally most intensive operations in an MLP are the matrix-vector multiplications. Graphics Processing Units (GPUs) are better suited for these operations than Central Processing Units (CPUs).
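The patch-wise procedure described under "Applying MLPs for image denoising" (denoise overlapping patches, weight them with a Gaussian window, average the overlaps) can be sketched as below. This is an illustrative NumPy version under stated assumptions: `denoise_patch` stands in for the trained MLP, the window width `window_sigma` is a hypothetical parameter (the text gives no value), and forcing the last patch to touch the image border is a choice made here so that every pixel is covered.

```python
import numpy as np

def gaussian_window(patch_size, sigma):
    """2-D Gaussian weights centered on the patch (sigma is an assumed parameter)."""
    ax = np.arange(patch_size) - (patch_size - 1) / 2.0
    g = np.exp(-0.5 * (ax / sigma) ** 2)
    return np.outer(g, g)

def patch_origins(length, patch_size, stride):
    """Top-left coordinates along one axis; the last patch is forced to reach the border."""
    origins = list(range(0, length - patch_size + 1, stride))
    if origins[-1] != length - patch_size:
        origins.append(length - patch_size)
    return origins

def denoise_image(noisy, denoise_patch, patch_size=17, stride=3, window_sigma=None):
    """Denoise each patch separately, then take the window-weighted average of overlaps."""
    if window_sigma is None:
        window_sigma = patch_size / 3.0      # assumption, not a value from the text
    window = gaussian_window(patch_size, window_sigma)
    accum = np.zeros_like(noisy, dtype=np.float64)
    weight = np.zeros_like(noisy, dtype=np.float64)
    for r in patch_origins(noisy.shape[0], patch_size, stride):
        for c in patch_origins(noisy.shape[1], patch_size, stride):
            patch = noisy[r:r + patch_size, c:c + patch_size]
            clean = denoise_patch(patch.reshape(-1)).reshape(patch_size, patch_size)
            accum[r:r + patch_size, c:c + patch_size] += window * clean
            weight[r:r + patch_size, c:c + patch_size] += window
    return accum / weight

# Usage with an identity "denoiser": the weighted average must reproduce the input.
rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, size=(40, 50))
out = denoise_image(img, lambda p: p)
```

With a stride of 3 in both directions, roughly one in 3 × 3 = 9 patch positions is evaluated, which is where the factor-of-9 reduction in computation comes from.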
For this reason we implemented our MLP on a GPU. We used NVIDIA's C2050 GPU and achieved a speed-up of more than one order of magnitude compared to an implementation on a quad-core CPU. This speed-up is a crucial factor, allowing us to run much larger-scale experiments.

4. Experimental setup

We performed all our experiments on gray-scale images. These were obtained from color images with MATLAB's rgb2gray function. Since it is unlikely that two noise samples are identical, the amount of training data is effectively infinite, no matter which dataset is used. However, the number of uncorrupted patches is restricted by the size of the dataset.

Training data: For our experiments, we define two training sets.

Small training set: The Berkeley segmentation dataset [15], containing 200 images.

Large training set: The union of the LabelMe dataset [22] (containing approximately 150,000 images) and the Berkeley segmentation dataset. Some images in the LabelMe dataset appeared a little noisy or a little blurry, so we downscaled the images in that dataset by a factor of 2 using MATLAB's imresize function with default parameters.

Test data: We define three different test sets to evaluate our approach:
(i) Standard test images: This set of 11 images contains standard images, such as Lena and Barbara, that have been used to evaluate other denoising algorithms [3].
(ii) Pascal VOC 2007: We randomly selected 500 images from the Pascal VOC 2007 test set [5].
(iii) McGill: We randomly selected 500 images from the McGill dataset [17].

Figure 1. Improving average PSNR on the images Barbara and Lena while training. (Plot: average PSNR [dB] versus number of training samples, ×10^7; AWG noise, σ = 25; curves: L 17 4x2047, L 13 4x2047, L 13 2x2047, L 17 2x2047, L 13 2x511, S 17 4x2047, S 13 2x511.)

Figure 3. Performance profile of our method on two datasets of 500 test images compared to BM3D. (Plot: improvement in PSNR over BM3D [dB] versus sorted image index; AWG noise, σ = 25; curves: McGill, VOC.)

5. Results

We first study how denoising performance depends on the MLP architecture and the number of training examples. Then we compare against BM3D and other existing algorithms, and finally we show how MLPs perform on other types of noise.

5.1. More training data and more capacity is better

We train networks with different architectures and patch sizes. We write for instance L 17 4x2047 for a network that is trained on the large training set with patch size 17 × 17 and 4 hidden layers of size 2047; similarly S 13 2x511 for a network that is trained on the small training set with patch size 13 × 13 and 2 hidden layers of size 511. Other architectures are denoted in the legend of Figure 1. All these MLPs are trained on image patches that have been corrupted with Gaussian noise with σ = 25.
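Because the noise is drawn fresh for every example, a finite set of clean images yields an effectively unlimited stream of training pairs. A minimal sketch of such an on-the-fly sampler follows, assuming gray-scale images stored as NumPy arrays with pixel values in [0, 255] (so that σ = 25 refers to that scale); the function name and the random stand-in images are hypothetical, and the data normalization step is omitted.

```python
import numpy as np

def sample_training_pair(images, patch_size, sigma, rng):
    """Return (noisy, clean) patch vectors; the noise is drawn anew on every call."""
    img = images[rng.integers(len(images))]
    r = rng.integers(img.shape[0] - patch_size + 1)
    c = rng.integers(img.shape[1] - patch_size + 1)
    clean = img[r:r + patch_size, c:c + patch_size].astype(np.float64).reshape(-1)
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)   # AWG noise
    return noisy, clean

# Usage with random arrays standing in for a real image dataset.
rng = np.random.default_rng(0)
images = [rng.uniform(0.0, 255.0, size=(64, 64)) for _ in range(3)]
noisy, clean = sample_training_pair(images, patch_size=17, sigma=25.0, rng=rng)
```

A 17 × 17 patch gives 289-dimensional input and target vectors; the same clean patch will essentially never be paired with the same noise sample twice.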
To monitor the performance of the networks, we test each of them after every two million training examples on two test images, Barbara and Lena, that have been corrupted with Gaussian noise with standard deviation σ = 25. Figure 1 shows the improving PSNR of the networks on the two test images.

Observations: Many training examples are needed to achieve good results. Progress is steady for the first 40 million training samples. After that, the PSNR on the test images still improves, albeit more slowly. Overfitting never seems to be an issue. Better results can be achieved with patches of size 17 × 17 compared to patches of size 13 × 13. Also, more complex networks lead to better results. Switching from the small training set (Berkeley) to the large training set (LabelMe + Berkeley) improves the results enormously. We note that most attempts to learn image statistics using a training dataset have used only the Berkeley segmentation dataset [20, 9, 19].

5.2. Can MLPs compete with BM3D?

In the previous section, the MLP L 17 4x2047 with four hidden layers of size 2047 and a patch size of 17 × 17, trained on the large training set, achieved the best results. We trained this MLP on a total of 362 million training samples, requiring approximately one month of computation time on a GPU. In the following, we compare its results on the test data with other denoising methods, including BM3D [3].

Pascal VOC 2007, McGill: Figure 3 compares our method with BM3D on PASCAL VOC 2007 and McGill. To reduce computation time during denoising, we used a patch offset (stride size) of 3. On average, our results are on par with BM3D: better by 0.03 dB on PASCAL VOC 2007 and better by 0.08 dB on McGill. More precisely, our MLP outperforms BM3D on exactly 300 of the 500 PASCAL VOC 2007 images, see Figure 3. Similarly, our MLP is better on 347 of the 500 McGill images. Our best improvement over BM3D is 0.81 dB on image pippin0120; BM3D is better by 1.32 dB on image merry mexico0152, both in the McGill dataset.

Figure 2. Our results compared to BM3D. Our method outperforms BM3D on some images (top row). On other images however, BM3D achieves much better results than our approach. The images on which BM3D is much better than our approach usually contain some kind of regular structure, such as the stripes on Barbara's pants (bottom row). (Top row: clean; noisy, σ = 25, PSNR 20.16 dB; BM3D, PSNR 29.65 dB; ours, PSNR 30.03 dB. Bottom row, Barbara: clean; noisy, σ = 25, PSNR 20.19 dB; BM3D, PSNR 30.67 dB; ours, PSNR 29.21 dB.)

Table 1. PSNRs (in dB) on standard test images, σ = 25.

Image     GSM [18]   KSVD [1]   BM3D [3]   ours
Barbara   27.83      29.49      30.67      29.21
Boat      29.29      29.24      29.86      29.89
C.man     28.64      28.64      29.40      29.32
Couple    28.94      28.87      29.68      29.70
F.print   27.13      27.24      27.72      27.50
Hill      29.26      29.20      29.81      29.82
House     31.60      32.08      32.92      32.50
Lena      31.25      31.30      32.04      32.12
Man       29.16      29.08      29.58      29.81
Montage   30.73      30.91      32.24      31.85
Peppers   29.49      29.69      30.18      30.25

Standard test images: We also compare our MLP (with a stride size of 1) to BM3D on the set of standard test images, see Table 1. For BM3D, we report the average results for 105 different noisy instances of the same test image. Due to longer running times, we used only 17 different noisy instances for our approach. We outperform BM3D on 6 of the 11 test images. BM3D has a clear advantage on images with regular structures, such as the pants of Barbara. We outperform KSVD [1] on every image except Barbara. KSVD is a dictionary-based denoising algorithm that learns a dictionary adapted to the noisy image at hand. Images with a lot of repeating structure are ideal for both BM3D and KSVD. We see that the neural network is able to compete with BM3D.
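The comparisons above are expressed in PSNR. As a worked example, the sketch below uses the standard definition PSNR = 10 log10(peak² / MSE), with peak = 255 assumed for 8-bit images, and recounts the per-image wins from the Table 1 values. For unclipped AWG noise with σ = 25 the expected PSNR of the noisy image is 20 log10(255/25) ≈ 20.17 dB, in line with the roughly 20.2 dB reported for the noisy inputs in Figure 2. The flat gray test image is a stand-in, not one of the paper's images.

```python
import numpy as np

def psnr(clean, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB (peak=255 assumes 8-bit images)."""
    diff = np.asarray(clean, dtype=np.float64) - np.asarray(estimate, dtype=np.float64)
    return 10.0 * np.log10(peak ** 2 / np.mean(diff ** 2))

# PSNR of a noisy image with AWG noise, sigma = 25 (no clipping applied here).
rng = np.random.default_rng(0)
clean = np.full((256, 256), 128.0)
noisy = clean + rng.normal(0.0, 25.0, size=clean.shape)
noisy_psnr = psnr(clean, noisy)

# Table 1 values in dB: (GSM, KSVD, BM3D, MLP)
table1 = {
    "Barbara": (27.83, 29.49, 30.67, 29.21),
    "Boat":    (29.29, 29.24, 29.86, 29.89),
    "C.man":   (28.64, 28.64, 29.40, 29.32),
    "Couple":  (28.94, 28.87, 29.68, 29.70),
    "F.print": (27.13, 27.24, 27.72, 27.50),
    "Hill":    (29.26, 29.20, 29.81, 29.82),
    "House":   (31.60, 32.08, 32.92, 32.50),
    "Lena":    (31.25, 31.30, 32.04, 32.12),
    "Man":     (29.16, 29.08, 29.58, 29.81),
    "Montage": (30.73, 30.91, 32.24, 31.85),
    "Peppers": (29.49, 29.69, 30.18, 30.25),
}
mlp_beats_bm3d = [name for name, (_, _, bm3d, mlp) in table1.items() if mlp > bm3d]
ksvd_beats_mlp = [name for name, (_, ksvd, _, mlp) in table1.items() if ksvd > mlp]
```

Counting the rows this way reproduces the statements in the text: the MLP is ahead of BM3D on 6 of the 11 images, and KSVD is ahead of the MLP only on Barbara.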

Figure 4. Comparison on images with various noise levels: the MLP trained for σ = 25 is competitive for σ = 25. The MLP trained on several noise levels is also competitive on higher noise levels. (Plot: average PSNR [dB] versus noise level σ; curves: BM3D; ours, trained on several noise levels; GSM; KSVD; BM3D, assuming σ = 25; ours, trained on σ = 25.)

5.3. Robustness at other noise levels

The MLP from the previous section was trained solely on image patches that were corrupted with AWG noise with σ = 25. Is it able to handle other noise levels (σ smaller or larger than 25) as well? To answer this question, we applied it to the 11 standard test images corrupted with AWG noise with different values of σ. Figure 4 shows a comparison against results achieved by GSM, KSVD and BM3D. We see that for σ = 25 our MLP (brown line) is competitive, but that it deteriorates for other noise levels. While our MLP does not know that the level of the noise has changed, the other methods were provided with that information. To study this effect we also ran BM3D for the different noise levels while fixing its input parameter to σ = 25 (red curve). We see a behavior similar to that of our method. Our MLP generalizes even slightly better to higher noise levels (brown above red).

5.4. MLPs trained on several noise levels

To overcome the limitations of an MLP trained on examples from a single noise level, we attempted to train a network on image patches corrupted by noise with different noise levels. We used the same architecture as for our network trained on σ = 25. The amount of noise of a given training patch (i.e. the value of σ) was given as additional input to the network. This was done in two ways: (i) one additional input unit provided the value of σ directly; (ii) 15 additional input units worked as switches with all units set to 1 except for the one unit coding the corresponding value of σ. Training proceeded as previously and σ was chosen randomly in steps of 5 between 0 and 105.
We tested this network on the 11 standard test images for different values of σ, see green line in Figure 4. Even though we outperform BM3D on none of the noise levels, we do perform better than both GSM and KSVD at high noise levels. At low noise levels (σ = 5) our denoising results are worse than the noisy input. We draw the following conclusions: Denoising at several noise levels is more difficult than denoising at a single noise level. Hence, a network with more capacity (i.e. parameters) should be used. The fact that the network performs better at high noise levels is presumably due to the fact that noisier patches provide stronger gradients. The higher noise levels therefore dominate the training procedure. A potential solution might be to adapt the learning rate to the value of σ.

5.5. Learning to remove arbitrary noise types

Virtually all denoising algorithms assume the noise to be AWG. However, images are not always corrupted by AWG noise. Noise is not necessarily additive, white, Gaussian and signal independent. For instance, in some situations the imaging process is corrupted by Poisson noise (such as photon shot noise). Denoising algorithms which assume AWG noise might be applied to such images using some image transform [14]. Similarly, Rice-distributed noise, which occurs in magnetic resonance imaging, can be handled [6]. In most cases however, it is more difficult or even impossible to find Gaussianizing transforms. In such cases, a possible solution is to create a denoising algorithm specifically designed for that noise type. MLPs allow us to effectively learn a denoising algorithm for a given noise type, provided that noise can be simulated.
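Since the only noise-specific ingredient is the ability to simulate the corruption, switching noise types amounts to swapping one function in the training-data generator. The sketch below shows two such simulators in NumPy: plain AWG noise, and a structured additive Gaussian noise in which runs of 8 horizontally adjacent pixels share a single noise value, which is the stripe noise used in the experiments of this section. The function names, the value of σ, and the alignment of the runs to multiples of 8 columns are assumptions made for illustration.

```python
import numpy as np

def awg_noise(clean, sigma, rng):
    """Additive white Gaussian noise: an independent sample for every pixel."""
    return clean + rng.normal(0.0, sigma, size=clean.shape)

def stripe_noise(clean, sigma, rng, run=8):
    """Additive Gaussian noise where `run` horizontally adjacent pixels share one value.

    Assumption: runs start at columns 0, run, 2*run, ...
    """
    h, w = clean.shape
    n_runs = -(-w // run)                        # ceil(w / run)
    base = rng.normal(0.0, sigma, size=(h, n_runs))
    return clean + np.repeat(base, run, axis=1)[:, :w]

# Usage: corrupt an all-zero image so that the output equals the noise itself.
rng = np.random.default_rng(0)
noise = stripe_noise(np.zeros((4, 20)), sigma=25.0, rng=rng)
```

Either function could be passed to a training-pair sampler in place of the other; nothing else in the training procedure needs to change.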
In the following, we present results on three noise types that are different from AWG noise. We make no effort to adapt our architecture or procedure to the specific noise type, but rather use the architecture that yielded the best results for AWG noise (four hidden layers of size 2047 and patches of size 17 × 17).

Stripe noise: It is often assumed that image data contains structure, whereas the noise is uncorrelated and therefore unstructured. In cases where the noise also exhibits structure, this assumption is violated and denoising results become poor. We here show an example where the noise is additive and Gaussian, but where 8 horizontally adjacent noise values have the same value. Since there is no canonical denoising algorithm for this noise, we choose BM3D as the competitor. An MLP trained on 58 million training examples outperformed BM3D for this type of noise, see left column of Figure 5.

Salt and pepper noise: When the noise is additive Gaussian, the noisy image value is still correlated to the original image value. With salt and pepper noise, noisy values are not correlated with the original image data. Each pixel has a probability p of being corrupted. A corrupted pixel has probability 0.5 of being set to 0; otherwise, it is set to high-[…]

Figure 6. Random selection of weights in the output layer. Each patch represents the weights from one hidden neuron to the output pixels.

Figure 7. Random selection of weights in the input layer. Each patch represents the weights from the input pixels to one hidden neuron.

Figure 8. The $\ell_1$-norm of the weights of some layers decreases during training (without any explicit regularization). (Plot: weight norm evolution during training; $\ell_1$-norm divided by initial $\ell_1$-norm versus iteration, one curve per layer.)

The output-layer patches (Figure 6) can be categorized coarsely into four categories: 1) patches resembling Gabor filters, 2) blobs, 3) larger scale structures, and 4) noisy patches. The Gabor filters occur at different scales, shifts and orientations. Similar dictionaries have also been learned by other denoising approaches. It should be noted that MLPs are not shift-invariant, which explains why some patches are shifted versions of each other.

The weights connecting the noisy input pixels to one hidden neuron in the first hidden layer can also be represented as an image patch, see Figure 7. The patches can be interpreted as filters, with the activity of the hidden neuron connected to a patch corresponding to the filter's response to the input. These patches can be classified into three main categories: 1) patches that focus on just a small number of pixels, 2) patches focusing on larger regions and resembling Gabor filters, and 3) patches that look like random noise. These filters are able to extract useful features from noisy input data, but are more difficult to interpret than the output layer patches.

It is also interesting to observe the evolution of the $\ell_1$-norm of the weights in the different layers during training, see Figure 8. One might be tempted to think of the evolution of the weights as following a random walk. In that case, the $\ell_1$-norm should increase over time.
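The random-walk intuition can be checked with a small simulation: if every weight receives independent zero-mean updates, its variance grows linearly with the number of steps, so the summed absolute value of the weights grows as well. The sketch below is purely illustrative; the number of weights, the initial scale, and the step size are hypothetical and unrelated to the networks in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=10_000)      # hypothetical initial weights
initial_l1 = np.abs(weights).sum()

ratios = []                                       # l1 norm / initial l1 norm
for step in range(1, 2001):
    weights += rng.normal(0.0, 0.01, size=weights.shape)   # unbiased random update
    if step % 500 == 0:
        ratios.append(np.abs(weights).sum() / initial_l1)
```

Under this model the recorded ratios increase steadily, which is the behavior a pure random walk would produce.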
However, we observe that in all layers but the first, the $\ell_1$-norm decreases over time (after a short initial period where it increases). This happens in the absence of any explicit regularization on the weights and is an indication that such regularization is not necessary.

MLPs vs. Support Vector Regression: We use MLPs to solve a regression problem in order to learn a denoising method. An equally valid approach would have been to use a kernel method such as support vector regression (SVR). For practical (rather than fundamental) reasons we preferred MLPs over SVR: (i) MLPs are easy to implement on a GPU since they are based on matrix-vector multiplications. (ii) MLPs can easily be trained on very large datasets using stochastic gradient descent. However, we make no claim regarding the quality of results potentially achievable with SVR: It is entirely possible that SVR would yield still better results than our MLPs.

Is deep learning necessary? Training MLPs with many hidden layers can lead to problems such as vanishing gradients and over-fitting. To avoid these problems, new training procedures called deep learning, which start with an unsupervised learning phase, have been proposed [7]. Such an approach makes the most sense when labeled data is scarce but unlabeled data is plentiful, and when the networks are too deep to be trained effectively with back-propagation. In our case, labeled data is plentiful and the networks contain no more than four hidden layers. We found back-propagation to work well and therefore concluded that deep learning techniques are not necessary, though it is possible that still better results are achievable with an unsupervised pre-training technique.

7. Conclusion

Neural networks can achieve state-of-the-art image denoising performance. For this, it is important that (i) the capacity of the network is large enough, (ii) the patch size is large enough, and (iii) the training set is large enough. These requirements can be fulfilled by implementing MLPs […]
