Image-to-image translation with pix2pix

Conditional GANs (cGANs) may be used to generate one type of object based on another - e.g., a map based on a photo, or a color video based on black-and-white. Here, we show how to implement the pix2pix approach with Keras and eager execution.

What do we need to train a neural network? A common answer is: a model, a cost function, and an optimization algorithm. (I know: I’m leaving out the most important thing here - the data.)

As computer programs work with numbers, the cost function has to be pretty specific: We can’t just say “predict next month’s demand for lawn mowers please, and do your best”; we have to say something like “minimize the squared deviation of the estimate from the target value”.

In some cases it may be straightforward to map a task to a measure of error, in others, it may not. Consider the task of generating non-existing objects of a certain type (like a face, a scene, or a video clip). How do we quantify success? The trick with generative adversarial networks (GANs) is to let the network learn the cost function.

As shown in Generating images with Keras and TensorFlow eager execution, in a simple GAN the setup is this: One agent, the generator, keeps on producing fake objects. The other, the discriminator, is tasked to tell apart the real objects from the fake ones. For the generator, loss is augmented when its fraud gets discovered, meaning that the generator’s cost function depends on what the discriminator does. For the discriminator, loss grows when it fails to correctly tell apart generated objects from authentic ones.

In a GAN of the type just described, creation starts from white noise. However, in the real world, what is required may be a form of transformation, not creation. Take, for example, colorization of black-and-white images, or conversion of aerials to maps. For applications like those, we condition on additional input; hence the name, conditional adversarial networks.

Put concretely, this means the generator is passed not (or not only) white noise, but data of a certain input structure, such as edges or shapes. It then has to generate realistic-looking pictures of real objects having those shapes. The discriminator, too, may receive the shapes or edges as input, in addition to the fake and real objects it is tasked to tell apart.

Here are a few examples of conditioning, taken from the paper we’ll be implementing (see below):

In this post, we port to R a Google Colaboratory Notebook using Keras with eager execution. We’re implementing the basic architecture from pix2pix, as described by Isola et al. in their 2016 paper (Isola et al. 2016). It’s an interesting paper to read as it validates the approach on a bunch of different datasets, and shares outcomes of using different loss families, too:

Prerequisites

The code shown here will work with the current CRAN versions of tensorflow, keras, and tfdatasets. Also, be sure to check that you’re using at least version 1.9 of TensorFlow. If that isn’t the case, as of this writing, this

library(tensorflow)
install_tensorflow()

will get you version 1.10.

When loading libraries, please make sure you’re executing the first 4 lines in the exact order shown. We need to make sure we’re using the TensorFlow implementation of Keras (tf.keras in Python land), and we have to enable eager execution before using TensorFlow in any way.

No need to copy-paste any code snippets - you’ll find the complete code (in the order necessary for execution) here: eager-pix2pix.R.

Dataset

Each image file contains two pictures next to each other: the ground truth, which we’d like the generator to produce and the discriminator to correctly detect as authentic, and the input we’re conditioning on (a coarse segmentation into object classes).

Preprocessing

Obviously, our preprocessing will have to split the input images into parts. That’s the first thing that happens in the function below.

After that, what happens depends on whether we’re in the training or the testing phase. If we’re training, we perform random jittering by upsizing the image to 286x286 and then randomly cropping back to the original size of 256x256. In about 50% of the cases, we also flip the image left-to-right.

In both cases, training and testing, we normalize the image to the range between -1 and 1.
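
To make the steps concrete, here is a minimal sketch in plain R, working on an ordinary array instead of a tensor. It is an illustration of the arithmetic only, not the companion code: the helper names (resize_nearest, random_jitter, normalize) are hypothetical, nearest-neighbor resizing is just one possible resize method, and pixel values are assumed to lie between 0 and 255.

```r
# Plain-R sketch of the preprocessing steps (illustration only).
# All helper names are hypothetical; the post itself relies on tf$image ops.

# Nearest-neighbor resize of a height x width x channels array.
resize_nearest <- function(img, new_h, new_w) {
  rows <- ceiling(seq_len(new_h) * dim(img)[1] / new_h)
  cols <- ceiling(seq_len(new_w) * dim(img)[2] / new_w)
  img[rows, cols, , drop = FALSE]
}

# Random jitter: upsize to 286x286, crop a random 256x256 window,
# and flip left-to-right with probability 0.5.
random_jitter <- function(img, big = 286, out = 256) {
  img  <- resize_nearest(img, big, big)
  top  <- sample.int(big - out + 1, 1)
  left <- sample.int(big - out + 1, 1)
  img  <- img[top:(top + out - 1), left:(left + out - 1), , drop = FALSE]
  if (runif(1) > 0.5) img <- img[, rev(seq_len(out)), , drop = FALSE]
  img
}

# Map pixel values from [0, 255] (assumed input range) to [-1, 1].
normalize <- function(img) img / 127.5 - 1

set.seed(1)
img <- array(sample(0:255, 256 * 256 * 3, replace = TRUE), dim = c(256, 256, 3))

train_img <- normalize(random_jitter(img))  # training: jitter, then normalize
test_img  <- normalize(img)                 # testing: normalize only

dim(train_img)    # 256 256 3
range(train_img)  # lies within [-1, 1]
```

At test time only normalize is applied, so the network always sees the same version of a test image.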

Note the use of the tf$image module for image-related operations. This is required as the images will be streamed via tfdatasets, which works on TensorFlow graphs.

Streaming the data

The images will be streamed via tfdatasets, using a batch size of 1. Note how the load_image function we defined above is wrapped in tf$py_func to enable accessing tensor values in the usual eager way (which by default, as of this writing, is not possible with the TensorFlow datasets API).

Defining the actors

Generator

First, here’s the generator. Let’s start with a bird’s-eye view.

The generator receives as input a coarse segmentation, of size 256x256, and should produce a nice color image of a facade. It first successively downsamples the input, down to a minimal size of 1x1. Then, after maximal condensation, it starts upsampling again until it has reached the required output resolution of 256x256.

During downsampling, as spatial resolution decreases, the number of filters increases. During upsampling, it goes the opposite way.
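
A quick bit of arithmetic shows how the 1x1 bottleneck comes about. The sketch below assumes that every downsampling step halves height and width and every upsampling step doubles them (stride 2 with "same" padding); that is an assumption made for illustration, and the variable names are hypothetical.

```r
# Spatial resolution through the generator, assuming each step halves
# (on the way down) or doubles (on the way up) height and width.
input_size <- 256
n_steps <- log2(input_size)   # number of halvings needed to reach 1x1

down_sizes <- input_size / 2^seq_len(n_steps)
up_sizes   <- 2^seq_len(n_steps)

n_steps     # 8
down_sizes  # 128 64 32 16 8 4 2 1
up_sizes    # 2 4 8 16 32 64 128 256
```

So under this assumption, eight steps are needed in each direction.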

How can spatial information be preserved if we downsample all the way down to a single pixel? The generator follows the general principle of a U-Net (Ronneberger, Fischer, and Brox 2015), where skip connections exist from layers earlier in the downsampling process to layers later on the way up.

Here, the inputs to self$up are x14, which went through all of the down- and upsampling, and x1, the output from the very first downsampling step. The former has resolution 64x64, the latter, 128x128. How do they get combined?

That’s taken care of by upsample, technically a custom model of its own. As an aside, we remark how custom models let you pack your code into nice, reusable modules.

x14 is upsampled to double its size, and x1 is appended as is. The axis of concatenation here is axis 4, the feature map / channels axis. x1 comes with 64 channels, x14 comes out of layer_conv_2d_transpose with 64 channels, too (because self$up7 has been defined that way). So we end up with an image of resolution 128x128 and 128 feature maps for the output of step x15.
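
The shape bookkeeping can be checked with plain R arrays. This is a sketch, not the Keras code: the two arrays are stand-ins filled with constants, laid out as batch x height x width x channels, and the variable names upsampled and skip are hypothetical.

```r
# Skip connection as concatenation along the channels axis (axis 4),
# sketched with plain R arrays of shape batch x height x width x channels.
upsampled <- array(0, dim = c(1, 128, 128, 64))  # stand-in for upsampled x14
skip      <- array(1, dim = c(1, 128, 128, 64))  # stand-in for x1

# R stores arrays in column-major order, so the last axis varies slowest
# and joining along it amounts to appending the two data vectors.
x15 <- array(c(upsampled, skip), dim = c(1, 128, 128, 128))

dim(x15)          # 1 128 128 128
x15[1, 1, 1, 64]  # 0, last channel coming from the upsampled tensor
x15[1, 1, 1, 65]  # 1, first channel coming from the skip connection
```

Height and width stay at 128x128; only the number of feature maps grows, from 64 to 128.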

Downsampling, too, is factored out to its own model. Here too, the number of filters is configurable.

Discriminator

Again, let’s start with a bird’s-eye view. The discriminator receives as input both the coarse segmentation and the ground truth. Both are concatenated and processed together. Just like the generator, the discriminator is thus conditioned on the segmentation.

What does the discriminator return? The output of self$last has one channel, but a spatial resolution of 30x30: We’re outputting a probability for each of 30x30 image patches (which is why the authors are calling this a PatchGAN).
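
Where might the 30x30 come from? The following sketch reproduces the number with the usual convolution output-size formula. The layer sequence is an assumption, not something spelled out above: three stride-2 downsampling steps with 4x4 kernels, followed by two 4x4 convolutions with stride 1, each preceded by zero-padding of 1 pixel per side. Please check it against the companion code; conv_out is a hypothetical helper.

```r
# Output size of a convolution along one spatial dimension:
# floor((n + 2 * pad - kernel) / stride) + 1
conv_out <- function(n, kernel, stride, pad) {
  floor((n + 2 * pad - kernel) / stride) + 1
}

n <- 256

# Assumed: three stride-2 downsampling steps (4x4 kernel, padding 1 per side)
for (i in 1:3) n <- conv_out(n, kernel = 4, stride = 2, pad = 1)
after_down <- n
after_down  # 32

# Assumed: two 4x4 convolutions with stride 1, each after zero-padding by 1
for (i in 1:2) n <- conv_out(n, kernel = 4, stride = 1, pad = 1)
patches <- n
patches     # 30
```

Under these assumptions, each of the 30x30 outputs judges one patch of the 256x256 input.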

Because the discriminator works on small image patches, it only cares about local structure and, consequently, enforces correctness in the high frequencies only. Correctness in the low frequencies is taken care of by an additional L1 component in the generator loss that operates over the whole image (as we’ll see below).

Losses and optimizer

As we said in the introduction, the idea of a GAN is to have the network learn the cost function. More concretely, the thing it should learn is the balance between two losses, the generator loss and the discriminator loss. Each of them individually, of course, has to be provided with a loss function, so there are still decisions to be made.

For the generator, two things factor into the loss: First, does the discriminator debunk my creations as fake? Second, how big is the absolute deviation of the generated image from the target? The latter factor does not have to be present in a conditional GAN, but was included by the authors to further encourage proximity to the target, and empirically found to deliver better results.
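
In plain R, the two components could be sketched like this. The sketch makes several assumptions: the discriminator is taken to return raw logits (one per patch), the adversarial part is a sigmoid cross-entropy against all-ones labels, and the L1 part is weighted by lambda = 100, the weight reported in the paper. The function names sigmoid_xent and generator_loss are hypothetical.

```r
# Numerically stable sigmoid cross-entropy on raw logits, averaged over
# all elements: max(x, 0) - x * z + log(1 + exp(-|x|)).
sigmoid_xent <- function(labels, logits) {
  mean(pmax(logits, 0) - logits * labels + log1p(exp(-abs(logits))))
}

# Generator loss = adversarial part + lambda * mean absolute error.
# lambda = 100 is an assumption (the weight reported in the paper).
generator_loss <- function(disc_fake_logits, gen_output, target, lambda = 100) {
  # The generator wants its fakes to be judged as real (label 1).
  gan_loss <- sigmoid_xent(rep(1, length(disc_fake_logits)), disc_fake_logits)
  l1_loss  <- mean(abs(target - gen_output))
  gan_loss + lambda * l1_loss
}

target <- rep(0.3, 10)

# Discriminator fooled (high logits) vs. fraud discovered (low logits)
fooled     <- generator_loss(rep(5, 900),  gen_output = target, target = target)
discovered <- generator_loss(rep(-5, 900), gen_output = target, target = target)
fooled < discovered  # TRUE

# Same discriminator verdict, but the output deviates from the target
off_target <- generator_loss(rep(5, 900), gen_output = rep(0.2, 10), target = target)
off_target > fooled  # TRUE
```

With a weight this large, even small average deviations from the target dominate the adversarial term.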

The discriminator loss looks like the one in a standard (un-conditional) GAN. Its first component is determined by how accurately it classifies real images as real, while the second depends on its competence in judging fake images as fake.
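
As a plain-R sketch, assuming again that the discriminator returns raw logits and that both components are sigmoid cross-entropies (against all-ones labels for real images, all-zeros labels for fakes); sigmoid_xent and discriminator_loss are hypothetical names:

```r
# Numerically stable sigmoid cross-entropy on raw logits, averaged over
# all elements: max(x, 0) - x * z + log(1 + exp(-|x|)).
sigmoid_xent <- function(labels, logits) {
  mean(pmax(logits, 0) - logits * labels + log1p(exp(-abs(logits))))
}

# Discriminator loss: real images should get label 1, generated ones label 0.
discriminator_loss <- function(real_logits, fake_logits) {
  real_loss <- sigmoid_xent(rep(1, length(real_logits)), real_logits)
  fake_loss <- sigmoid_xent(rep(0, length(fake_logits)), fake_logits)
  real_loss + fake_loss
}

# A discriminator that gets it right vs. one that gets it wrong
good <- discriminator_loss(real_logits = rep(4, 900),  fake_logits = rep(-4, 900))
bad  <- discriminator_loss(real_logits = rep(-4, 900), fake_logits = rep(4, 900))
good < bad  # TRUE

# A completely undecided discriminator (all logits 0) scores 2 * log(2)
undecided <- discriminator_loss(rep(0, 900), rep(0, 900))
```

The undecided case is a handy reference point when watching the loss during training.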

Training is a loop over epochs with an inner loop over batches yielded by the dataset. As usual with eager execution, tf$GradientTape takes care of recording the forward pass and determining the gradients, while the optimizer - there are two of them in this setup - adjusts the networks’ weights.

Every tenth epoch, we save the weights, and tell the generator to have a go at the first example of the test set, so we can monitor network progress. See generate_images in the companion code for this functionality.

The results

Here’s a pretty typical result from the test set. It doesn’t look so bad.

Here’s another one. Interestingly, the colors used in the fake image match the previous one’s pretty well, even though we used an additional L1 loss to penalize deviations from the original.

This pick from the test set again shows similar hues, and it might already convey an impression one gets when going through the complete test set: The network has not just learned some balance between creatively turning a coarse mask into a detailed image on the one hand, and reproducing a concrete example on the other hand. It also has internalized the main architectural style present in the dataset.

For an extreme example, take this. The mask leaves an enormous amount of freedom, while the target image is a pretty atypical (perhaps the most atypical) pick from the test set. The outcome is a structure that could represent a building, or part of a building, of specific texture and color shades.

Conclusion

When we say the network has internalized the dominant style of the training set, is this a bad thing? (We’re used to thinking in terms of overfitting on the training set.)

With GANs though, one could say it all depends on the purpose. If it doesn’t fit our purpose, one thing we could try is training on several datasets at the same time.

Again depending on what we want to achieve, another weakness could be the lack of stochasticity in the model, as stated by the authors of the paper themselves. This will be hard to avoid when working with paired datasets like the ones used in pix2pix. An interesting alternative is CycleGAN (Zhu et al. 2017), which lets you transfer style between complete datasets without using paired instances:

Finally, closing on a more technical note, you may have noticed the prominent checkerboard effects in the above fake examples. This phenomenon (and ways to address it) is superbly explained in a 2016 article on distill.pub (Odena, Dumoulin, and Olah 2016). In our case, it will mostly be due to the use of layer_conv_2d_transpose for upsampling.

As per the authors (Odena, Dumoulin, and Olah 2016), a better alternative is upsizing followed by padding and (standard) convolution. If you’re interested, it should be straightforward to modify the example code to use tf$image$resize_images (using ResizeMethod.NEAREST_NEIGHBOR as recommended by the authors), tf$pad and layer_conv_2d.
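
To get a feeling for both halves of that argument, here is a small plain-R sketch (not the companion code; both helper names are hypothetical). The first function counts, in one dimension, how many input positions write to each output position of a transposed convolution; uneven counts are one source of the checkerboard patterns discussed in the article. The second shows nearest-neighbor upsizing, where every input pixel is simply repeated, so that a standard convolution applied afterwards sees evenly covered input.

```r
# How many input positions contribute to each output position of a
# 1-D transposed convolution (no cropping of the borders).
overlap <- function(n, kernel, stride) {
  out <- numeric((n - 1) * stride + kernel)
  for (i in seq_len(n)) {
    idx <- (i - 1) * stride + seq_len(kernel)
    out[idx] <- out[idx] + 1
  }
  out
}

overlap(4, kernel = 3, stride = 2)  # 1 1 2 1 2 1 2 1 1 (uneven in the interior)
overlap(4, kernel = 4, stride = 2)  # 1 1 2 2 2 2 2 2 1 1 (even in the interior)

# Nearest-neighbor upsizing: repeat every row and column `factor` times.
upsample_nearest <- function(m, factor = 2) {
  m[rep(seq_len(nrow(m)), each = factor),
    rep(seq_len(ncol(m)), each = factor), drop = FALSE]
}

m <- matrix(1:4, nrow = 2)
up <- upsample_nearest(m)
dim(up)  # 4 4
```

Note that even coverage, as in the second overlap call, does not by itself guarantee artifact-free output; the sketch only illustrates the counting argument.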

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".