TextureGAN: Controlling Deep Image Synthesis with Texture Patches

Abstract

In this paper, we investigate deep image synthesis guided by sketch, color, and texture. Previous image synthesis methods can be controlled by sketch and color strokes but we are the first to examine texture control. We allow a user to place a texture patch on a sketch at arbitrary locations and scales to control the desired output texture. Our generative network learns to synthesize objects consistent with these texture suggestions. To achieve this, we develop a local texture loss in addition to adversarial and content loss to train the generative network. We conduct experiments using sketches generated from real images and textures sampled from a separate texture database, and results show that our proposed algorithm is able to generate plausible images that are faithful to user controls. Ablation studies show that our proposed pipeline can generate more realistic images than adapting existing methods directly.

One of the “Grand Challenges” of computer graphics is to allow anyone to author realistic visual content. The traditional 3d rendering pipeline can produce astonishing and realistic imagery, but only in the hands of talented and trained artists. The idea of short-circuiting the traditional 3d modeling and rendering pipeline dates back at least 20 years to image-based rendering techniques [33]. These techniques and later “image-based” graphics approaches focus on re-using image content from a database of training images [22]. For a limited range of image synthesis and editing scenarios, these non-parametric techniques allow non-experts to author photorealistic imagery.

In the last two years, the idea of direct image synthesis without using the traditional rendering pipeline has attracted significant interest because of promising results from deep network architectures such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). However, there has been little investigation of fine-grained texture control in deep image synthesis (as opposed to coarse texture control through “style transfer” methods).

In this paper we introduce TextureGAN, the first deep image synthesis method which allows users to control object texture. Users “drag” one or more example textures onto sketched objects in a scene and the network realistically applies these textures to the indicated objects.

This “texture fill” operation is difficult for a deep network to learn for several reasons: (1) Existing deep networks aren’t particularly good at synthesizing high-resolution texture details even without user constraints. Typical results from recent deep image synthesis methods are at low resolution (e.g. 64x64) where texture is not prominent or they are higher resolution but relatively flat (e.g. birds with sharp boundaries but few fine-scale details). (2) For TextureGAN, the network must learn to propagate textures to the relevant object boundaries – it is undesirable to leave an object partially textured or to have the texture spill into the background. To accomplish this, our network must implicitly segment the sketched objects and perform texture synthesis, tasks which are individually difficult. (3) The network should additionally learn to foreshorten textures as they wrap around 3d object shapes, to shade textures according to ambient occlusion and lighting direction, and to understand that some object parts (handbag clasps) are not to be textured but should occlude the texture. These texture manipulation steps go beyond traditional texture synthesis in which a texture is assumed to be stationary. To accomplish these steps the network needs a rich implicit model of the visual world that involves some partial 3d understanding.

Fortunately, the difficulty of this task is somewhat balanced by the availability of training data. Like recent unsupervised learning methods based on colorization [47, 23], training pairs can be generated from unannotated images. In our case, input training sketches and texture suggestions are automatically extracted from real photographs which in turn serve as the ground truth for initial training. We introduce local texture loss to further fine-tune our networks to handle diverse textures unseen on ground truth objects.

We make the following contributions:

We are the first to demonstrate the plausibility of fine-grained texture control in deep image synthesis. In concert with sketched object boundaries, this allows non-experts to author realistic visual content. Our network is feed-forward and thus can run interactively as users modify sketch or texture suggestions.

We explore novel losses for training deep image synthesis. In particular we formulate a local texture loss which encourages the generative network to handle new textures never seen on existing objects.

Image Synthesis.
Synthesizing natural images has been one of the most intriguing and challenging tasks in graphics, vision and machine learning research. Most of the existing approaches fall into non-parametric and parametric types. On one hand, non-parametric approaches have a long-standing history. They are typically data-driven or example-based, i.e., directly exploit and borrow existing image pixels for the target tasks [1, 3, 6, 13, 33]. Therefore, non-parametric approaches often excel at generating realistic results while having limited generalization ability, i.e., being restricted by the limitation of data and examples, e.g., data bias and long-tail distribution. On the other hand, parametric approaches, especially deep learning based approaches, have achieved promising results in recent years. Different from non-parametric methods, these approaches utilize image datasets as training data to fit deep parametric models, and have shown superior modeling power and generalization ability in image synthesis [11, 21], e.g., hallucinating diverse and relatively realistic images that are different from training data.

Generative Adversarial Networks (GANs) [11] are a type of parametric method that has been widely applied and studied for image synthesis. The main idea is to train paired generator and discriminator networks jointly, where the goal of the discriminator is to classify between ‘real’ images and generated ‘fake’ images, and the generator aims to fool the discriminator so that the generated images are indistinguishable from real images. Once trained, the generator can be used to synthesize images driven by a compact vector of noise. Compared to the blurry and low-resolution outcome from other deep learning methods [21, 4], GAN-based methods [35, 32, 17, 49] generate more realistic results with richer local details and of higher resolution.

It is worth highlighting several recent and concurrent works on sketch or color-constrained deep image synthesis. Scribbler [37] demonstrates an image synthesis framework that takes as input user sketches and short color strokes, and generates realistic-looking output that follows the input sketch and has colorization schemes consistent with the color strokes. A similar system is employed for automatically painting cartoon images [29]. Recently, a user-guided interactive image colorization system was proposed in [48], offering users control of color when coloring or recoloring an input image. Distinct from these works, our system simultaneously supports richer user guidance signals including structural sketches, color patches and texture swatches. Moreover, we offer studies on the effect of several variants of improved loss functions and show synthesized results of various object categories.

Texture Synthesis and Style Transfer.
Texture synthesis and style transfer are two closely related topics in image synthesis. Given an input texture image, texture synthesis aims at generating new images with visually similar textures. Style transfer has two inputs – content and style images – and aims to synthesize images with the layout and structure of the content image and the texture of the style image. Non-parametric texture synthesis and style transfer methods typically re-sample given example images to form the output [6, 5, 40, 14]. TextureShop [7] is similar to our method in that it aims to texture an object with a user-provided texture, although TextureShop used non-parametric texture synthesis and shape-from-shading to foreshorten the texture so that it appears to follow the object surface.

Recently, a new deep style transfer method [8, 9] demonstrated that the correlations (i.e., Gram matrix) between features extracted from a pre-trained deep neural network capture the characteristics of textures well and showed promising results in synthesizing textures and transferring styles. In [8, 9], texture synthesis and style transfer are formalized as an optimization problem, where an output image is generated by minimizing a loss function of two terms, one of which measures content similarity between the input content image and the output, and the other measures style similarity between the input style and the output using the Gram matrix. Shortly after, many works improved on [8, 9] in terms of generalization [46, 15, 26], efficiency [39, 19] and controllability [10].

Recently, several texture synthesis methods have used GANs to improve the quality of the generated results. [25] uses adversarial training to discriminate between real and fake textures based on a feature patch from the VGG network. Instead of operating on feature space, [18, 2] apply adversarial training at the pixel level to encourage the generated results to be indistinguishable from real texture. Our proposed texture discriminator in section 3.2.1 differs from prior work by comparing a pair of patches from generated and ground truth textures instead of using a single texture patch. Intuitively, our discriminator is tasked with the fine-grained question of “is this the same texture?” rather than the more general “is this a valid texture?”. Fooling such a discriminator is more difficult and requires our generator to synthesize not just realistic texture but also texture that is faithful to various input texture styles.

Similar to texture synthesis, image completion or inpainting methods have also shown promising results using GANs. Our task has similarities to the image completion problem, which attempts to fill in missing regions of an image, although our missing area is significantly larger and partially constrained by sketch, color, or texture. Similar to our approach, [43] computes texture loss between patches to encourage the inpainted region to be faithful to the original image regions. However, their texture loss only accounts for similarity in feature space. Our approach is similar in spirit to [16] which proposes using both global and local discriminators to ensure that results are both realistic and consistent with the image context, whereas our texture discriminator is instead checking consistency between input texture patch and output image.

We seek an image synthesis pipeline that can generate natural images based on an input sketch and some number of user-provided texture patches.
Users can provide rough sketches that outline the desired objects to control the generation of semantic content, e.g. object type and shape, but sketches do not contain enough information to guide the generation of texture details, materials, and patterns.
To guide the generation of fine-scale details, we want users to somehow control texture properties of objects and scene elements.

Towards this goal, we introduce TextureGAN, a conditional generative network that learns to generate realistic images from input sketches with overlaid textures.
We argue that instead of providing just an unanchored texture sample, users can more precisely control the generated appearance by directly placing small texture patches over the sketch, since locations and sizes of patches provide important information that influence the visual appearance.
In this setup, the user can ‘drag’ rectangular texture patches of arbitrary sizes into different sketch regions as additional input to the network. For example, the user can specify a striped texture patch for a shirt and a dotted texture patch for a skirt. The input patches guide the network to propagate the texture information to the relevant regions respecting semantic boundaries (e.g. dots should appear on the skirt but not on the legs).

A major challenge for a network learning this task is the uncertain pixel correspondence between the input texture and the unconstrained sketch regions. To encourage the network to produce realistic textures, we propose a patch-based texture loss (Section 3.2.1) based on a texture discriminator and a Gram matrix style loss. This not only helps the generated texture follow the input faithfully, but also helps the network learn to propagate the texture patch and synthesize new texture.

TextureGAN also allows users to more precisely control the colors in the generated result. One limitation of previous color control with GANs [37] is that the input color constraints in the form of RGB need to fight with the network’s understanding about the semantics, e.g., bags are mostly black and shoes are seldom green.
To address this problem, we train the network to generate images in the Lab color space. We convert the groundtruth images to Lab, enforce the content, texture and adversarial losses only on the L channel, and enforce a separate color loss on the ab channels.
We show that combining the controls in this way allows the network to generate realistic photos closely following the user’s color and texture intent without introducing obvious visual artifacts.
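To make the channel split concrete, the following is a minimal NumPy sketch of converting an sRGB image to Lab and separating the L channel (where the content, texture and adversarial losses act) from the ab channels (where the color loss acts). The conversion uses the standard sRGB/D65 formulas; the helper names `rgb_to_lab` and `split_lab` are illustrative and not taken from our implementation.

```python
import numpy as np

# sRGB (D65) to CIE XYZ matrix and reference white: standard colorimetry constants.
_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
_WHITE_D65 = np.array([0.95047, 1.0, 1.08883])


def rgb_to_lab(rgb):
    """Convert sRGB values in [0, 1], shape (..., 3), to CIE Lab."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Undo the sRGB gamma curve to get linear RGB.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    xyz = linear @ _RGB_TO_XYZ.T
    t = xyz / _WHITE_D65
    delta = 6.0 / 29.0
    f = np.where(t > delta ** 3, np.cbrt(t), t / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)


def split_lab(rgb_image):
    """Return (L, ab): the lightness channel and the two color channels."""
    lab = rgb_to_lab(rgb_image)
    return lab[..., 0], lab[..., 1:]


image = np.random.default_rng(0).random((4, 4, 3))  # toy stand-in for a photo
l_channel, ab_channels = split_lab(image)
```

Here `l_channel` would feed the structure and texture losses, while `ab_channels` would be compared against the ground-truth ab channels by the color loss.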

Figure 1: TextureGAN pipeline. A feed-forward generative network is trained end-to-end to directly transform a 5-channel input to a high-res photo with realistic texture details. The red arrows indicate losses that are active for “texture fine tuning”. See text for detailed description of various losses.

Figure 1 shows our training pipeline. We use the network architecture proposed in Scribbler [37] with additional skip connections. Details of our network architecture are included in the supplementary material.
We use a 5-channel image as input to the network. The channels support three different types of controls – one channel for sketch, two channels for texture (one intensity and one binary location mask), and two channels for color (including but not limited to texture color).
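As an illustration of this input layout, the sketch below stacks the five control channels into a single array. The channel ordering, the array shapes, and the function name `build_input` are assumptions made for the example; only the channel roles come from the description above.

```python
import numpy as np


def build_input(sketch, texture_l, texture_mask, color_ab):
    """Stack the five control channels into one (5, H, W) array.

    sketch:       (H, W) binary sketch
    texture_l:    (H, W) texture intensity, zero outside the patch
    texture_mask: (H, W) binary mask marking where the patch sits
    color_ab:     (2, H, W) color control channels
    The ordering of channels here is an assumption of this sketch.
    """
    stacked = np.concatenate(
        [sketch[None], texture_l[None], texture_mask[None], color_ab], axis=0
    )
    return stacked.astype(np.float32)


H = W = 128
sketch = np.zeros((H, W))
sketch[32, 32:96] = 1.0                # a single toy stroke
texture_l = np.zeros((H, W))
texture_mask = np.zeros((H, W))
texture_l[40:60, 40:60] = 0.5          # hypothetical patch intensity
texture_mask[40:60, 40:60] = 1.0       # spatial support of the patch
color_ab = np.zeros((2, H, W))         # no color hint in this toy example
x = build_input(sketch, texture_l, texture_mask, color_ab)
```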
Section 4.2 describes the method we used to generate each input channel of the network.

We first train TextureGAN to reproduce ground-truth shoe, handbag, and clothes photos given synthetically generated input control channels.
We then generalize TextureGAN to support a broader range of textures and to propagate unseen textures better by fine-tuning the network with a separate texture-only database.

3.1 Ground-truth Pre-training

We aim to propagate the texture information contained in small patches to fill in an entire object.
As in Scribbler [37], we use feature and adversarial losses to encourage the generation of realistic object structures. However, we find that these losses alone cannot reproduce fine-grained texture details.
Also, Scribbler uses pixel loss to enforce color constraints, but fails when the input color is rare for that particular object category.
Therefore, we redefine the feature and adversarial losses and introduce new losses to improve the replication of texture details and encourage precise propagation of colors.
For initial training, we derive the network’s input channels from ground-truth photos of objects. When computing the losses, we compare the generated images with the ground-truth.
Our objective function consists of multiple terms, each of which encourages the network to focus on different image aspects.

Feature Loss LF.
It has been shown previously that the features extracted from middle layers of a pre-trained neural network, VGG-19 [38], represent high-level semantic information of an image [12, 37]. Given a rough outline sketch, we would like the generated image to loosely follow the object structures specified by the sketch. Therefore, we use a deeper layer of VGG-19 for the feature loss (relu4_2). To focus the feature loss on generating structures, we convert both the ground-truth image and the generated image from RGB color space to Lab and generate grayscale images by repeating the L channel values. We then feed the grayscale images to VGG-19 to extract features. The feature loss is defined as the L2 difference in the feature space. During back propagation, the gradients passing through the L channel of the output image are averaged from the three channels of the VGG-19 input.
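The computation can be sketched as follows. Because loading VGG-19 is outside the scope of a short example, `toy_extract` is a hypothetical stand-in for the relu4_2 feature extractor (it only average-pools), and the mean squared difference is one possible reading of "L2 difference".

```python
import numpy as np


def to_gray3(l_channel):
    """Repeat an (H, W) L channel into a 3-channel grayscale image (3, H, W)."""
    return np.repeat(l_channel[None], 3, axis=0)


def toy_extract(img):
    """Hypothetical stand-in for VGG-19 relu4_2: 4x4 average pooling."""
    c, h, w = img.shape
    return img.reshape(c, h // 4, 4, w // 4, 4).mean(axis=(2, 4))


def feature_loss(gen_l, gt_l, extract=toy_extract):
    """Mean squared difference between features of the two grayscale images."""
    f_gen = extract(to_gray3(gen_l))
    f_gt = extract(to_gray3(gt_l))
    return float(np.mean((f_gen - f_gt) ** 2))


generated_l = np.ones((8, 8))
ground_truth_l = np.zeros((8, 8))
loss = feature_loss(generated_l, ground_truth_l)
```

In practice `extract` would be replaced by a frozen, pre-trained network; the rest of the structure stays the same.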

Adversarial Loss LADV.
Generative adversarial networks (GANs) have been shown to generate realistic images from random noise seeds [11].
In GANs, a generator network and a discriminator network are trained simultaneously in a minimax game. The discriminator tries to distinguish generated images from real photos while the generator tries to generate realistic images tricking the discriminator into thinking they are real. By alternating the optimization of the generator and the discriminator until convergence, the generator would ideally generate images indistinguishable from real photos.

In recent work, the concept of adversarial training has also been adopted in the context of image to image translation. In particular, one can attach a trainable discriminator network at the end of the image translation network and use it to constrain the generated result to lie on the training image manifold. Previous work proposed to minimize the adversarial loss (loss from the discriminator network) together with other standard losses (pixel, feature losses, etc). The exact choice of losses depends on the different applications [37, 17, 12].
Our work follows the same line. We use adversarial loss on top of feature, texture and color losses. The adversarial loss pushes the network towards synthesizing sharp and realistic images, but at the same time constrains the generated images to choose among typical colors in the training images. The network’s understanding about color sometimes conflicts with user’s color constraints, e.g. a user provides a rainbow color constraint for a handbag, but the adversarial network thinks it looks fake and discourages the generator from producing such output.
Therefore, we propose applying the adversarial loss LADV only to the grayscale image (the L channel in Lab space). The discriminator is trained to disregard color and to focus on whether details are sharp and realistic. The gradients of the loss only flow through the L channel of the generator output. This effectively reduces the search space and makes GAN training easier and more stable.
We perform the adversarial training using the techniques proposed in DCGAN [35] with the modification proposed in LSGAN [32]. LSGAN proposed replacing the cross entropy loss in the original GAN with a least squares loss for higher quality results and more stable training.
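For reference, a least squares GAN objective with target labels 0 (fake) and 1 (real) can be sketched in a few lines. This is the commonly used form of the LSGAN losses, written over plain lists of discriminator scores; the 0.5 scaling and the function names are choices of this example rather than details of our training code.

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """Push scores of real images toward 1 and scores of generated images toward 0."""
    real_term = _mean((d - 1.0) ** 2 for d in d_real)
    fake_term = _mean(d ** 2 for d in d_fake)
    return 0.5 * (real_term + fake_term)


def generator_loss(d_fake):
    """Push discriminator scores of generated images toward 1."""
    return 0.5 * _mean((d - 1.0) ** 2 for d in d_fake)


# A perfect discriminator has zero loss; the generator is then maximally penalized.
perfect_d = discriminator_loss([1.0, 1.0], [0.0, 0.0])
fooled_g = generator_loss([1.0, 1.0])
```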

Style Loss LS.
In addition to generating the right content following the input sketch, we would also like to propagate the texture details given in the input texture patch. The previous feature and adversarial losses sometimes struggle to capture fine-scale details, since they focus on getting the overall structure correct.
Similar to deep learning based texture synthesis and style transfer work [8, 9], we use style loss to specifically encourage the reproduction of texture details, but we apply style loss on the L channel only. We adopt the idea of matching the Gram matrices (feature correlations) of the features extracted from certain layers of the pretrained classification network (VGG-19). The Gram matrix $G^l_{ij} \in \mathbb{R}^{N_l \times N_l}$ is defined as:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk} \qquad (1)$$

where $N_l$ is the number of feature maps at network layer $l$ and $F^l_{ik}$ is the activation of the $i$th filter at position $k$ in layer $l$.
We use two layers of the VGG-19 network (relu3_2, relu4_2) to define our style loss.
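A minimal NumPy sketch of the Gram matrix and a style loss built on it is given below. The feature maps are random arrays standing in for VGG-19 activations, and comparing Gram matrices with a mean squared difference summed over layers is an assumption of this example; no normalization is applied to the Gram matrix, matching the definition above.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of an (N_l, H, W) feature map: G[i, j] = sum_k F[i, k] * F[j, k]."""
    n_l = features.shape[0]
    flat = features.reshape(n_l, -1)   # each row holds one filter over all positions k
    return flat @ flat.T


def style_loss(gen_feats, ref_feats):
    """Sum over layers of the mean squared difference between Gram matrices."""
    total = 0.0
    for g, r in zip(gen_feats, ref_feats):
        total += np.mean((gram_matrix(g) - gram_matrix(r)) ** 2)
    return float(total)


rng = np.random.default_rng(0)
# Two toy "layers" with different numbers of feature maps.
feats_a = [rng.standard_normal((4, 6, 6)), rng.standard_normal((8, 3, 3))]
feats_b = [rng.standard_normal((4, 6, 6)), rng.standard_normal((8, 3, 3))]
G = gram_matrix(feats_a[0])
loss_ab = style_loss(feats_a, feats_b)
```

Because the sum over positions $k$ discards spatial layout, two images with the same texture statistics but different structure can have similar Gram matrices, which is what makes this loss suitable for texture rather than content.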

Pixel Loss LP.
We find that adding relatively weak L2 pixel loss on the L channel stabilizes the training and leads to the generation of texture details that are more faithful to the user’s input texture patch.

Color Loss LC.
All losses above are applied only on the L channel of the output to focus on generating sketch-conforming structures, realistic shading, and sharp high-frequency texture details. To enforce the user’s color constraints, we add a separate color loss that penalizes the L2 difference between the ab channels of the generated result and that of the ground-truth.

3.2 External Texture Fine-tuning

One problem of training with “ground-truth” images is that it is hard for the network to focus on reproducing low-level texture details due to the difficulty of disentangling the texture from the content within the same image. For example, we do not necessarily have training examples of the same object with different textures applied which might help the network learn the factorization between structure and texture.
Also, the Gram matrix-based style loss can be dominated by the feature loss since both are optimized for the same image. There is not much room for the network to be creative in hallucinating low-level texture details, since it tends to focus on generating high-level structure, color, and patterns.
Finally, many of the ground-truth texture patches contain smooth color gradients without rich details. Trained solely on those, the network is likely to ignore “hints” from an unseen input texture patch at test time, especially if the texture hint conflicts with information from the sketch.
As a result, the network often struggles to propagate high-frequency texture details in the results especially for textures that are rarely seen during training.

To train the network to propagate a broader range of textures, we fine-tune our network to reproduce and propagate textures for which we have no ground truth output. To do this, we introduce a new patch-based, local texture loss and adapt our existing losses to encourage faithfulness to a texture rather than faithfulness to a ground truth output object photo.
We use all the losses introduced in the previous sections except the global style loss LS. We keep the feature and adversarial losses, LF and LADV, unchanged, but modify the pixel and color losses, L′P and L′C, to compare the generated result with the entire input texture from which input texture patches are extracted. To prevent color and texture bleeding, the losses are applied only on the foreground object, as approximated by a segmentation mask (Section 4.1).

3.2.1 Patch-based Texture Loss

To encourage better propagation of texture, we propose a patch-based texture loss Lt, that is only applied to small local regions of the output image.
For handbags and shoes, we utilize foreground background separation (section 4.1) to locate the foreground object and apply the texture loss within the foreground.
For clothes, we use the detailed segmentation of clothes items [24, 28] to apply texture loss within the semantic region where the input patch is placed.
The texture loss Lt is composed of three terms:

$$L_t = L_s + w_p L_p + w_{adv} L_{adv} \qquad (3)$$

Local Discriminator Loss Ladv. We introduce a patch-based adversarial loss that decides whether a pair of texture patches have the same texture. We train the discriminator to recognize a pair of cropped patches from the same texture as a positive example, and a pair of patches from different textures (one from the input texture and one from the generated result) as a negative example.
We define $L_{adv}$ as follows:

$$L_{adv} = -\sum_i \left( D_{txt}(G(x_i), I_t) - 1 \right)^2$$

We use 0 to indicate fake and 1 to indicate real.

Local Style Loss Ls and Pixel Loss Lp.
To strengthen the texture propagation, we also use Gram matrix-based style loss and L2 pixel loss on the cropped patches.

We randomly sample two patches of size 50x50 from the generated result and the input texture (from the separate texture database), compute texture loss on the L channel of the patches and average them.
We propagate the gradients of the texture loss only through the corresponding patch region in the output.
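The patch sampling and the weighted combination of Equation (3) can be sketched as follows. The example computes only the local pixel term on raw L-channel patches; the style and discriminator terms are passed in as plain numbers. Drawing one patch from each image per pair and averaging over two pairs is our reading of the procedure for this sketch, and all function names are illustrative.

```python
import numpy as np

PATCH = 50  # patch side length in pixels


def random_patch(img, rng, size=PATCH):
    """Crop a random size x size patch from an (H, W) array."""
    h, w = img.shape
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    return img[y:y + size, x:x + size]


def local_pixel_loss(gen_l, tex_l, rng, n_pairs=2):
    """Average L2 pixel loss over pairs of (generated, texture) patches."""
    losses = []
    for _ in range(n_pairs):
        g = random_patch(gen_l, rng)
        t = random_patch(tex_l, rng)
        losses.append(np.mean((g - t) ** 2))
    return float(np.mean(losses))


def texture_loss(l_s, l_p, l_adv, w_p, w_adv):
    """Equation (3): L_t = L_s + w_p * L_p + w_adv * L_adv."""
    return l_s + w_p * l_p + w_adv * l_adv


rng = np.random.default_rng(0)
generated_l = np.full((128, 128), 0.3)   # toy generated L channel
texture_l = np.full((256, 256), 0.3)     # toy external texture, L channel
l_p = local_pixel_loss(generated_l, texture_l, rng)
```

In a full implementation the generated patch would additionally be restricted to the foreground mask, and gradients would flow only through the sampled patch region.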

While performing the texture fine-tuning, the network is trying to adapt itself to understand and propagate new types of textures, and might ‘forget’ what it learnt from the ground-truth pretraining stage. Therefore, when training on external textures, we mix in iterations of ground-truth training fifty percent of the time.
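One simple way to realize this mixing is to pick the training mode independently at random for each iteration. Whether iterations are chosen at random or strictly alternated is an assumption of this sketch, as are the mode names.

```python
import random


def pick_mode(rng, p_ground_truth=0.5):
    """Choose the data source for one fine-tuning iteration."""
    if rng.random() < p_ground_truth:
        return "ground_truth"
    return "external_texture"


rng = random.Random(0)
modes = [pick_mode(rng) for _ in range(1000)]
n_ground_truth = modes.count("ground_truth")
```

Over many iterations roughly half of the updates then use ground-truth pairs, which counteracts forgetting of the pre-trained behavior.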

We train TextureGAN on three object-centric datasets – handbags [49], shoes [45] and clothes [27, 28, 30, 31].
Each photo collection contains large variations of colors, materials, and patterns. These domains are also chosen so that we can demonstrate plausible product design applications.
For supervised training, we need to generate (input, output) image pairs. For the output of the network, we convert the ground-truth photos to Lab color space. For the input to the network, we process the ground-truth photos to extract 5-channel images. The five channels include one channel for the binary sketch, two channels for the texture (intensities and binary location masks), and two channels for the color controls.

In this section, we describe how we obtain segmentation masks used during training, how we generate each of the input channels for the ground-truth pre-training, and how we utilize the separate texture database for the network fine-tuning. We also provide detailed training procedures and parameters.

4.1 Segmentation Mask

For our local texture loss, we hope to encourage samples of output texture to match samples of input texture. But the output texture is localized to particular image regions (e.g. the interior of objects) so we wouldn’t want to compare a background patch to an input texture. Therefore we only sample patches which fall inside an estimated foreground segmentation mask. Our handbag and shoe datasets are product images with consistent, white backgrounds so we simply set the white pixels as background pixels. For clothes, the segmentation mask is already given in the dataset. With the clothes segmentation mask, we process the ground-truth photos to white out the background. Note that segmentation masks are not used at test time.
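For product images on white backgrounds, the mask estimate and the patch test can be sketched as below. The near-white threshold of 0.98 and the rule that a patch must lie entirely inside the mask are assumptions of this example.

```python
import numpy as np


def foreground_mask(rgb, white_thresh=0.98):
    """Mark a pixel as foreground unless all three channels are near white.

    rgb is an (H, W, 3) array with values in [0, 1].
    """
    return ~np.all(rgb >= white_thresh, axis=-1)


def patch_inside(mask, y, x, size):
    """True if the size x size patch at (y, x) lies fully inside the mask."""
    h, w = mask.shape
    if y < 0 or x < 0 or y + size > h or x + size > w:
        return False
    return bool(mask[y:y + size, x:x + size].all())


image = np.ones((64, 64, 3))      # white background
image[16:48, 16:48] = 0.4         # gray square standing in for the object
mask = foreground_mask(image)
```

Candidate patch locations can then be drawn at random and rejected until `patch_inside` returns true.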

4.2 Data Generation for Pre-training

Sketch Generation.
For handbags and shoes, we generate sketches using the deep edge detection method used in pix2pix [42, 17].
For clothes, we leverage the clothes parsing information provided in the dataset [27, 28]. We apply Canny edge detection on the clothing segmentation mask to extract the segment boundaries and treat them as a sketch. We also apply xDoG [41] on the clothes image to obtain more variation in the training sketches. Finally, we mix in additional synthetic sketches generated using the methods proposed in Scribbler [37].

Texture Patches.
To generate input texture constraints, we randomly crop small regions within the foreground objects of the ground-truth images.
We randomly choose the patch location from within the segmentation and randomize the patch size.
We convert each texture patch to the Lab color space and normalize the pixels to fall into 0-1 range.
For each image, we randomly generate one or two texture patches. For clothes, we extract texture patches from one of the following regions – top, skirt, pant, dress, or bag. We pass a binary mask to the network to indicate the spatial support of the texture.
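The generation of a single texture constraint can be sketched as follows. The patch size range, the number of sampling attempts, the requirement that the whole patch lies inside the foreground, and scaling L by 1/100 to reach the 0-1 range are all assumptions of this example.

```python
import numpy as np


def make_texture_channels(l_channel, fg_mask, rng, min_size=8, max_size=32, tries=100):
    """Return (texture_intensity, location_mask), both (H, W) float32 arrays.

    l_channel holds Lab lightness in [0, 100]; fg_mask is a boolean foreground mask.
    If no valid location is found within `tries` attempts, both outputs stay zero.
    """
    h, w = l_channel.shape
    tex = np.zeros((h, w), dtype=np.float32)
    loc = np.zeros((h, w), dtype=np.float32)
    for _ in range(tries):
        size = int(rng.integers(min_size, max_size + 1))
        y = int(rng.integers(0, h - size + 1))
        x = int(rng.integers(0, w - size + 1))
        if fg_mask[y:y + size, x:x + size].all():
            # Copy the ground-truth texture into the patch and record its support.
            tex[y:y + size, x:x + size] = l_channel[y:y + size, x:x + size] / 100.0
            loc[y:y + size, x:x + size] = 1.0
            break
    return tex, loc


rng = np.random.default_rng(0)
l_channel = np.full((128, 128), 50.0)        # toy lightness image
fg_mask = np.ones((128, 128), dtype=bool)    # toy mask: everything is foreground
tex, loc = make_texture_channels(l_channel, fg_mask, rng)
```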

4.3 Data Generation for Fine-tuning

To encourage diverse and faithful texture reproduction, we fine-tune TextureGAN by applying external texture patches from a leather-like texture dataset. We queried “leather” in Google and manually filtered the results to 130 high resolution leather textures. From this clean dataset, we sampled roughly 50 crops of size 256x256 from each image to generate a dataset of 6,300 leather-like textures. We train our models on leather-like textures since they are commonly seen materials for handbags, shoes and clothes and contain large appearance variations that are challenging for the network to propagate.

4.4 Training Details

For pre-training, we use the following parameters on all datasets: wADV=1, wS=0.1, wP=10, and wC=100. We use the Adam optimizer [20] with learning rate 1e-2.

For fine-tuning, we optimize all the losses at the same time but use different weight settings. wADV=1e4, wS=0, wP=1e2, wC=1e3, ws=10, wp=0.01, and wadv=7e3.
We also decrease the learning rate to 1e-3.
We train most of the models at input resolution of 128x128 except one clothes model at the resolution of 256x256 (Figure 5).
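To show how these weights enter the objective, the sketch below combines the global loss terms into a single scalar for the two training stages. The weight of the feature loss is not stated above and is assumed to be 1 here; the dictionary keys and the function name are illustrative, and the local texture terms of Equation (3) are left out.

```python
# Global loss weights for the two stages, as listed in this section.
PRETRAIN = {"adv": 1.0, "style": 0.1, "pixel": 10.0, "color": 100.0}
FINETUNE = {"adv": 1e4, "style": 0.0, "pixel": 1e2, "color": 1e3}


def total_loss(terms, weights, feature_weight=1.0):
    """Weighted sum of the global loss terms.

    `terms` maps "feature", "adv", "style", "pixel" and "color" to scalar losses.
    feature_weight=1.0 is an assumption of this sketch.
    """
    total = feature_weight * terms["feature"]
    for name, weight in weights.items():
        total += weight * terms[name]
    return total


unit_terms = {"feature": 1.0, "adv": 1.0, "style": 1.0, "pixel": 1.0, "color": 1.0}
pretrain_total = total_loss(unit_terms, PRETRAIN)
finetune_total = total_loss(unit_terms, FINETUNE)
```

With all terms equal, the color term dominates during pre-training and the adversarial term dominates during fine-tuning, which reflects the shift in emphasis between the two stages.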

Figure 2: The effect of texture loss and adversarial loss. a) The network trained using all proposed losses can effectively propagate textures to most of the foreground region; b) Removing adversarial loss leads to blurry results; c) Removing texture loss harms the propagation of textures.

Ablation Study.
Keeping other settings the same, we train networks using different combinations of losses to analyze how they influence the result quality. In Figure 2, given the input sketch, texture patch and color patch (first column), the network trained with the complete objective function (second column) correctly propagates the color and texture to the entire handbag. If we turn off the texture loss (fourth column), the texture details within the area of the input patch are preserved, but difficult textures cannot be fully propagated to the rest of the bag. If we turn off the adversarial loss (third column), texture is synthesized, but that texture is not consistent with the input texture. Our ablation experiment confirms that style loss alone is not sufficient to encourage texture propagation, which motivates our local patch-based texture loss (Section 3.2.1).

Figure 3: Effect of proposed local losses. a) Results from the ground-truth model without any local losses, b) with local pixel loss, c) with local style loss, d) with local texture discriminator loss. With local discriminator loss, the network tends to produce more consistent texture throughout the object.

Figure 4: Results for shoes and handbags on different textures. Odd rows: input sketch and texture patch. Even rows: generated results.

External Texture Fine-tuning Results.
We train TextureGAN on three datasets – shoes, handbags, and clothes – with increasing levels of structure complexity. We notice that for object categories like shoes that contain limited structure variations, the network is able to quickly generate realistic shading and structures and focus its remaining capacity on propagating textures. The texture propagation on the shoes dataset works well even without external texture fine-tuning.
For more sophisticated datasets like handbags and clothes, external texture fine-tuning is critical for the propagation of difficult textures that contain sharp regular structures, such as stripes.

Figure 3 demonstrates how external texture fine-tuning with our proposed texture loss can improve texture consistency and propagation.
The “ground truth” pre-trained model is faithful to the input texture patch in the output only directly under the patch and does not propagate it throughout the foreground region.
By fine-tuning the network with texture examples and enforcing local style loss, local pixel loss, and local texture loss we nudge the network to apply texture consistently across the object.
With local style loss (column c) and local texture discriminator loss (column d), the networks are able to propagate texture better than without fine-tuning (column a) or with just local pixel loss (column b). Using local texture discriminator loss tends to produce results that are visually closer to the input texture than those obtained with style loss.

Figure 5: Applying multiple texture patches on the sketch. Our system can also handle multiple texture inputs and our network can follow sketch contours and expand the texture to cover the sketched object.

Figure 6: Results on human-drawn sketches.

Figure 4 shows the results of applying various texture patches to sketches of handbags and shoes. These results are typical of test-time result quality.

Figure 5 shows results on the clothes dataset trained at a resolution of 256x256. The clothes dataset contains large variations of structures and textures, and each image in the dataset contains multiple semantic regions. Our network can also handle multiple texture patches. As shown in figure 5, we put different texture patches on different parts of the clothes (middle left and bottom left). The network can propagate the textures within semantic regions of the sketch while respecting the sketch boundaries.

Figure 6 shows results on human-drawn handbags. These drawings differ from our synthetically generated training sketches but the results are still high quality.

We have presented an approach for controlling deep image synthesis with input sketch and texture patches. With this system, the user can draw the object structure through sketching and precisely control the generated details with texture patches. TextureGAN is feed-forward which allows users to see the effect of their edits in real time. By training TextureGAN with local texture constraints, we demonstrate its effectiveness on sketch and texture-based image synthesis. TextureGAN also operates in Lab color space, which enables separate controls on color and content. Furthermore, our results on fashion datasets show that our pipeline is able to handle a wide variety of texture inputs and generates texture compositions that follow the sketched contours. In the future, we hope to apply our network on more complex scenes.