First published: 2016/02/08
Abstract: Image-generating machine learning models are typically trained with loss
functions based on distance in the image space. This often leads to
over-smoothed results. We propose a class of loss functions, which we call deep
perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of
computing distances in the image space, we compute distances between image
features extracted by deep neural networks. This metric better reflects
perceptual similarity of images and thus leads to better results. We show
three applications: autoencoder training, a modification of a variational
autoencoder, and inversion of deep convolutional networks. In all cases, the
generated images look sharp and resemble natural images.

This paper proposes a class of loss functions for image generation that are based on distances in feature space:
$$\mathcal{L} = \lambda_{feat}\mathcal{L}_{feat} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{img}\mathcal{L}_{img}$$
### Key Points
- Using only an L2 loss in image space yields over-smoothed results, since it averages over all likely locations of fine details.
- L_feat measures distance in a suitable feature space and therefore preserves the distribution of fine details rather than their exact locations.
- Using only L_feat yields bad results, since feature representations are contractive: many non-natural images are also mapped to the same feature vector.
- Introducing a natural-image prior via a GAN (the adversarial term) ensures that samples lie on the natural image manifold (see the sketch below).
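A minimal PyTorch-style sketch of how the three terms could be combined, assuming frozen, pretrained `comparator` (feature extractor) and `discriminator` modules and hypothetical weightings; this is not the authors' implementation.

```python
# Minimal sketch of the combined DeePSiM objective (assumptions: hypothetical,
# frozen comparator / discriminator modules; not the authors' code).
import torch
import torch.nn.functional as F

def deepsim_loss(generated, target, comparator, discriminator,
                 w_feat=1.0, w_adv=0.01, w_img=1.0):
    # L_feat: distance in the feature space of a fixed comparator network
    # (e.g. a pretrained convnet layer); preserves the distribution of fine
    # details rather than their exact locations.
    loss_feat = F.mse_loss(comparator(generated), comparator(target))
    # L_adv: generator-side adversarial loss from a discriminator acting as a
    # natural-image prior, keeping samples on the natural image manifold.
    d_out = discriminator(generated)
    loss_adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    # L_img: image-space term that stabilizes training.
    loss_img = F.mse_loss(generated, target)
    return w_feat * loss_feat + w_adv * loss_adv + w_img * loss_img
```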
### Model
![](https://i.imgur.com/qNzMwQ6.png)
### Exp
- Training an autoencoder
- Generating images with a VAE
- Inverting deep feature representations
### Thought
I find the experiment section a little hard to follow. However, the proposed loss seems really promising and can be applied to many image generation tasks.
### Questions
- Section 4.2 & 4.3 are hard to follow for me, need to pay more attention in the future

This paper performs activation maximization (AM) with a Deep Generator Network (DGN), which serves as a learned natural-image prior, to synthesize realistic images that are fed into the DNN we want to understand.
By visualizing synthesized images that highly activate particular neurons in the DNN, we can interpret what each neuron in the DNN has learned to detect.
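A minimal sketch of the optimization loop, assuming hypothetical `generator` (the DGN prior) and `target_net` PyTorch modules; the code dimensionality and hyperparameters are placeholders, not the authors' released implementation.

```python
# Minimal DGN-AM sketch: optimize a fully-connected-layer code z so that the
# image G(z) highly activates one unit of the network under inspection.
# (generator, target_net, code_dim, and all hyperparameters are assumptions.)
import torch

def activation_maximization(generator, target_net, unit, code_dim,
                            steps=200, lr=0.1, weight_decay=1e-3):
    z = torch.zeros(1, code_dim, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = generator(z)                   # learned natural-image prior
        act = target_net(img)[0, unit]       # activation of the chosen unit
        loss = -act + weight_decay * z.pow(2).sum()  # maximize activation, keep z small
        loss.backward()
        opt.step()
    with torch.no_grad():
        return generator(z)                  # preferred input for this unit
```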
### Key Points
- The DGN (natural-image prior) generates more coherent images when optimizing fully-connected-layer codes rather than low-level codes, even though previous studies showed that low-level features give better reconstructions because they contain more image detail. The difference is that DGN-AM synthesizes an entire layer code from scratch: low-level features have only small, local receptive fields, so the optimization has to tune each image region independently without knowing the global structure, and the code space at a convolutional layer is much higher-dimensional, making it harder to optimize.
- The learned prior trained on ImageNet can also generalize to Places.
- It doesn't generalize well if the architecture of the encoder used to train the DGN differs from that of the DNN we wish to inspect.
- The learned prior also generalizes to visualizing hidden neurons, producing more realistic textures and colors.
- When visualizing hidden neurons, DGN-AM trained on ImageNet also generalizes to Places and produces results similar to [1].
- The synthesized images are shown to reflect what the neurons of the DNN under inspection prefer, rather than what the prior prefers.
### Model
![](https://cloud.githubusercontent.com/assets/7057863/21002626/b094d7ae-bd61-11e6-8c95-fd4931648426.png)
### Thought
Solid paper with diverse visualizations and thorough analysis.
### Reference
[1] B. Zhou et al., Object Detectors Emerge in Deep Scene CNNs.

This paper develops a semantically rich representation for natural sound, using unlabeled videos as a bridge to
transfer discriminative visual knowledge from well-established visual recognition models into the sound modality.
The learned sound representation yields significant performance improvements on standard benchmarks for the
acoustic scene classification task.
### Key Points
- The natural synchronization between vision and sound can be leveraged as a supervision signal for each other.
- Cross-modal learning can overcome overfitting when the target modality has much less data than the other modalities, which is essential for deep networks to work well.
- In the sound classification task, **pool5** and **conv6** features extracted from SoundNet achieve the best performance.
### Model
- The authors propose a student-teacher training procedure that transfers discriminative visual knowledge from visual recognition models
trained on ImageNet and Places into SoundNet by minimizing the KL divergence between their predictions (see the sketch below).
![](https://cloud.githubusercontent.com/assets/7057863/20856609/05fe12d6-b94e-11e6-8c92-995ee84fe0d7.png)
- Two reasons to use a CNN for sound: (1) invariance to translations; (2) stacked layers can detect higher-level concepts.
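A minimal sketch of the transfer objective, assuming hypothetical `vision_teacher` and `sound_student` modules that output class logits for a paired batch of video frames and raw waveforms; this is not the released SoundNet code.

```python
# Minimal sketch of the student-teacher transfer loss: make the sound network
# mimic the visual network's class posteriors on unlabeled video.
# (vision_teacher / sound_student are hypothetical modules.)
import torch
import torch.nn.functional as F

def transfer_loss(vision_teacher, sound_student, frames, waveform):
    with torch.no_grad():
        # Teacher: class posteriors from a pretrained visual recognition
        # network (ImageNet and/or Places) on frames of the same video.
        teacher_probs = F.softmax(vision_teacher(frames), dim=-1)
    # Student: predictions of the raw-waveform sound network.
    student_log_probs = F.log_softmax(sound_student(waveform), dim=-1)
    # Minimize KL(teacher || student).
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```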
### Exp
- A linear SVM on top of the representation learned by SoundNet outperforms other existing methods by about 10%.
- Using lots of unlabeled videos as the supervision signal is what enables the deeper SoundNet to work; otherwise the 8-layer network
performs poorly due to overfitting.
- Simultaneously using Places and ImageNet as supervision beats using only one of them by about 3%.
- Multi-modal recognition models that use visual and sound data together yield a ~2% gain in classification accuracy.
### Thought
I think this paper is really complete, since it contains good intuition, ablation analysis, representation visualization, hidden-unit visualization, and significant performance improvements.
### Questions
- Although the paper says that "To handle variable-temporal-length of input sound, this model uses a fully convolutional network and produces an output over multiple timesteps in video.", the code seems to fix the length of each excerpt to 5 seconds.
- The data augmentation technique used during training is not clear to me.

The authors propose a meta-learning algorithm that is compatible with any model trained with gradient descent and show that it works in various domains, including supervised learning and reinforcement learning. This is done by explicitly training the model so that a small number of gradient steps with a small amount of training data from a new task produces good generalization performance on that task.
### Key Points
- MAML is effectively finding a good **initialization** of model parameters shared across several tasks.
- A good initialization of parameters means the model can achieve good performance on several tasks with a small number of gradient steps.
### Method
- Simultaneously optimize the **initialization** of model parameters over different meta-training tasks, so that it can quickly adapt to new meta-testing tasks.
![](https://cloud.githubusercontent.com/assets/7057863/25161911/46f2721e-24f1-11e7-9fba-8bc2f0782204.png)
- Training procedure (a minimal sketch follows the figure):
![](https://cloud.githubusercontent.com/assets/7057863/25161749/8d00902a-24f0-11e7-93a8-6a9b74386f55.png)
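The inner update is $\theta'_i = \theta - \alpha\nabla_\theta\mathcal{L}_{T_i}(f_\theta)$ and the meta-objective is $\min_\theta \sum_i \mathcal{L}_{T_i}(f_{\theta'_i})$. Below is a minimal PyTorch sketch on a toy sine-wave regression family; the task sampler, network size, and hyperparameters are assumptions, not the authors' code.

```python
# Minimal MAML sketch on hypothetical sine-wave regression tasks.
import torch

def net(x, params):
    # Tiny two-layer MLP applied functionally, so gradients can flow through
    # the inner-loop parameter update.
    h = torch.relu(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def sample_task():
    # Hypothetical task family: y = a * sin(x + p) with random amplitude/phase.
    a = torch.rand(1) * 4.9 + 0.1
    p = torch.rand(1) * 3.1416
    def sample(n=10):
        x = torch.rand(n, 1) * 10.0 - 5.0
        return x, a * torch.sin(x + p)
    return sample

params = {
    "w1": (torch.randn(1, 40) * 0.1).requires_grad_(),
    "b1": torch.zeros(40, requires_grad=True),
    "w2": (torch.randn(40, 1) * 0.1).requires_grad_(),
    "b2": torch.zeros(1, requires_grad=True),
}
meta_opt = torch.optim.Adam(list(params.values()), lr=1e-3)
inner_lr = 0.01

for step in range(1000):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                        # meta-batch of tasks
        task = sample_task()
        x_s, y_s = task()                     # support set (inner update)
        x_q, y_q = task()                     # query set (outer loss)
        # Inner loop: one gradient step on the support set, keeping the graph
        # so the meta-gradient flows through the update.
        loss_s = ((net(x_s, params) - y_s) ** 2).mean()
        grads = torch.autograd.grad(loss_s, list(params.values()), create_graph=True)
        adapted = {k: w - inner_lr * g
                   for (k, w), g in zip(params.items(), grads)}
        # Outer objective: loss of the adapted parameters on the query set.
        meta_loss = meta_loss + ((net(x_q, adapted) - y_q) ** 2).mean()
    (meta_loss / 4).backward()                # gradient w.r.t. the shared initialization
    meta_opt.step()
```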
### Exp
- It achieves performance comparable to the state of the art on classification, regression, and reinforcement learning tasks.
### Thought
I think the experiments are thorough, since they show that this technique can be applied to both supervised and reinforcement learning. However, the method is not entirely novel, given that [Optimization as a Model for Few-Shot Learning](https://openreview.net/pdf?id=rJY0-Kcll) already proposed learning the initialization of parameters.