Swapout: Learning an ensemble of deep architectures
Saurabh Singh, Derek Hoiem, and David Forsyth (2016)

Paper summary by aleju

* They describe a regularization method similar to dropout and stochastic depth.
* The method could be viewed as a merge of the two techniques (dropout, stochastic depth).
* The method seems to regularize better than either of the two alone.
### How
* Let `x` be the input to a layer. That layer produces an output. The output can be:
  * Feed forward ("classic") network: `F(x)`.
  * Residual network: `x + F(x)`.
* The standard dropout-like methods do the following:
  * Dropout in feed forward networks: Sometimes `0`, sometimes `F(x)`. Decided per unit.
  * Dropout in residual networks (rarely used): Sometimes `0`, sometimes `x + F(x)`. Decided per unit.
  * Stochastic depth (only in residual networks): Sometimes `x`, sometimes `x + F(x)`. Decided per *layer*.
  * Skip forward (only in residual networks): Sometimes `x`, sometimes `x + F(x)`. Decided per unit.
  * **Swapout** (any network): Sometimes `0`, sometimes `F(x)`, sometimes `x`, sometimes `x + F(x)`. Decided per unit.
* Swapout can be represented using the formula `y = theta_1 * x + theta_2 * F(x)`.
  * `*` is the element-wise product.
  * `theta_1` and `theta_2` are tensors whose entries follow Bernoulli distributions, i.e. every value is exactly `0` or exactly `1`.
  * Setting the values of `theta_1` and `theta_2` per unit in the right way leads to the values `0` (both 0), `x` (1, 0), `F(x)` (0, 1) or `x + F(x)` (1, 1).
* Deterministic and Stochastic Inference
  * Ideally, when using a dropout-like technique you would like to get rid of its stochastic effects during prediction, so that you can predict values with exactly *one* forward pass through the network (instead of having to average over many passes).
  * For Swapout it can be shown mathematically that there is no deterministic version that performs as well as the stochastic one (averaging over many forward passes).
  * This is even more the case when using Batch Normalization in a network. (It also holds without Swapout, when using Dropout + BN instead.)
  * So for best results you should use the stochastic method (averaging over many forward passes).
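
The Swapout formula `y = theta_1 * x + theta_2 * F(x)` can be sketched per unit as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `swapout`, the list-based "tensors" and the stand-in values for `F(x)` are assumptions made for the example.

```python
import random

def swapout(x, fx, p1, p2, rng):
    """Per-unit Swapout: y_i = theta1_i * x_i + theta2_i * F(x)_i.

    x and fx are equally long lists (hypothetical stand-ins for tensors),
    p1 and p2 are the Bernoulli parameters of theta_1 and theta_2.
    """
    y = []
    for xi, fi in zip(x, fx):
        theta1 = 1 if rng.random() < p1 else 0  # keep the skip term x_i?
        theta2 = 1 if rng.random() < p2 else 0  # keep the layer term F(x)_i?
        y.append(theta1 * xi + theta2 * fi)
    return y

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
fx = [10.0, 20.0, 30.0, 40.0]  # made-up values standing in for F(x)
y = swapout(x, fx, p1=0.5, p2=0.5, rng=rng)
# Every y[i] is one of 0, x[i], fx[i] or x[i] + fx[i].
```

Setting `p1 = p2 = 1.0` recovers the plain residual output `x + F(x)`, and `p1 = 0.0, p2 = 1.0` recovers the plain feed forward output `F(x)`.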
### Results
* They compare various dropout-like methods, including Swapout, applied to residual networks. (On CIFAR-10 and CIFAR-100.)
* General performance:
  * Results with Swapout are better than with the other methods.
  * According to their results, the ranking of methods is roughly: Swapout > Dropout > Stochastic Depth > Skip Forward > None.
* Stochastic vs deterministic method:
  * The stochastic method of swapout (average over N forward passes) performs significantly better than the deterministic one.
  * Using about 15-30 forward passes seems to yield good results.
* Optimal parameter choice:
  * Previously the Swapout-formula `y = theta_1 * x + theta_2 * F(x)` was mentioned.
  * `theta_1` and `theta_2` are generated via Bernoulli distributions which have parameters `p_1` and `p_2`.
  * If using fixed values for `p_1` and `p_2` throughout the network, it seems to be best to either set both of them to `0.5` or to set `p_1` to `>0.5` and `p_2` to `<0.5` (preference towards `y = x`).
  * It's best however to start both at `1.0` (always `y = x + F(x)`) and to then linearly decay both to `0.5` towards the end of the network, i.e. to apply less noise to the early layers. (This is similar to the results in the Stochastic Depth paper.)
* Thin vs. wide residual networks:
  * The standard residual networks that they compared to used a `(16, 32, 64)` pattern for their layers, i.e. they started with layers each having 16 convolutional filters, followed by some layers each having 32 filters, followed by some layers each having 64 filters.
  * They tried instead a `(32, 64, 128)` pattern, i.e. they doubled the number of filters.
  * Then they reduced the number of layers from 100 down to 20.
  * Their wider residual network performed significantly better than the deep and thin counterpart. However, their parameter count also increased by about `4` times.
  * Increasing the pattern again to `(64, 128, 256)` and increasing the number of layers from 20 to 32 leads to another performance improvement, beating a 1000-layer network of pattern `(16, 32, 64)`. (Parameter count is then `27` times the original value.)
* Comments
  * Stochastic depth works layer-wise, while Swapout works unit-wise. When a layer in Stochastic Depth is dropped, its forward and backward passes don't have to be computed. That saves time. Swapout is not going to save time.
  * They argue that dropout+BN would also profit from using stochastic inference instead of deterministic inference, just like Swapout does. However, they don't mention using it for dropout in their comparison, only for Swapout.
  * They show that linear decay for their parameters (less dropping on early layers, more on later ones) significantly improves the results of Swapout. However, they don't mention testing the same thing for dropout. Maybe dropout would also profit from it?
  * For the above two points: Dropout's test error is at 5.87, Swapout's test error is at 5.68. So the difference is already quite small, making any disadvantage for dropout significant.
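
The linear decay of `p_1` and `p_2` described under "Optimal parameter choice" can be sketched as below. This is an illustration only: the helper name `linear_schedule` and the choice to interpolate evenly over block indices (first block gets `1.0`, last block gets `0.5`) are assumptions, not details confirmed by the summary.

```python
def linear_schedule(num_blocks, start=1.0, end=0.5):
    """Return one Bernoulli parameter per block, interpolated from start to end."""
    if num_blocks == 1:
        return [start]
    step = (end - start) / (num_blocks - 1)
    return [start + step * i for i in range(num_blocks)]

# The same schedule would be used for both p_1 and p_2.
p = linear_schedule(5)
# p == [1.0, 0.875, 0.75, 0.625, 0.5]
```

Early blocks are therefore almost always plain residual blocks, while the last block drops each of the two terms half of the time.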
![Visualization](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Swapout__visualization.png?raw=true "Visualization")
*Visualization of how Swapout works. From left to right: An input `x`; a standard layer is applied to the input `F(x)`; a residual layer is applied to the input `x + F(x)`; Skip Forward is applied to the layer; Swapout is applied to the layer. Stochastic Depth would be all units being orange (`x`) or blue (`x + F(x)`).*

First published: 2016/05/20
Abstract: We describe Swapout, a new stochastic training method, that outperforms
ResNets of identical network structure yielding impressive results on CIFAR-10
and CIFAR-100. Swapout samples from a rich set of architectures including
dropout, stochastic depth and residual architectures as special cases. When
viewed as a regularization method swapout not only inhibits co-adaptation of
units in a layer, similar to dropout, but also across network layers. We
conjecture that swapout achieves strong regularization by implicitly tying the
parameters across layers. When viewed as an ensemble training method, it
samples a much richer set of architectures than existing methods such as
dropout or stochastic depth. We propose a parameterization that reveals
connections to existing architectures and suggests a much richer set of
architectures to be explored. We show that our formulation suggests an
efficient training method and validate our conclusions on CIFAR-10 and
CIFAR-100 matching state of the art accuracy. Remarkably, our 32 layer wider
model performs similar to a 1001 layer ResNet model.

This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as
$Y = X + F(X)$
where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately on both terms of a layer's equation:
$Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$
where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term.
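
As a worked illustration, assume every mask entry is drawn independently, with keep probability $p_1$ for $\Theta_1$ and $p_2$ for $\Theta_2$ (these two symbols are introduced here for the example), and treat $F(X)$ as fixed given $X$. A single unit $Y_i$ then takes one of four values:

$Y_i = \begin{cases} 0 & \text{with probability } (1-p_1)(1-p_2) \\ X_i & \text{with probability } p_1(1-p_2) \\ F(X)_i & \text{with probability } (1-p_1)\,p_2 \\ X_i + F(X)_i & \text{with probability } p_1 p_2 \end{cases}$

Summing the four cases gives the expectation of one layer's output for a fixed input:

$\mathbb{E}[Y_i \mid X] = p_1 X_i + p_2 F(X)_i$

This holds for a single layer only. Once nonlinear layers are stacked on top, the expectation of the network output is in general no longer obtained by plugging in the expected masks.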
The paper shows that this form of dropout is at least as good as, or superior to, other forms of dropout, including the recently proposed [stochastic depth dropout][1]. Much like in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last.
In addition to this observation, I also note the following empirical observations:
1. At test time, averaging the output layers of multiple dropout mask samples (referred to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard.
2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) with fewer layers (e.g. 32) than the thin but very deep (more than 1000 layers) ResNets of the original ResNet work. This would confirm a similar observation from [this paper][2].
Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice.
[1]: http://arxiv.org/abs/1603.09382v1
[2]: https://arxiv.org/abs/1605.07146

Swapout is a method that stochastically selects forward propagation in a neural network from a palette of choices: drop, identity, feedforward, residual. It achieves the best results on CIFAR-10 and CIFAR-100 that I'm aware of.
This paper examines a stochastic training method for deep architectures that is formulated in such a way that the method generalizes dropout and stochastic depth techniques. The paper studies a stochastic formulation for layer outputs which could be formulated as $Y =\Theta_1 \odot X+ \Theta_2 \odot F(X)$ where $\Theta_1$ and $\Theta_2$ are tensors of i.i.d. Bernoulli random variables. This allows layers to either: be dropped $(Y=0)$, act as a feedforward layer $Y=F(X)$, be skipped $Y=X$, or behave like a residual network $Y=X+F(X)$. The paper provides some well-reasoned conjectures as to why "both dropout and swapout networks interact poorly with batch normalization if one uses deterministic inference", while also providing some nice experiments on the importance of the choice of the form of stochastic training schedules and the number of samples required to obtain estimates that make sampling useful. The approach is able to yield performance improvement over comparable models if the key and critical details of the stochastic training schedule and a sufficient number of samples are used.
This paper proposes a generalization of some stochastic regularization techniques for effectively training deep networks with skip connections (i.e. dropout, stochastic depth, ResNets). Like stochastic depth, swapout allows for connections that randomly skip layers, which has been shown to give improved performance, perhaps due to shorter paths to the loss layer and the resulting implicit ensemble over architectures with differing depth. However, like dropout, swapout is independently applied to each unit in a layer, allowing for a richer space of sampled architectures. Since accurate expectation approximations are not easily attainable due to the skip connections, the authors propose stochastic inference (in which multiple forward passes are averaged during inference) instead of deterministic inference. To assess its effectiveness, the authors evaluate swapout on the CIFAR-10 and CIFAR-100 datasets, showing improvements over various baselines.
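
The gap between deterministic and stochastic inference can be seen on a toy example. The sketch below is not the paper's network: it assumes a single swapout unit followed by a ReLU, and the function names and input values are made up for illustration.

```python
import random

def relu(v):
    return max(0.0, v)

def swapout_unit(x, fx, theta1, theta2):
    # One unit followed by a ReLU, so the output is nonlinear in the masks.
    return relu(theta1 * x + theta2 * fx)

def deterministic_inference(x, fx, p1, p2):
    # Replace each mask by its expectation and do a single forward pass.
    return swapout_unit(x, fx, p1, p2)

def stochastic_inference(x, fx, p1, p2, num_passes, rng):
    # Average the outputs of several passes with freshly sampled masks.
    total = 0.0
    for _ in range(num_passes):
        theta1 = 1 if rng.random() < p1 else 0
        theta2 = 1 if rng.random() < p2 else 0
        total += swapout_unit(x, fx, theta1, theta2)
    return total / num_passes

x, fx = 1.0, -2.0  # toy values, chosen so that the two estimates disagree
det = deterministic_inference(x, fx, 0.5, 0.5)  # relu(0.5 - 1.0) = 0.0
sto = stochastic_inference(x, fx, 0.5, 0.5, 30, random.Random(0))
```

With these values only the mask pair (1, 0) yields a nonzero output (`relu(1.0) = 1.0`), so the true expected output is 0.25, while the deterministic pass returns 0.0. The stochastic average approaches 0.25 as `num_passes` grows.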
