Famous Convolutional Neural Network Architectures – #2

I'm Piyush Malhotra, a Delhilite who loves to dig Deep in the woods of Artificial Intelligence. I like to find new ways to solve not so new but interesting problems. Fitting new models to data and articulating new ways to manipulate and personify things is what I think my field is all about. When not working or playing with data, you'll find me in the gym or writing new blog posts.

January 30, 2019

It has certainly been a while since I last posted. Anyway, let’s get this thing rolling! In the last post, we went over some variants of convolution operations. That prepared us to get into some of the more advanced and efficient Convolutional Neural Net (CNN) architectures!

Let’s go over some of the powerful Convolutional Neural Networks that are having a big impact on the current computer vision industry.

INDEX

If you are here for a particular architecture, jump directly to its section: MobileNets, ResNeXt, SqueezeNet, DenseNet.

MobileNets – Howard et al

MobileNets, a class of efficient models, are built on depthwise separable convolutions. These reduce the number of parameters required for the convolution operations, and hence the size of the model!

Google researchers created this class of CNN architectures to make deep learning models accessible to smaller, less powerful devices like your smartphone. Let’s have a look at its architecture!

MobileNet – Architecture

The architecture follows an easy-to-replicate pattern!

Every conv layer that is not a pointwise (1×1) convolution has a filter size of 3×3.

MobileNet – CODE
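The original code listing did not survive here, so below is a small, framework-free sketch of the core idea: comparing the parameter count of a standard 3×3 convolution with a depthwise separable one. The formulas follow from the definitions above; the layer sizes (128 and 256 channels) are illustrative choices, not taken from the paper.

```python
def standard_conv_params(c_in, c_out, k=3):
    # A standard conv learns one k x k filter per (input, output) channel pair.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k=3):
    # Depthwise step: one k x k filter per input channel,
    # followed by a pointwise (1x1) conv that mixes channels.
    return c_in * k * k + c_in * c_out

c_in, c_out = 128, 256
std = standard_conv_params(c_in, c_out)        # 294912
sep = depthwise_separable_params(c_in, c_out)  # 33920
print(std, sep, round(std / sep, 1))           # roughly an 8.7x reduction
```

This is exactly where MobileNet's size savings come from: for 3×3 filters the separable version costs close to 8–9× fewer parameters per layer.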

ResNeXt – Xie et al

An extension of Deep Residual Networks inspired by the split-transform-aggregate strategy used in Inception blocks. Rather than applying a single residual block over the incoming feature map, we use several branches (no. of branches = cardinality of the network) of operations which merge into one, after which the residual combination is applied!

The paper reports four key findings:

Grouped convolutions led to specialization: each group focused on different attributes of the input image, meaning each branch in a block specializes in a particular attribute it is interested in.

The experimental results showed that increasing cardinality is more effective at improving model performance than increasing the depth or width of the network!

Residual connections are a crucial part of optimization!

Aggregated transformations give strong representations!

ResNeXt – Architecture

The ResNeXt architecture can be considered ResNet in disguise. We have a ResNeXt block with several branches, each of which is a simple feature extractor of three stacked convolution layers that act as a bottleneck. The second convolution layer is a grouped convolution: the input channels are split into C = 32 groups (the cardinality), and each group is convolved with its own set of filters (depthwise convolution is the extreme case where the number of groups equals the number of channels). The other thing to keep in mind is that every convolution operation is followed by batch normalization and ReLU. The complete architecture is displayed in the figure and table below.

| Layer name | Output size | ResNeXt-50 (32×4d) |
|---|---|---|
| conv1 | 112×112 | 7×7, 64, stride 2 |
| maxpool | 56×56 | 3×3 max pool, stride 2 |
| conv2_x | 56×56 | [1×1, 128 → 3×3, 128, C=32 → 1×1, 256] × 3 |
| conv3_x | 28×28 | [1×1, 256 → 3×3, 256, C=32 → 1×1, 512] × 4 |
| conv4_x | 14×14 | [1×1, 512 → 3×3, 512, C=32 → 1×1, 1024] × 6 |
| conv5_x | 7×7 | [1×1, 1024 → 3×3, 1024, C=32 → 1×1, 2048] × 3 |
| output | 1×1 | global average pooling, fully connected (1000 units), softmax |

ResNeXt – CODE
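The code listing is missing here as well; as a stand-in, here is a small sketch of the parameter arithmetic behind the grouped convolution in a conv2_x bottleneck from the table above. The input channel count of 256 assumes we are past the first block of the stage, so take the exact totals as illustrative.

```python
def conv_params(c_in, c_out, k=1, groups=1):
    # Grouped convolution: the input channels are split into `groups` groups,
    # each convolved independently, dividing the parameter count by `groups`.
    return (c_in // groups) * c_out * k * k

# conv2_x bottleneck from the table: 1x1, 128 -> 3x3, 128 (C=32) -> 1x1, 256
c_in = 256
block = (conv_params(c_in, 128)
         + conv_params(128, 128, k=3, groups=32)
         + conv_params(128, 256))

# The same bottleneck with an ordinary (ungrouped) 3x3 conv in the middle:
dense = (conv_params(c_in, 128)
         + conv_params(128, 128, k=3)
         + conv_params(128, 256))
print(block, dense)  # the grouped 3x3 costs 32x fewer params than the dense one
```

This is why ResNeXt can afford a wide 128-channel bottleneck: grouping with C=32 makes the 3×3 stage cheap, so cardinality can be increased at roughly constant cost.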

SqueezeNet – Iandola et al

Want a memory-efficient deep neural network that can run on embedded devices? This might be your go-to network! Let me tell you a secret: SqueezeNet achieves AlexNet-like accuracy on ImageNet with 50 times fewer parameters. Interesting, isn’t it?

The following three main strategies were employed to do so:

Replace 3×3 filters with 1×1 filters. A 1×1 conv has 9 times fewer params than a 3×3 conv.

Decrease the number of input channels to 3×3 filters. The number of params in a 3×3 conv layer = (no. of input channels) × (no. of output channels) × 3 × 3. If we reduce the no. of input channels, we reduce the number of params!

Downsample late in the network so that the convolution layers have large activation maps. The paper “Convolutional neural networks at constrained time cost” by He and Sun shows that delayed downsampling leads to higher accuracy, which gave the authors the intuition to downsample late and work with large activation maps.
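The arithmetic behind strategies 1 and 2 can be checked in a few lines. The channel counts here (64 in, 64 out) are illustrative, not from the paper:

```python
def conv_params(c_in, c_out, k):
    # Parameter count of a conv layer: c_in x c_out x k x k
    return c_in * c_out * k * k

# Strategy 1: swapping a 3x3 filter for a 1x1 filter saves 9x parameters.
print(conv_params(64, 64, 3) // conv_params(64, 64, 1))  # 9

# Strategy 2: halving the input channels of a 3x3 conv halves its parameters.
print(conv_params(64, 64, 3), conv_params(32, 64, 3))  # 36864 vs 18432
```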

SqueezeNet – Architecture

The SqueezeNet architecture is made up of what the authors called a “Fire module”. Sure is a godly name – “Lord of Light”. (Okay okay, no more Game of Thrones references, Piyush!). Let’s move on.

So, what is a fire module? It is a fancy name for a bottleneck layer.

The fire module contains two things:

A Squeeze layer that, as the name suggests, squeezes the input channels!

An expand layer that, again as the name suggests, expands the input channels!

The expand layer of the fire module contains two types of convolutions:

A 1×1 convolution inspired from strategy 1 discussed above.

A 3×3 convolution.

The number of channels in the squeeze layer is less than the sum of the number of channels in the 1×1 and 3×3 expand layers. This is inspired by strategy 2 discussed above.

Now, let’s have a look at the complete architecture. It comes in 3 forms:

A simple SqueezeNet with no shortcut connections. (The leftmost one)

A SqueezeNet with “simple bypass” connections. These connections are placed between layers that have the same number of output channels. (The middle one)

A SqueezeNet with “complex bypass” connections. These use 1×1 convolutions to add shortcut connections between layers with a different number of output channels. (The rightmost one)

Finally, let’s have a look at parameters of these architectures.

| Layer name | Output size | Filter size / stride | Squeeze (1×1) filters | Expand (1×1) filters | Expand (3×3) filters |
|---|---|---|---|---|---|
| input | 224×224×3 | – | – | – | – |
| conv1 | 111×111×96 | 7×7, 96, stride 2 | – | – | – |
| maxpool1 | 55×55×96 | 3×3, stride 2 | – | – | – |
| fire2 | 55×55×128 | – | 16 | 64 | 64 |
| fire3 | 55×55×128 | – | 16 | 64 | 64 |
| fire4 | 55×55×256 | – | 32 | 128 | 128 |
| maxpool2 | 27×27×256 | 3×3, stride 2 | – | – | – |
| fire5 | 27×27×256 | – | 32 | 128 | 128 |
| fire6 | 27×27×384 | – | 48 | 192 | 192 |
| fire7 | 27×27×384 | – | 48 | 192 | 192 |
| fire8 | 27×27×512 | – | 64 | 256 | 256 |
| maxpool3 | 13×13×512 | 3×3, stride 2 | – | – | – |
| fire9 | 13×13×512 | – | 64 | 256 | 256 |
| conv10 | 13×13×1000 | 1×1, 1000, stride 1 | – | – | – |
| avgpool10 | 1×1×1000 | 13×13, stride 1 | – | – | – |

SqueezeNet – CODE
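The code listing did not survive here either. As a minimal substitute, here is a plain-Python sketch of the Fire module’s channel bookkeeping, using the squeeze/expand filter counts from the table above (a real implementation would build these as conv layers in a deep learning framework):

```python
def fire_output_channels(squeeze, expand1x1, expand3x3):
    # The squeeze layer narrows the input to `squeeze` channels; the expand
    # layer's 1x1 and 3x3 branches are concatenated on the channel axis.
    assert squeeze < expand1x1 + expand3x3  # strategy-2 constraint
    return expand1x1 + expand3x3

# fire2..fire9 configs from the table: (squeeze, expand 1x1, expand 3x3)
fires = [(16, 64, 64), (16, 64, 64), (32, 128, 128), (32, 128, 128),
         (48, 192, 192), (48, 192, 192), (64, 256, 256), (64, 256, 256)]
print([fire_output_channels(*f) for f in fires])
# [128, 128, 256, 256, 384, 384, 512, 512] -- matching the output sizes above
```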

DenseNet – Huang et al

The paper that won Best Paper at CVPR 2017. It leveraged the idea of ResNets and built on it for better flow of information between layers, by proposing a connectivity pattern in which each layer is connected to all subsequent layers. That means the lth layer receives the feature maps of all preceding layers (merged by concatenation). Let’s look at the architecture now!

DenseNet – Architecture

Dense Block: A block of convolution layers such that every layer is connected (read concatenated) to every subsequent layer in the block.

Transition Block: A block where we downsample the information as we move from one Dense Block to another.

Quite a few things went into deciding the structure of the model:

Composite functions: Every conv layer shown is a combination of three consecutive operations – batch normalization, ReLU and convolution.

Growth rate: If each layer produces k feature maps, then the lth layer will receive k0 + k(l-1) input feature maps (k0 being the number of input feature maps to the block). An important difference here is that DenseNet can have very narrow layers; according to the authors, one explanation is that every layer already has access to the information of all the preceding layers.

Bottleneck layers: We can use 1×1 convs to make a bottleneck and reduce the number of params going into the 3×3 convs. This is done in the dense blocks for computational efficiency; the authors used 4k filters for the 1×1 conv.

Compression: To further improve the efficiency of the model, the authors reduced the number of feature maps at the transition layers by a factor Θ (a reduction hyperparameter): if a dense block outputs m channels, the following transition block outputs Θm channels. To do this, they used a 1×1 conv layer before the pooling layer in the transition block.

Note: Authors used 2k filters in the initial convolution layer!
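The growth-rate bookkeeping above is easy to sanity-check in code. The values k = 32 and k0 = 64 below are illustrative, not tied to a specific DenseNet variant:

```python
def input_feature_maps(k0, k, l):
    # Layer l of a dense block sees the original k0 maps plus the k maps
    # contributed by each of the l-1 preceding layers (all concatenated).
    return k0 + k * (l - 1)

# With growth rate k = 32 and k0 = 64 input maps, the first six layers see:
print([input_feature_maps(64, 32, l) for l in range(1, 7)])
# [64, 96, 128, 160, 192, 224]
```

So even though each layer adds only k narrow feature maps, the inputs it sees grow linearly with depth, which is exactly what the bottleneck and compression tricks are there to tame.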

Let’s have a look at the complete architecture now:

| Layer name | Output size | DenseNet-121 | DenseNet-169 | DenseNet-201 |
|---|---|---|---|---|
| conv1 | 112×112 | 7×7, 64, stride 2 (all three) | | |
| maxpool | 56×56 | 3×3 max pool, stride 2 (all three) | | |
| Dense block 1 | 56×56 | [1×1 conv → 3×3 conv] × 6 | × 6 | × 6 |
| Transition layer 1 | 28×28 | 1×1 conv, then 2×2 average pool, stride 2 (all three) | | |
| Dense block 2 | 28×28 | [1×1 conv → 3×3 conv] × 12 | × 12 | × 12 |
| Transition layer 2 | 14×14 | 1×1 conv, then 2×2 average pool, stride 2 (all three) | | |
| Dense block 3 | 14×14 | [1×1 conv → 3×3 conv] × 24 | × 32 | × 48 |
| Transition layer 3 | 7×7 | 1×1 conv, then 2×2 average pool, stride 2 (all three) | | |
| Dense block 4 | 7×7 | [1×1 conv → 3×3 conv] × 16 | × 32 | × 32 |
| Classification | 1×1 | global average pooling, fully connected (1000 units), softmax (all three) | | |

DenseNet – CODE
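The code listing is missing here too; as a substitute, a few lines showing how the names DenseNet-121/169/201 follow from the table above: each dense layer contributes a 1×1 and a 3×3 conv, each transition layer a 1×1 conv, plus conv1 and the final fully connected layer.

```python
def densenet_depth(block_sizes):
    # Each dense layer = one 1x1 conv + one 3x3 conv.
    convs_in_blocks = 2 * sum(block_sizes)
    transitions = len(block_sizes) - 1   # one 1x1 conv per transition layer
    return 1 + convs_in_blocks + transitions + 1  # + conv1 and the final FC

print(densenet_depth([6, 12, 24, 16]))  # 121
print(densenet_depth([6, 12, 32, 32]))  # 169
print(densenet_depth([6, 12, 48, 32]))  # 201
```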

So, we reached the end of this post. It certainly was fun to write this one. I still wanted to include one more CNN architecture but decided not to. This Convolutional Neural Network Architecture is so special that it will have a post of its own. The NASNets are coming soon. 😉