Last Sunday our AI Lab researcher Niraj Kale presented an amazing workshop on object detection with EfficientNet and EfficientDet – state-of-the-art models published in 2019 by the Google Brain team.

EfficientNet :-

Traditionally, one can scale ConvNets by depth (number of layers), width (number of channels), or input resolution.

Model Scaling – (a) is a baseline network; (b)-(d) are conventional scaling along a single dimension; (e) is the compound scaling proposed by the EfficientNet paper. Figure (a) illustrates a representative ConvNet, where the spatial dimension is gradually shrunk while the channel dimension is expanded over the layers, for example from an initial input shape of <224, 224, 3> to a final output shape of <7, 7, 512>. Image Credit : https://arxiv.org/abs/1905.11946

In research so far, practitioners have typically scaled only one of the above dimensions. Even when two or three dimensions were scaled, it was done arbitrarily.

A ConvNet layer i can be defined as Yi = Fi(Xi), where Xi is the input tensor with shape <Hi, Wi, Ci>. A ConvNet N can then be represented as :-
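In the notation of the EfficientNet paper, the whole network is a composition of its stages:

\mathcal{N} = \bigodot_{i=1 \ldots s} \mathcal{F}_i^{L_i}\big(X_{\langle H_i, W_i, C_i \rangle}\big)

where \mathcal{F}_i^{L_i} means that layer F_i is repeated L_i times in stage i, and <Hi, Wi, Ci> is the input shape of that stage.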

In practice, ConvNet layers are often partitioned into multiple stages and all layers in each stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolution type, except that the first layer performs down-sampling.

In order to further reduce the design space, we restrict all layers to be scaled uniformly with a constant ratio. The optimization problem then becomes maximizing model accuracy under given resource budgets :-
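The paper formalizes this as its problem 2: maximize accuracy over the depth, width, and resolution multipliers d, w, r applied to a fixed baseline network, subject to memory and FLOPS budgets:

\max_{d, w, r} \ \text{Accuracy}\big(\mathcal{N}(d, w, r)\big)
\text{s.t.} \quad \mathcal{N}(d, w, r) = \bigodot_{i=1 \ldots s} \hat{\mathcal{F}}_i^{\, d \cdot \hat{L}_i}\big(X_{\langle r \cdot \hat{H}_i,\ r \cdot \hat{W}_i,\ w \cdot \hat{C}_i \rangle}\big)
\quad \text{Memory}(\mathcal{N}) \le \text{target memory}, \qquad \text{FLOPS}(\mathcal{N}) \le \text{target FLOPS}

where the hatted quantities are the predefined layers and shapes of the baseline network.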

The main difficulty of problem 2 is that the optimal d, w, r depend on each other and the values change under different resource constraints.

Pre-trained models such as ResNet can be scaled up from ResNet-50 to ResNet-200, and they can also be scaled down from ResNet-50 to ResNet-18. The intuition is that a deeper network (depth scaling) can capture richer and more complex features, and generalizes well to new tasks. However, vanishing gradients are one of the most common problems that arise as we go deeper. Even if vanishing gradients are avoided, or training is smoothed with other techniques, adding more layers doesn't always help. For example, ResNet-1000 has accuracy similar to ResNet-101.

Width scaling is commonly used when we want to keep our model small. Wider networks tend to be able to capture more fine-grained features, and smaller models are easier to train. The problem is that even though you can make your network extremely wide, with shallow models (less deep but wider) accuracy saturates quickly as width grows.

Next we come to resolution scaling. Intuitively, in a high-resolution image the features are more fine-grained, so high-resolution inputs should work better. This is also one of the reasons that complex tasks like object detection use input resolutions such as 300×300, 512×512, or 600×600. But accuracy does not scale linearly with resolution; the gain diminishes very quickly. For example, increasing the resolution from 500×500 to 560×560 doesn't yield significant improvements.

Scaling Network Width for Different Baseline Networks. Each dot on a line denotes a model with a different width coefficient (w). Image Credit : https://arxiv.org/abs/1905.11946

Note : Top-1 accuracy is the conventional accuracy: the model's answer (the class with the highest probability) must exactly match the expected answer. Top-5 accuracy means that any of the model's 5 highest-probability answers may match the expected answer.
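As a quick illustration, here is a minimal NumPy sketch of top-k accuracy (the function and variable names are our own, not from any paper or library):

import numpy as np

def top_k_accuracy(probs, labels, k=5):
    # probs: (N, num_classes) predicted class probabilities; labels: (N,) true class ids
    top_k = np.argsort(probs, axis=1)[:, -k:]                  # the k most probable classes per sample
    hits = [labels[i] in top_k[i] for i in range(len(labels))]
    return float(np.mean(hits))                                # fraction of samples whose label is in the top k

# top-1 accuracy is simply the k=1 special case: top_k_accuracy(probs, labels, k=1)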

Hence we resort to combined scaling. Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling is a tedious task. Most of the time, manual scaling results in sub-optimal accuracy and efficiency.

It is critical to balance all dimensions of a network (width, depth, and resolution) when scaling a CNN in order to improve both accuracy and efficiency.

Equation 3 : Compound Scaling method

φ is a user-specified coefficient that controls how many resources are available whereas α, β, and γ specify how to assign these resources to network depth, width, and resolution respectively.
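Written out, the compound scaling rule takes the form:

\text{depth: } d = \alpha^{\phi}, \qquad \text{width: } w = \beta^{\phi}, \qquad \text{resolution: } r = \gamma^{\phi}
\text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1

The constraint α · β² · γ² ≈ 2 ensures that the total FLOPS grow by roughly 2^φ for any φ, since FLOPS scale linearly with depth but quadratically with width and resolution.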

The baseline architecture for scaling can be built from existing ConvNets, but we create a new mobile-size baseline, called EfficientNet-B0.

The authors obtained their base network by doing a Neural Architecture Search (NAS) that optimizes both accuracy and FLOPS.

Starting from the baseline network EfficientNet-B0, compound scaling is applied using a two-step method :-

Fix φ = 1, assuming that twice as many resources are available, and do a small grid search for α, β, and γ based on equations 2 and 3. For the baseline network B0, the optimal values turned out to be α = 1.2, β = 1.1, and γ = 1.15, such that α · β² · γ² ≈ 2.

Now fix α, β, and γ as constants and experiment with different values of φ using Equation 3, to obtain EfficientNet-B1 to B7.
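A small Python sketch of this step (illustrative only: the released B1 to B7 checkpoints use hand-rounded coefficients and resolutions that differ slightly from the raw powers below):

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15           # values from the grid search at phi = 1

def compound_scale(phi):
    # multiplicative factors for depth, width, and input resolution at a given phi
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(1, 8):                        # rough stand-ins for B1 ... B7
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")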

In order to make the search space smaller and the search operation less costly, the search for these parameters is done only on the small baseline network (Step 1 above), and the same scaling coefficients are then used for all other models (Step 2 above).

The results of EfficientNet scaling are indicated below for EfficientNet-B0 to B7 :-

Table 2. EfficientNet Performance Results on ImageNet (Russakovsky et al., 2015). All EfficientNet models are scaled from our baseline EfficientNet-B0 using different compound coefficients φ in Equation 3. ConvNets with similar top-1/top-5 accuracy are grouped together for efficiency comparison. Our scaled EfficientNet models consistently reduce parameters and FLOPS by an order of magnitude (up to 8.4x parameter reduction and up to 16x FLOPS reduction) compared with existing ConvNets. Source Credit : https://arxiv.org/abs/1905.11946

EfficientNet models use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy. In particular, EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate yet 8.4x smaller than the previous best, GPipe (Huang et al., 2018).

Object detection before deep learning was a multi-step process, starting with edge detection and feature extraction using techniques like SIFT, HOG, etc. These features were then compared with existing object templates, usually at multiple scales, to detect and localize the objects present in the image.

Detecting objects at different scales is challenging, in particular for small objects. We can use a pyramid of the same image at different scales to detect objects (the left diagram below).

However, processing images at multiple scales is time-consuming, and the memory demand is too high to train on all scales end-to-end simultaneously. Hence, this approach is often used only at inference time to push accuracy as high as possible, in particular for competitions, when speed is not a concern.

Alternatively, we can create a pyramid of features and use it for object detection. However, feature maps closer to the input image are composed of low-level structures that are not effective for accurate object detection.

Feature Pyramid Network (FPN) is a feature extractor designed around this pyramid concept with both accuracy and speed in mind. It replaces the feature extractor of detectors like Faster R-CNN and generates multiple feature-map layers (multi-scale feature maps) with better-quality information than the regular feature pyramid for object detection.

FPN is composed of a bottom-up and a top-down pathway. The bottom-up pathway is the usual convolutional network for feature extraction. As we go up, the spatial resolution decreases, but as more high-level structures are detected, the semantic value of each layer increases.

In this example, ResNet is used to construct the bottom-up pathway. It is composed of several convolution modules (conv_i for i = 1 to 5), each of which contains many convolution layers. As we move up, the spatial dimension is reduced by half (i.e. the stride doubles). The output of each convolution module is labeled Ci and later used in the top-down pathway.

We apply a 1 × 1 convolution filter to reduce the channel depth of C5 to 256-d, creating M5. This becomes the first feature map layer used for object prediction.

As we go down the top-down path, we upsample the previous layer by a factor of 2 using nearest-neighbor upsampling, apply a 1 × 1 convolution to the corresponding feature map in the bottom-up pathway, and add the two element-wise. We then apply a 3 × 3 convolution to every merged map; this filter reduces the aliasing effect introduced by the upsampling.

We repeat the same process for P3 and P2, but stop at P2 because the spatial dimension of C1 is too large and would slow down the process too much. Because the same classifier and box regressor are shared across every output feature map, all pyramid feature maps (P5, P4, P3, and P2) have 256 output channels.
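A minimal PyTorch-style sketch of this top-down pathway (the module name FPNTopDown and the channel sizes are our own illustrative choices, assuming a ResNet-style backbone that outputs C2 to C5):

import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions bring C2..C5 down to a common 256-d channel depth
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convolutions smooth each merged map and reduce upsampling aliasing
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        m5 = self.lateral[3](c5)                                               # M5 from C5
        m4 = self.lateral[2](c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lateral[1](c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        m2 = self.lateral[0](c2) + F.interpolate(m3, scale_factor=2, mode="nearest")
        p5, p4, p3, p2 = self.smooth[3](m5), self.smooth[2](m4), self.smooth[1](m3), self.smooth[0](m2)
        return p2, p3, p4, p5                                                  # all with 256 output channels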

Just like Mask R-CNN, FPN is also good at extracting masks for image segmentation. 5 × 5 windows are slid over the feature maps to generate 14 × 14 segments, and masks at different scales are later merged to form the final mask predictions.

EfficientDet :-

The creators of EfficientDet wanted to see whether it is possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs). Their paper tackles this question by systematically studying various design choices of detector architectures. It examines the choices for the backbone, feature fusion, and class/box networks, and seeks to solve these two challenges :-

Efficient multi-scale feature fusion : The Feature Pyramid Network (FPN) has become the de facto standard for fusing multi-scale features; detectors such as RetinaNet, PANet, and NAS-FPN use it. However, most of the fusion strategies adopted in these networks do not take the importance of each input feature into account while fusing. They simply sum the features up without distinction. Intuitively, not all input features contribute equally to the output feature, so a better strategy for multi-scale fusion is required.

Model scaling : Most previous works tend to make the backbone network bigger to improve accuracy. The authors observed that scaling up the feature network and the box/class prediction networks is also critical when taking into account both accuracy and efficiency. Inspired by the compound scaling in EfficientNets, the authors proposed a compound scaling method for object detectors, which jointly scales up the resolution, depth, and width of the backbone, feature network, and box/class prediction network.

EfficientDet uses a BiFPN (bidirectional FPN) architecture for multi-scale feature fusion, which aims to aggregate features at different resolutions.

The conventional FPN aggregates multi-scale features in a top-down manner:
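Using feature levels 3 to 7 as an example, this top-down aggregation can be written as (Resize denotes an upsampling operation used to match resolutions):

P_7^{out} = \text{Conv}(P_7^{in})
P_6^{out} = \text{Conv}\big(P_6^{in} + \text{Resize}(P_7^{out})\big)
\cdots
P_3^{out} = \text{Conv}\big(P_3^{in} + \text{Resize}(P_4^{out})\big)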

Cross-scale connections :

The problem with the conventional FPN, as shown in Figure (a), is that it is limited by the one-way (top-down) information flow. To address this issue, PANet adds an extra bottom-up path-aggregation network, as shown in Figure (b) above.

Other works, e.g. NAS-FPN, have also studied cross-scale connections for capturing better semantics. In short, the game is all about how low-level features are connected to high-level features, and vice-versa, to capture better semantics. Building on this, the BiFPN makes three optimizations to the cross-scale connections :-

Remove nodes that have only one input edge. If a node has only one input edge and no feature fusion, it will contribute less to a feature network whose aim is to fuse different features.

Add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost.

Treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion.

Weighted Feature Fusion :

Previous feature fusion methods treat all input features equally without distinction. However, we observe that since different input features are at different resolutions, they usually contribute to the output feature unequally. To address this issue, we propose to add an additional weight for each input during feature fusion, and let the network learn the importance of each input feature. Based on this idea, we consider three weighted fusion approaches:

Unbounded fusion :
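In its simplest form this is just a learnable weighted sum of the input features I_i:

O = \sum_i w_i \cdot I_i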

where wi is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). To make the training stable, we resort to weight normalization to bound the value range of each weight.

Softmax-based fusion :

An intuitive idea is to apply a softmax to the weights, so that all weights are normalized to probabilities in the range 0 to 1, representing the importance of each input. However, the extra softmax leads to a significant slowdown on GPU hardware.
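Concretely, the softmax-based fusion is:

O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i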

Fast normalized fusion :

To minimize the extra latency cost, we further propose a fast fusion approach.
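It replaces the softmax with a direct normalization of the non-negative weights:

O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i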

where wi >= 0 is ensured by applying a ReLU after each wi, and ε = 0.0001 is a small value to avoid numerical instability. Each normalized weight again falls between 0 and 1, but since there is no softmax operation, it is much more efficient. The authors' ablation study shows this fast fusion approach achieves learning behavior and accuracy very similar to softmax-based fusion, while running up to 30% faster on GPUs.

The final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion.
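As a concrete example, the fused features at level 6 are computed as:

P_6^{td} = \text{Conv}\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \text{Resize}(P_7^{in})}{w_1 + w_2 + \epsilon}\right)

P_6^{out} = \text{Conv}\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \text{Resize}(P_5^{out})}{w_1' + w_2' + w_3' + \epsilon}\right)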

where P6td is the intermediate feature at level 6 on the top-down pathway, and P6out is the output feature at level 6 on the bottom-up pathway.

Now let's look at the EfficientDet architecture.

EfficientDet architecture – It employs EfficientNet as the backbone network, BiFPN as the feature network, and a shared class/box prediction network. Both BiFPN layers and class/box net layers are repeated multiple times based on different resource constraints, as shown in Table 1. Image Credit : https://arxiv.org/abs/1911.09070

EfficientDet Architecture :

EfficientDet detectors are single-shot detectors much like SSD and RetinaNet. The backbone networks are ImageNet pretrained EfficientNets. The proposed BiFPN serves as the feature network, which takes level 3–7 features {P3, P4, P5, P6, P7} from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network to produce object class and bounding box predictions respectively. The class and box network weights are shared across all levels of features.

We have already seen with EfficientNets that scaling all dimensions provides much better performance, and we would like to do the same for the EfficientDet family of models. Previous works in object detection scale only the backbone network or the FPN layers to improve accuracy. This is very limiting, as it focuses on scaling only one dimension of the detector. The authors proposed a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.

Object detectors have many more scaling dimensions than image classification models, so a grid search over all dimensions is very expensive. Therefore, the authors used a heuristic-based scaling approach that still follows the main idea of jointly scaling up all dimensions.

Backbone network: The same width/depth scaling coefficients as EfficientNet-B0 to B6 are used, so that ImageNet-pretrained checkpoints can be reused.

BiFPN network: The authors grow the BiFPN width (#channels) exponentially, as done in EfficientNets, but increase the depth (#layers) linearly since depth needs to be rounded to small integers. A grid search found 1.35 to be the best scale factor for the width (a combined sketch of Equations 1-3 follows below).

Equation 1

Box/class prediction network: The width is kept the same as the BiFPN, but the depth (#layers) is increased linearly.

Equation 2

Input image resolution: Since feature levels 3–7 are used in the BiFPN, the input resolution must be divisible by 2^7 = 128, so the resolution is increased linearly using the equation:

Equation 3
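Putting Equations 1 to 3 together, here is a small Python sketch of how the single compound coefficient φ drives all dimensions (the function name efficientdet_config is our own; the base values of 64 channels, 3 BiFPN layers, 3 head layers, and a 512-pixel input follow the EfficientDet paper, which additionally rounds channel counts, so the published Table 1 values differ slightly for the larger models):

import math

def efficientdet_config(phi):
    w_bifpn = int(64 * (1.35 ** phi))          # Eq. 1: BiFPN width grows exponentially
    d_bifpn = 3 + phi                          # Eq. 1: BiFPN depth grows linearly
    d_head  = 3 + math.floor(phi / 3)          # Eq. 2: box/class net depth
    r_input = 512 + phi * 128                  # Eq. 3: input resolution stays divisible by 128
    return w_bifpn, d_bifpn, d_head, r_input

for phi in range(7):                           # EfficientDet-D0 ... D6
    print(f"D{phi}:", efficientdet_config(phi))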

Now, using equations (1), (2), and (3) with different values of φ, we can go from EfficientDet-D0 (φ = 0) to EfficientDet-D6 (φ = 6) as shown in Table 1 below. Models scaled up with φ >= 7 could not fit in memory without changing the batch size or other settings. Therefore, the authors expanded D6 to D7 by enlarging only the input size while keeping all other dimensions the same, so that the same training settings can be used for all models. Here is a table summarizing all these configs: