where $F$ and $k$ are discrete functions and $k$ is defined on a grid of size $(2r + 1)^2$, dilated convolution introduces a dilation factor $l$:

$(F\star_l k)(p) = \sum_{s+lt=p} F(s)k(t)$
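To make the definition concrete, here is a minimal NumPy sketch of the formula in one dimension: the kernel is sampled at positions $t = -r, \ldots, r$, but the input is sampled with stride $l$ around $p$. The function name and the restriction to valid positions are my own choices for illustration, not part of the paper.

```python
import numpy as np

def dilated_conv1d(F, k, l):
    """(F *_l k)(p) = sum_t F(p - l*t) k(t), for a kernel k of size
    2r + 1 indexed by t = -r, ..., r, evaluated only at positions p
    where all sampled points lie inside F (illustrative helper)."""
    r = (len(k) - 1) // 2
    out = []
    for p in range(l * r, len(F) - l * r):
        val = sum(F[p - l * t] * k[t + r] for t in range(-r, r + 1))
        out.append(val)
    return np.array(out)

F = np.arange(10, dtype=float)   # input signal
k = np.array([1.0, 0.0, -1.0])   # r = 1 difference kernel
print(dilated_conv1d(F, k, 1))   # l = 1: ordinary convolution
print(dilated_conv1d(F, k, 2))   # l = 2: samples F at p-2, p, p+2
```

With $l = 1$ this reduces to ordinary discrete convolution; larger $l$ spreads the same $2r + 1$ kernel taps over a wider window without adding parameters.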

In the proposed context module, Yu and Koltun stack several layers of dilated convolution with exponentially increasing dilation factors:

$F_{i+1} = F_i \star_{2^i} k_i$ for $i = 0,1,\ldots,n-2$

This allows the receptive field to grow exponentially while preserving the resolution (in contrast to pooling or strided convolution, which increase the receptive field by reducing resolution). The idea is illustrated in Figure 1.
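The exponential growth can be checked with a short calculation: a $3 \times 3$ kernel with dilation $d$ extends the receptive field by $2d$ along each axis, so stacking layers with dilations $2^0, 2^1, \ldots, 2^{n-1}$ yields a receptive field of $2^{n+1} - 1$ after $n$ layers. A small sketch (the helper function is illustrative, not from the paper):

```python
def receptive_field(n):
    """Receptive field (per axis) of n stacked 3x3 convolutions
    with dilation factors 2^0, 2^1, ..., 2^(n-1)."""
    rf = 1
    for i in range(n):
        rf += 2 * (2 ** i)   # 3x3 kernel with dilation 2^i adds 2*2^i
    return rf

for n in range(1, 6):
    print(n, receptive_field(n))   # 3, 7, 15, 31, 63 = 2^(n+1) - 1
```

With plain (undilated) $3 \times 3$ convolutions the same stack would only reach a receptive field of $2n + 1$, i.e. linear rather than exponential growth.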

Figure 1: Illustration of the first few layers of a context module with exponentially increasing dilation factor. Red points illustrate the sampled pixels while the green regions illustrate the increasing receptive field.

In practice, they also discuss initialization. Concretely, they were not able to improve semantic segmentation performance using standard initialization techniques. Instead they use an identity initialization:

$k^b(t, a) = 1_{[t=0]}1_{[a=b]}$

where $a$ is the index of the input feature map and $b$ the index of the output feature map (assuming the same number of output feature maps is computed); $t$ indexes the kernel location, i.e. only the central weight is set to $1$. They also generalize this scheme to the case where the numbers of input and output feature maps do not match. Let $c_i$ and $c_{i+1}$ be the number of feature maps in layer $i$ and layer $i + 1$, then