5.
Problem Formulation
● When we switch from an MLP to a CNN, we get shift invariance for free, but
NO rotation invariance, NO scale invariance, NO flip invariance, etc.
● A key object “in the wild” may come with different amounts of context.

7.
Spatial Transformer Network
Reference: https://arxiv.org/abs/1506.02025 (DeepMind)
Key features:
● The action of the STN is conditioned on the individual sample
● Does much more than attention: cropping, translation, rotation, scaling, skew
● No additional supervision required
● No modifications to the optimization process
● Computationally efficient
● Can be applied to any feature map, or directly to the input image

9.
Notes: Localization network
1. Should predict 6 (2 x 3) parameters for a 2D affine transform, 12 (3 x 4) for a 3D
transform, and N x (N + 1) parameters in the general N-dimensional case.
2. The LocNet’s “feature extractor” can have any structure.
In practice, 2 x Conv + Dense is enough for most problems (see the sketch below).
3. Can be modified to predict several transformations.
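
A minimal PyTorch sketch of such a LocNet, assuming a single-channel 28 x 28 input (the channel counts and kernel sizes here are illustrative, not from the slides). Initializing the dense head to the identity transform is the usual trick from the STN paper, so training starts from “no warp”:

```python
import torch
import torch.nn as nn

class LocNet(nn.Module):
    """2 x Conv + Dense localization network predicting the 6 affine parameters."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),   # 28 -> 11
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),  # 11 -> 3
        )
        self.fc = nn.Linear(10 * 3 * 3, 6)  # 6 = 2 x 3 parameters of a 2D affine map
        # Start from the identity transform: zero weights, identity bias.
        nn.init.zeros_(self.fc.weight)
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.fc(self.features(x).flatten(1))
        return theta.view(-1, 2, 3)  # one 2 x 3 affine matrix per sample
```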

10.
Notes: Grid generator and transforms
1. The transformation can have any parameterized form that is
differentiable with respect to the parameters.
2. In the 2D case, the transform is an affine map (see the equations below).
3. In particular, attention is a special case of that affine map (see below).
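
The equations behind points 2 and 3, reconstructed from the referenced paper (arXiv:1506.02025): the pointwise 2D affine transform maps target grid coordinates to source coordinates, and attention restricts A_theta to an isotropic scale plus translation, i.e., a differentiable crop:

```latex
% 2D affine case: source coordinates as a function of the target grid
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
  = \mathcal{T}_\theta(G_i)
  = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
  = \begin{bmatrix}
      \theta_{11} & \theta_{12} & \theta_{13} \\
      \theta_{21} & \theta_{22} & \theta_{23}
    \end{bmatrix}
    \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}

% Attention: isotropic scale s and translation (t_x, t_y)
A_\theta^{\text{att}} = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}
```

In PyTorch, torch.nn.functional.affine_grid and torch.nn.functional.grid_sample implement this grid generation and the differentiable bilinear sampling for a batch of 2 x 3 matrices theta.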

18.
Equivariance vs Invariance
● CNNs try to make the neural activities invariant to small changes in viewpoint.
● But it is better to aim for equivariance: changes in viewpoint should lead to
corresponding changes in the neural activities (toy sketch below).
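
A toy NumPy illustration of the distinction (my sketch, not from the slides): a global max is invariant to a shift of the input, while the argmax, a crude “pose”, moves with it, i.e., it is equivariant:

```python
import numpy as np

x = np.zeros(10)
x[2] = 1.0                 # a "feature" detected at position 2
x_shift = np.roll(x, 3)    # the same feature, shifted to position 5

print(x.max(), x_shift.max())        # 1.0 1.0 -> activity is invariant
print(x.argmax(), x_shift.argmax())  # 2 5     -> "pose" is equivariant
```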

23.
Capsules. Step 1
● u - output vectors of the previous (lower-level) capsules.
○ The length encodes the probability that the corresponding object was detected.
○ The direction encodes some internal state of the detected object.
● W - affine transformation matrices. They encode important spatial and other
relationships between lower-level and higher-level features.
○ For example, matrix W12 may encode the relationship between a nose and a face.
● u_hat - predicted position of the higher-level feature (sketched below).
○ For example, u1_hat represents where the face should be according to the
detected position of the eyes.
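
A NumPy sketch of this step with assumed sizes (3 lower-level 8-D capsules, 2 higher-level 16-D capsules): each prediction vector is u_hat[i, j] = W[i, j] @ u[i].

```python
import numpy as np

num_low, num_high, d_low, d_high = 3, 2, 8, 16         # assumed sizes
u = np.random.randn(num_low, d_low)                    # outputs of lower-level capsules
W = np.random.randn(num_low, num_high, d_high, d_low)  # learned matrices W_ij

# u_hat[i, j] = W[i, j] @ u[i]: capsule i's prediction for higher-level capsule j
u_hat = np.einsum('ijkl,il->ijk', W, u)
print(u_hat.shape)  # (3, 2, 16)
```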

24.
Capsules. Step 2: scalar weighting of input vectors
[Figure: a softmax over one low-level capsule’s routing logits yields coupling
coefficients c11 and c12, so the capsule “sends more” of its output to Capsule j
and “sends less” to Capsule k]
● A lower-level capsule will send its output to the higher-level capsule that
“agrees” with that output.
● This is the essence of the dynamic routing algorithm (see the sketch below).
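
A sketch of the scalar weighting (same assumed sizes as before): each lower-level capsule’s routing logits b[i, :] go through a softmax over the higher-level capsules, so its coupling coefficients sum to 1 and decide how much it sends to each parent.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

b = np.zeros((3, 2))    # routing logits: 3 low-level x 2 high-level capsules
c = softmax(b, axis=1)  # coupling coefficients c_ij; each row sums to 1
print(c)                # initially uniform: [[0.5 0.5], ...] -> send equally
```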

27.
Dynamic Routing. General idea
● Once again: a capsule’s output is a vector = an internal representation of a feature.
● We want routing to favor capsule K over capsule J, as K’s output is closer to the
“red” cluster of prediction points (see the sketch below).
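
A compact NumPy sketch of routing-by-agreement as in the original CapsNet paper (arXiv:1710.09829); sizes and the iteration count are assumptions. Predictions that agree with a parent’s output v increase that parent’s routing logits, so over the iterations a lower capsule sends more to the “agreeing” parent, e.g., capsule K here:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-8):
    # Nonlinearity from the CapsNet paper: keeps direction, squashes length into [0, 1).
    sq = (s ** 2).sum(axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat, iters=3):
    # u_hat: (num_low, num_high, d_high) prediction vectors from the lower capsules.
    b = np.zeros(u_hat.shape[:2])               # routing logits, start at zero
    for _ in range(iters):
        c = softmax(b, axis=1)                  # coupling coefficients per low capsule
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum for each parent
        v = squash(s)                           # parent capsule output vectors
        b = b + (u_hat * v[None]).sum(axis=-1)  # agreement: b_ij += u_hat_ij . v_j
    return v

v = route(np.random.randn(3, 2, 16))
print(np.linalg.norm(v, axis=-1))  # lengths in [0, 1): detection probabilities
```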

32.
Loss function (Margin loss)
● Capsules use a separate margin loss for each category c in the picture:
$L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2$,
where $T_c = 1$ if an object of class c is present, $m^+ = 0.9$, and $m^- = 0.1$.
● The λ down-weighting (default 0.5) stops the initial learning from shrinking the
activity vectors of all classes.
● The total loss is the sum of the losses over all classes (sketched below).
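
A direct NumPy translation of this loss (the function name and array shapes are mine):

```python
import numpy as np

def margin_loss(v_norm, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    # v_norm: lengths ||v_c|| of the output capsules, shape (num_classes,)
    # T: indicator vector, T[c] = 1 if an object of class c is present, else 0
    L = (T * np.maximum(0.0, m_pos - v_norm) ** 2
         + lam * (1 - T) * np.maximum(0.0, v_norm - m_neg) ** 2)
    return L.sum()  # total loss: sum over all classes

# Example: class 0 present and detected, classes 1-2 absent
print(margin_loss(np.array([0.95, 0.2, 0.05]), np.array([1, 0, 0])))
```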