[Deep Learning Paper Notes][Attention] Spatial Transformer Networks
Jaderberg, Max, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. "Spatial Transformer Networks." Advances in Neural Information Processing Systems. 2015. (Citations: 116).
1 Motivation
SAT restricts attention to a fixed grid structure; the goal here is a model that can focus on any part of the image, not just on predefined grid cells.
Pooling makes a network somewhat spatially invariant to the position of features, and this property holds even though max-pooling typically has a small spatial support. However, spatial invariance is only realized over a deep hierarchy of max-pooling and convolutional layers, and the intermediate feature maps of a CNN are not actually invariant to large transformations of the input.
The paper introduces a spatial transformer module that learns to select key features (attention) and to apply scaling, cropping, rotation, and non-rigid deformations to them.
2 Spatial Transformers
We want a differentiable module that applies a spatial transformation to a feature map in a single forward pass. For each output pixel coordinate (x_t, y_t), it computes the corresponding source coordinate (x_s, y_s) in the input feature map.
The coordinates x_s and y_s are normalized to the interval [-1, 1]. This parameterization supports cropping, translation by offsets s_x and s_y, clockwise rotation by θ, scaling by a factor α about a reference point (x_0, y_0), and shearing along both axes.
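In the affine case, the paper parameterizes this pointwise transformation as a 2×3 matrix A_θ acting on the target coordinates in homogeneous form:

$$
\begin{pmatrix} x_s \\ y_s \end{pmatrix}
= A_\theta \begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}
$$

An attention-style crop with isotropic scale s and translation (t_x, t_y) is the special case A_θ = [[s, 0, t_x], [0, s, t_y]].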

For multi-channel inputs, each channel undergoes the same warping. Computing a source coordinate for every output pixel yields a sampling grid, from which the output is computed efficiently using bilinear interpolation.
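As a concrete illustration, here is a minimal NumPy sketch of bilinear sampling on a normalized grid; the function and variable names are my own, not from the paper:

```python
import numpy as np

def bilinear_sample(feature, grid_x, grid_y):
    # feature: (H, W) array for one channel; every channel reuses the grid.
    # grid_x, grid_y: source coordinates in normalized [-1, 1] space,
    # one entry per output pixel.
    H, W = feature.shape
    # Map normalized coordinates to pixel space.
    x = np.clip((grid_x + 1.0) * (W - 1) / 2.0, 0, W - 1)
    y = np.clip((grid_y + 1.0) * (H - 1) / 2.0, 0, H - 1)
    # Indices of the four neighbouring pixels.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    x1, y1 = x0 + 1, y0 + 1
    # Interpolation weights from the fractional parts.
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature[y0, x0] + wx * feature[y0, x1]
    bottom = (1 - wx) * feature[y1, x0] + wx * feature[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Because the output is a piecewise-linear function of the sampling coordinates, (sub-)gradients flow back through both the feature values and the sampling locations, which is what makes the module trainable end to end.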
3 Architecture
See the figure. One can also use multiple spatial transformers in parallel; this is useful when a feature map contains several objects or parts of interest that should be focused on individually. A limitation of this architecture in a purely feed-forward network is that the number of parallel spatial transformers bounds the number of objects the network can model.
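The module itself consists of three parts: a localization network that regresses the transformation parameters, a grid generator, and a sampler. Below is a minimal PyTorch sketch of the affine case; the layer sizes in the localization network are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    # Localization network -> grid generator -> sampler (affine case).
    def __init__(self, in_channels):
        super().__init__()
        # Localization network: regresses the 6 affine parameters theta.
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 6),
        )

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        # Grid generator: normalized sampling coordinates in [-1, 1].
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # Sampler: bilinear interpolation at the grid points.
        return F.grid_sample(x, grid, align_corners=False)
```

Several such modules can be inserted at different depths of a network, or run in parallel on the same feature map as noted above.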

4 Training Details
For training, we initialize the module to the identity transformation: the final layer of the localization network is initialized so that it outputs the identity transform. This makes the spatial transformer's initial output equal to its input.
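Continuing the hypothetical SpatialTransformer sketch from Section 3, identity initialization amounts to zeroing the weights of the final regression layer and setting its bias to the flattened 2×3 identity transform:

```python
import torch
import torch.nn as nn

stn = SpatialTransformer(in_channels=1)  # hypothetical module from Section 3
last = stn.loc[-1]                       # final nn.Linear(10, 6) regressor
nn.init.zeros_(last.weight)              # theta then depends only on the bias
with torch.no_grad():
    # Bias = flattened 2x3 identity: sampling grid equals the input grid.
    last.bias.copy_(torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
```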
5 Results
As shown in the figure, inserting spatial transformers into a classification network enables the network to learn how to attend to and transform its input.

