
[ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training


Table of Contents

  • 1. Contribution

  • 2. Summary

  • 3. Methods

    • 3.1 The overall ResMLP architecture
    • 3.2 The Residual Multi-Perceptron Layer
    • 3.3 Relationship to the Vision Transformer.
    • 3.4 Class-MLP: MLP with class embedding
  • 4. Experiment

    • 4.1 Comparison with Transformers and convnets in a supervised setting
    • 4.2 Improving model convergence with knowledge distillation
    • 4.3 Transfer learning
    • 4.4 Ablation Experiments
      • 4.4.1 Activation
      • 4.4.2 Normalization
  • 5. Pseudo PyTorch code

1. Contribution

This paper proposes Residual Multi-Layer Perceptrons (ResMLP).

We propose Residual Multi-Layer Perceptrons (ResMLP): a purely multi-layer perceptron (MLP) based architecture for image classification.

(i) a linear layer in which image patches interact, independently and identically across channels,

and (ii) a two-layer feed-forward network in which channels interact independently per patch.

Figure 1 shows the ResMLP architecture. The overall flow is: the network input, two residual blocks (a linear layer and an MLP with a single hidden layer), and finally an average pooling layer and a linear classifier:

  • it takes flattened patches as input, projects them with a linear layer, and sequentially updates them in turn with two residual operations:
  • (i) a simple linear layer that provides interaction between the patches, which is applied to all channels independently;
  • (ii) an MLP with a single hidden layer, which is independently applied to all patches.
  • At the end of the network, the patches are average pooled, and fed to a linear classifier.
    (Figure 1: The ResMLP architecture.)

2. Summary

  • ResMLP achieves a surprisingly good accuracy/complexity trade-off when trained on ImageNet-1k only.

Despite their simplicity, Residual Multi-Layer Perceptrons can reach surprisingly good accuracy/complexity trade-offs with ImageNet-1k training only, without requiring normalization based on batch or channel statistics;

  • The models benefit significantly from distillation methods.

These models benefit significantly from distillation methods.

  • Because patch embeddings only "communicate" through a linear layer, one can inspect what kind of spatial interactions the network learns from layer to layer (a small sketch of this follows below).

thanks to its design where patch embeddings simply "communicate" through a linear layer, we can make observations on what kind of spatial interaction the network learns across layers.
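
For instance, the cross-patch weights of a layer can be pulled out and viewed as spatial maps. A minimal sketch, assuming a 14 × 14 patch grid; cross_patch_weight here is only a stand-in for a trained layer's (N² × N²) matrix (e.g., the weight of the cross-patch linear module in the sketches below):

```python
import torch
import matplotlib.pyplot as plt

# Stand-in for the learned (N^2 x N^2) cross-patch matrix A of one layer;
# in practice this would come from a trained model's cross-patch linear layer.
N = 14                                        # 14 x 14 = 196 patches (assumption)
cross_patch_weight = torch.randn(N * N, N * N)

# Row i describes how patch i aggregates information from every other patch;
# reshaping it into the patch grid gives a spatial interaction map.
patch_index = 90                              # arbitrary patch to inspect
interaction_map = cross_patch_weight[patch_index].reshape(N, N)

plt.imshow(interaction_map.numpy())
plt.title(f"Learned spatial interactions for patch {patch_index}")
plt.show()
```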

3. Methods

3.1 The overall ResMLP architecture

Each ResMLP layer has the same structure: a linear sublayer followed by a feed-forward sublayer.

The input image is split into non-overlapping patches of size 16 × 16 pixels; each flattened patch is passed through a linear layer to produce an N × N × d grid of patch embeddings (N = 14 for a 224 × 224 image, i.e., N² = 196 patches).

These N² d-dimensional embeddings are then fed through the stack of ResMLP layers, which output embeddings of the same N × N × d shape. For a 224 × 224 × 3 image: 224 × 224 × 3 → 196 patches of 16 × 16 × 3 → 14 × 14 × d.

Finally, the N × N × d output is average-pooled into a d-dimensional vector, which goes through a linear classifier of shape [d, C] to produce the predicted label; the network is trained with a cross-entropy loss.
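
As a concrete shape walk-through of this pipeline, here is a minimal PyTorch sketch; the strided-convolution patch projector and all names are my own choices, and d = 384 is simply the ResMLP-S setting:

```python
import torch
import torch.nn as nn

# Sketch of the input/output pipeline, assuming 16x16 patches and d = 384.
patch_size, dim, num_classes = 16, 384, 1000
img = torch.randn(1, 3, 224, 224)

# A strided convolution is a common way to implement
# "flatten each patch and apply a shared linear layer".
patch_projector = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
x = patch_projector(img)                      # (1, 384, 14, 14)
x = x.flatten(2).transpose(1, 2)              # (1, 196, 384): N^2 patch embeddings

# ... the N^2 x d embeddings go through the ResMLP layers (same shape in and out) ...

pooled = x.mean(dim=1)                        # average pool over patches -> (1, 384)
logits = nn.Linear(dim, num_classes)(pooled)  # linear classifier [d, C] -> (1, 1000)
```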

3.2 The Residual Multi-Perceptron Layer

The authors replace the Layer Normalization of the Transformer layer with an Affine transformation, defined as:
$$\mathrm{Aff}_{\alpha,\beta}(x) = \mathrm{Diag}(\alpha)\,x + \beta \qquad (1)$$

Here, $\alpha$ and $\beta$ are learnable vectors. The operation simply rescales and shifts the input component-wise, and the Affine transformation has no cost at inference time.

This operation simply rescales and shifts the input component-wise.

Moreover, it has no cost at inference time, as it can be fused into the adjacent linear layer.

The paper points out that Aff is similar to the LayerScale method, which improves deep Transformers when $\alpha$ is initialized to a small value; the difference is that LayerScale has no bias term.

Here, $\alpha$ is initialized to 1 and $\beta$ to 0.
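
A minimal PyTorch sketch of this Aff operator (the module name and exact form are my own; the official implementation may differ slightly):

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Per-channel affine transform: Aff(x) = alpha * x + beta."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # initialized to 1
        self.beta = nn.Parameter(torch.zeros(dim))   # initialized to 0

    def forward(self, x):
        # x: (..., dim); each channel is rescaled and shifted independently
        return self.alpha * x + self.beta
```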

Overall, the input to the ResMLP layer is an N × N × d feature map, written as a matrix $X \in \mathbb{R}^{d \times N^2}$, and its output is an N × N × d feature map, written as a matrix $Y$, computed as follows:
$$Z = X + \mathrm{Aff}\big((A\,\mathrm{Aff}(X)^\top)^\top\big) \qquad (2)$$
$$Y = Z + \mathrm{Aff}\big(C\,\mathrm{GELU}(B\,\mathrm{Aff}(Z))\big) \qquad (3)$$

Here $A$, $B$ and $C$ are learnable weight matrices (i.e., plain linear layers). $A \in \mathbb{R}^{N^2 \times N^2}$ is a dimension-preserving linear map across the patches; $B$ and $C$ have the same shapes as in a Transformer feed-forward layer, namely $B \in \mathbb{R}^{4d \times d}$ and $C \in \mathbb{R}^{d \times 4d}$.

The intermediate activation matrix $Z$ has the same dimensions as $X$ and $Y$.
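
Combining Eq. (2) and (3), one ResMLP layer can be sketched as follows, operating on tensors of shape (batch, N², d). This assumes the Affine module from the previous sketch is in scope; the remaining names are my own, not the official code:

```python
import torch.nn as nn

class ResMLPBlock(nn.Module):
    """One ResMLP layer: cross-patch linear sublayer + per-patch MLP sublayer."""
    def __init__(self, num_patches, dim, expansion=4):
        super().__init__()
        # `Affine` is the per-channel affine module sketched above.
        self.pre_aff1, self.post_aff1 = Affine(dim), Affine(dim)
        self.pre_aff2, self.post_aff2 = Affine(dim), Affine(dim)
        self.cross_patch = nn.Linear(num_patches, num_patches)  # A in Eq. (2)
        self.linear_B = nn.Linear(dim, expansion * dim)          # B in Eq. (3)
        self.linear_C = nn.Linear(expansion * dim, dim)          # C in Eq. (3)
        self.act = nn.GELU()

    def forward(self, x):  # x: (batch, N^2, d)
        # Eq. (2): Z = X + Aff((A Aff(X)^T)^T) -- patches interact, channels independent
        z = x + self.post_aff1(
            self.cross_patch(self.pre_aff1(x).transpose(1, 2)).transpose(1, 2))
        # Eq. (3): Y = Z + Aff(C GELU(B Aff(Z))) -- channels interact, patches independent
        y = z + self.post_aff2(self.linear_C(self.act(self.linear_B(self.pre_aff2(z)))))
        return y
```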

The main difference compared to a Transformer layer is that we replace the self-attention by the linear interaction defined in Eq. (2).

Difference between the linear interaction layer and self-attention (whether the mixing coefficients are data-dependent):

While self-attention computes a convex combination of other features with coefficients that are data dependent, the linear interaction layer in Eq. (2) computes a general linear combination using learned coefficients that are not data dependent.

Difference between the linear patch interaction layer and a convolutional layer (global vs. local support, shared vs. unshared weights):

Personally, I understand "weight sharing" as whether different pixels of an H × W feature map within one channel use the same weights. For a fully connected layer over the N² = H × W positions, every position gets its own weights in the linear matrix, which is why the paper says the weights are not shared.

As compared to convolutional layers, which have local support and share weights across space, our linear patch interaction layer offers global support and does not share weights; moreover, it is applied independently across channels.
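
To make the contrast concrete, here is a toy comparison of the two kinds of layers (untrained, toy sizes; only the weight shapes and supports matter here):

```python
import torch
import torch.nn as nn

num_patches, dim, grid = 196, 384, 14            # toy sizes: 14x14 patches, d = 384
x = torch.randn(2, num_patches, dim)             # (batch, N^2, d)

# Linear patch interaction: one (N^2 x N^2) matrix with global support,
# a distinct weight per pair of positions (no spatial weight sharing),
# shared across channels, and independent of the input values.
patch_linear = nn.Linear(num_patches, num_patches)
y = patch_linear(x.transpose(1, 2)).transpose(1, 2)     # (2, N^2, d)

# Depth-wise 3x3 convolution for comparison: local 3x3 support, with the same
# small kernel reused at every spatial position (weights shared across space).
conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
y_conv = conv(x.transpose(1, 2).reshape(2, dim, grid, grid))

print(patch_linear.weight.shape)   # torch.Size([196, 196])
print(conv.weight.shape)           # torch.Size([384, 1, 3, 3])
```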

3.3 Relationship to the Vision Transformer.

ResMLP has no self-attention module.

We do not include any self-attention block. Instead we have a linear patch interaction layer without non-linearity.

ResMLP replaces the class token with average pooling.

We do not have the extra “class” token that is typically used in these models to aggregate information via attention. Instead, we simply use average pooling. We do, however, also consider a specific aggregation layer as a variant, which we describe in the next paragraph.

ResMLP has no positional embedding of any kind; similar to MLP-Mixer, the paper argues that the linear communication module implicitly accounts for patch positions.

Similarly, we do not include any form of positional embedding: it is not required as the linear communication module between patches implicitly takes into account the patch position.

The paper replaces pre-LayerNormalization with a simple learnable affine transform.

Instead of pre-LayerNormalization, we use a simple learnable affine transform, thus avoiding any form of batch and channel-wise statistics.

3.4 Class-MLP: MLP with class embedding

As an alternative to average pooling, the paper adapts class-attention (the mechanism from CaiT) but replaces the attention-based interaction with linear layers, which improves accuracy (76.6 → 77.5 and 79.4 → 79.9). (This corresponds to the class-MLP entry under "Pooling layer" in Table 4.)

4. Experiment

4.1 Comparison with Transformers and convnets in a supervised setting


4.2 Improving model convergence with knowledge distillation


4.3 Transfer learning


4.4 Ablation Experiments

4.4.1 Activation


4.4.2 Normalization


5. Pseudo PyTorch code

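Reusing the Affine and ResMLPBlock modules sketched in Section 3, a minimal end-to-end PyTorch sketch of the model could look as follows; the class and argument names are my own (not the paper's official code), and dim = 384, depth = 12 roughly corresponds to a ResMLP-S12-like configuration:

```python
import torch
import torch.nn as nn

class ResMLP(nn.Module):
    """Minimal sketch of the full model: patch projector, ResMLP blocks,
    average pooling and a linear classifier (see Section 3.1)."""
    def __init__(self, dim=384, depth=12, patch_size=16, img_size=224,
                 in_chans=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196
        # Linear patch projection, implemented as a strided convolution.
        self.patch_projector = nn.Conv2d(in_chans, dim,
                                         kernel_size=patch_size, stride=patch_size)
        # ResMLPBlock: the residual layer sketched in Section 3.2.
        self.blocks = nn.ModuleList(
            [ResMLPBlock(num_patches, dim) for _ in range(depth)])
        self.norm = Affine(dim)        # final per-channel affine before pooling (assumption)
        self.head = nn.Linear(dim, num_classes)              # linear classifier [d, C]

    def forward(self, x):                                    # x: (B, 3, H, W)
        x = self.patch_projector(x)                          # (B, d, N, N)
        x = x.flatten(2).transpose(1, 2)                     # (B, N^2, d)
        for blk in self.blocks:
            x = blk(x)
        x = self.norm(x).mean(dim=1)                         # average pool over patches
        return self.head(x)                                  # (B, num_classes)

# Usage sketch.
model = ResMLP(dim=384, depth=12)
logits = model(torch.randn(2, 3, 224, 224))                 # (2, 1000)
```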
