
[ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training


Table of Contents

  • 1. Contribution

  • 2. Summary

  • 3. Methods

    • 3.1 The overall ResMLP architecture
    • 3.2 The Residual Multi-Perceptron Layer
    • 3.3 Relationship to the Vision Transformer.
    • 3.4 Class-MLP: MLP with class embedding
  • 4. Experiment

    • 4.1 Comparison with Transformers and convnets in a supervised setting
    • 4.2 Improving model convergence with knowledge distillation
    • 4.3 Transfer learning
    • 4.4 Ablation Experiments
      • 4.4.1 Activation
      • 4.4.2 Normalization
  • 5. Pseudo PyTorch code

1. Contribution

This paper proposes Residual Multi-Layer Perceptrons (ResMLP).

We propose Residual Multi-Layer Perceptrons (ResMLP): a purely multi-layer perceptron (MLP) based architecture for image classification.

(i) a linear layer in which image patches interact, independently and identically across channels,

and (ii) a two-layer feed-forward network in which channels interact independently per patch.

Figure 1 shows the ResMLP architecture. The overall flow is: the network input, two residual blocks (a linear layer and an MLP with a single hidden layer), and finally an average pooling layer and a linear classifier:

  • it takes flattened patches as input, projects them with a linear layer, and sequentially updates them in turn with two residual operations:
  • (i) a simple linear layer that provides interaction between the patches, which is applied to all channels independently;
  • (ii) an MLP with a single hidden layer, which is independently applied to all patches.
  • At the end of the network, the patches are average pooled, and fed to a linear classifier.
    (Figure 1: The ResMLP architecture.)

2. Summary

  • ResMLP achieves a surprisingly good accuracy/complexity trade-off when trained on ImageNet-1k only.

Despite their simplicity, Residual Multi-Layer Perceptrons can reach surprisingly good accuracy/complexity trade-offs with ImageNet-1k training only, without requiring normalization based on batch or channel statistics;

  • The models benefit significantly from distillation methods.

These models benefit significantly from distillation methods.

  • Because patch embeddings only "communicate" through a linear layer, one can inspect what kind of spatial interactions the network learns from layer to layer (a small sketch of this follows below).

thanks to its design where patch embeddings simply "communicate" through a linear layer, we can make observations on what kind of spatial interaction the network learns across layers.
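
For instance, the cross-patch weights of a layer can be pulled out and viewed as spatial maps. A minimal sketch, assuming a 14 × 14 patch grid; cross_patch_weight here is only a stand-in for a trained layer's (N² × N²) matrix (e.g., the weight of the cross-patch linear module in the sketches below):

```python
import torch
import matplotlib.pyplot as plt

# Stand-in for the learned (N^2 x N^2) cross-patch matrix A of one layer;
# in practice this would come from a trained model's cross-patch linear layer.
N = 14                                        # 14 x 14 = 196 patches (assumption)
cross_patch_weight = torch.randn(N * N, N * N)

# Row i describes how patch i aggregates information from every other patch;
# reshaping it into the patch grid gives a spatial interaction map.
patch_index = 90                              # arbitrary patch to inspect
interaction_map = cross_patch_weight[patch_index].reshape(N, N)

plt.imshow(interaction_map.numpy())
plt.title(f"Learned spatial interactions for patch {patch_index}")
plt.show()
```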

3. Methods

3.1 The overall ResMLP architecture

Each ResMLP layer has the same structure: a linear sublayer followed by a feed-forward sublayer.

The input image is split into non-overlapping patches of size 16 × 16 pixels; each flattened patch is passed through a linear layer to produce an N × N × d grid of patch embeddings (N = 14 for a 224 × 224 image, i.e., N² = 196 patches).

These N² d-dimensional embeddings are then fed through the stack of ResMLP layers, which output embeddings of the same N × N × d shape. For a 224 × 224 × 3 image: 224 × 224 × 3 → 196 patches of 16 × 16 × 3 → 14 × 14 × d.

Finally, the N × N × d output is average-pooled into a d-dimensional vector, which goes through a linear classifier of shape [d, C] to produce the predicted label; the network is trained with a cross-entropy loss.
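
As a concrete shape walk-through of this pipeline, here is a minimal PyTorch sketch; the strided-convolution patch projector and all names are my own choices, and d = 384 is simply the ResMLP-S setting:

```python
import torch
import torch.nn as nn

# Sketch of the input/output pipeline, assuming 16x16 patches and d = 384.
patch_size, dim, num_classes = 16, 384, 1000
img = torch.randn(1, 3, 224, 224)

# A strided convolution is a common way to implement
# "flatten each patch and apply a shared linear layer".
patch_projector = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
x = patch_projector(img)                      # (1, 384, 14, 14)
x = x.flatten(2).transpose(1, 2)              # (1, 196, 384): N^2 patch embeddings

# ... the N^2 x d embeddings go through the ResMLP layers (same shape in and out) ...

pooled = x.mean(dim=1)                        # average pool over patches -> (1, 384)
logits = nn.Linear(dim, num_classes)(pooled)  # linear classifier [d, C] -> (1, 1000)
```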

3.2 The Residual Multi-Perceptron Layer

The authors replace the Layer Normalization of the Transformer layer with an Affine transformation, defined as:
$$\mathrm{Aff}_{\alpha,\beta}(x) = \mathrm{Diag}(\alpha)\,x + \beta \qquad (1)$$

Here, $\alpha$ and $\beta$ are learnable vectors. The operation simply rescales and shifts the input component-wise, and the Affine transformation has no cost at inference time.

This operation simply rescales and shifts the input component-wise.

Moreover, it has no cost at inference time, as it can be fused into the adjacent linear layer.

The paper points out that Aff is similar to the LayerScale method, which improves deep Transformers when $\alpha$ is initialized to a small value; the difference is that LayerScale has no bias term.

Here, $\alpha$ is initialized to 1 and $\beta$ to 0.
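
A minimal PyTorch sketch of this Aff operator (the module name and exact form are my own; the official implementation may differ slightly):

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Per-channel affine transform: Aff(x) = alpha * x + beta."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # initialized to 1
        self.beta = nn.Parameter(torch.zeros(dim))   # initialized to 0

    def forward(self, x):
        # x: (..., dim); each channel is rescaled and shifted independently
        return self.alpha * x + self.beta
```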

Overall, the input to the ResMLP layer is an N × N × d feature map, written as a matrix $X \in \mathbb{R}^{d \times N^2}$, and its output is an N × N × d feature map, written as a matrix $Y$, computed as follows:
$$Z = X + \mathrm{Aff}\big((A\,\mathrm{Aff}(X)^\top)^\top\big) \qquad (2)$$
$$Y = Z + \mathrm{Aff}\big(C\,\mathrm{GELU}(B\,\mathrm{Aff}(Z))\big) \qquad (3)$$

Here $A$, $B$ and $C$ are learnable weight matrices (i.e., plain linear layers). $A \in \mathbb{R}^{N^2 \times N^2}$ is a dimension-preserving linear map across the patches; $B$ and $C$ have the same shapes as in a Transformer feed-forward layer, namely $B \in \mathbb{R}^{4d \times d}$ and $C \in \mathbb{R}^{d \times 4d}$.

The intermediate activation matrix $Z$ has the same dimensions as $X$ and $Y$.
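
Combining Eq. (2) and (3), one ResMLP layer can be sketched as follows, operating on tensors of shape (batch, N², d). This assumes the Affine module from the previous sketch is in scope; the remaining names are my own, not the official code:

```python
import torch.nn as nn

class ResMLPBlock(nn.Module):
    """One ResMLP layer: cross-patch linear sublayer + per-patch MLP sublayer."""
    def __init__(self, num_patches, dim, expansion=4):
        super().__init__()
        # `Affine` is the per-channel affine module sketched above.
        self.pre_aff1, self.post_aff1 = Affine(dim), Affine(dim)
        self.pre_aff2, self.post_aff2 = Affine(dim), Affine(dim)
        self.cross_patch = nn.Linear(num_patches, num_patches)  # A in Eq. (2)
        self.linear_B = nn.Linear(dim, expansion * dim)          # B in Eq. (3)
        self.linear_C = nn.Linear(expansion * dim, dim)          # C in Eq. (3)
        self.act = nn.GELU()

    def forward(self, x):  # x: (batch, N^2, d)
        # Eq. (2): Z = X + Aff((A Aff(X)^T)^T) -- patches interact, channels independent
        z = x + self.post_aff1(
            self.cross_patch(self.pre_aff1(x).transpose(1, 2)).transpose(1, 2))
        # Eq. (3): Y = Z + Aff(C GELU(B Aff(Z))) -- channels interact, patches independent
        y = z + self.post_aff2(self.linear_C(self.act(self.linear_B(self.pre_aff2(z)))))
        return y
```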

The main difference compared to a Transformer layer is that we replace the self-attention by the linear interaction defined in Eq. (2).

Difference between the linear interaction layer and self-attention (whether the mixing coefficients are data-dependent):

While self-attention computes a convex combination of other features with coefficients that are data dependent, the linear interaction layer in Eq. (2) computes a general linear combination using learned coefficients that are not data dependent.

Difference between the linear patch interaction layer and a convolutional layer (global vs. local support, shared vs. unshared weights):

Personally, I understand "weight sharing" as whether different pixels of an H × W feature map within one channel use the same weights. For a fully connected layer over the N² = H × W positions, every position gets its own weights in the linear matrix, which is why the paper says the weights are not shared.

As compared to convolutional layers, which have local support and share weights across space, our linear patch interaction layer offers global support and does not share weights; moreover, it is applied independently across channels.
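
To make the contrast concrete, here is a toy comparison of the two kinds of layers (untrained, toy sizes; only the weight shapes and supports matter here):

```python
import torch
import torch.nn as nn

num_patches, dim, grid = 196, 384, 14            # toy sizes: 14x14 patches, d = 384
x = torch.randn(2, num_patches, dim)             # (batch, N^2, d)

# Linear patch interaction: one (N^2 x N^2) matrix with global support,
# a distinct weight per pair of positions (no spatial weight sharing),
# shared across channels, and independent of the input values.
patch_linear = nn.Linear(num_patches, num_patches)
y = patch_linear(x.transpose(1, 2)).transpose(1, 2)     # (2, N^2, d)

# Depth-wise 3x3 convolution for comparison: local 3x3 support, with the same
# small kernel reused at every spatial position (weights shared across space).
conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
y_conv = conv(x.transpose(1, 2).reshape(2, dim, grid, grid))

print(patch_linear.weight.shape)   # torch.Size([196, 196])
print(conv.weight.shape)           # torch.Size([384, 1, 3, 3])
```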

3.3 Relationship to the Vision Transformer.

ResMLP has no self-attention module.

We do not include any self-attention block. Instead we have a linear patch interaction layer without non-linearity.

ResMLP replaces the class token with average pooling.

We do not have the extra “class” token that is typically used in these models to aggregate information via attention. Instead, we simply use average pooling. We do, however, also consider a specific aggregation layer as a variant, which we describe in the next paragraph.

ResMLP has no positional embedding of any kind; similar to MLP-Mixer, the paper argues that the linear communication module implicitly accounts for patch positions.

Similarly, we do not include any form of positional embedding: it is not required as the linear communication module between patches implicitly takes into account the patch position.

The paper replaces pre-LayerNormalization with a simple learnable affine transform.

Instead of pre-LayerNormalization, we use a simple learnable affine transform, thus avoiding any form of batch and channel-wise statistics.

3.4 Class-MLP: MLP with class embedding

As an alternative to average pooling, the paper adapts class-attention (the mechanism from CaiT) but replaces the attention-based interaction with linear layers, which improves accuracy (76.6 → 77.5 and 79.4 → 79.9). (This corresponds to the class-MLP entry under "Pooling layer" in Table 4.)

4. Experiment

4.1 Comparison with Transformers and convnets in a supervised setting


4.2 Improving model convergence with knowledge distillation


4.3 Transfer learning


4.4 Ablation Experiments

4.4.1 Activation


4.4.2 Normalization


5. Pseudo PyTorch code

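Reusing the Affine and ResMLPBlock modules sketched in Section 3, a minimal end-to-end PyTorch sketch of the model could look as follows; the class and argument names are my own (not the paper's official code), and dim = 384, depth = 12 roughly corresponds to a ResMLP-S12-like configuration:

```python
import torch
import torch.nn as nn

class ResMLP(nn.Module):
    """Minimal sketch of the full model: patch projector, ResMLP blocks,
    average pooling and a linear classifier (see Section 3.1)."""
    def __init__(self, dim=384, depth=12, patch_size=16, img_size=224,
                 in_chans=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196
        # Linear patch projection, implemented as a strided convolution.
        self.patch_projector = nn.Conv2d(in_chans, dim,
                                         kernel_size=patch_size, stride=patch_size)
        # ResMLPBlock: the residual layer sketched in Section 3.2.
        self.blocks = nn.ModuleList(
            [ResMLPBlock(num_patches, dim) for _ in range(depth)])
        self.norm = Affine(dim)        # final per-channel affine before pooling (assumption)
        self.head = nn.Linear(dim, num_classes)              # linear classifier [d, C]

    def forward(self, x):                                    # x: (B, 3, H, W)
        x = self.patch_projector(x)                          # (B, d, N, N)
        x = x.flatten(2).transpose(1, 2)                     # (B, N^2, d)
        for blk in self.blocks:
            x = blk(x)
        x = self.norm(x).mean(dim=1)                         # average pool over patches
        return self.head(x)                                  # (B, num_classes)

# Usage sketch.
model = ResMLP(dim=384, depth=12)
logits = model(torch.randn(2, 3, 224, 224))                 # (2, 1000)
```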
