Deep Learning Paper: TResNet: High Performance GPU-Dedicated Architecture and Its PyTorch Implementation


TResNet: High Performance GPU-Dedicated Architecture
PDF: https://arxiv.org/abs/2003.13630.pdf
PyTorch code: https://github.com/shanglianlm0525/PyTorch-Networks

1 Overview

TResNet is designed to combine high GPU efficiency with high classification accuracy. With GPU throughput similar to that of ResNet50, a TResNet model reaches 80.7% top-1 accuracy on ImageNet.

2 TResNet Design

2-1 Stem Design

    import torch.nn as nn


    class SpaceToDepth(nn.Module):
        """Move each 4x4 spatial patch into the channel dimension."""

        def __init__(self, block_size=4):
            super().__init__()
            assert block_size == 4
            self.bs = block_size

        def forward(self, x):
            # H and W must be divisible by the block size.
            N, C, H, W = x.size()
            x = x.view(N, C, H // self.bs, self.bs, W // self.bs, self.bs)  # (N, C, H//bs, bs, W//bs, bs)
            x = x.permute(0, 3, 5, 1, 2, 4).contiguous()  # (N, bs, bs, C, H//bs, W//bs)
            x = x.view(N, C * (self.bs ** 2), H // self.bs, W // self.bs)  # (N, C*bs^2, H//bs, W//bs)
            return x

Reference: Non-discriminative data or weak model? On the relative importance of data and model resolution. arXiv preprint arXiv:1909.03205, 2019.
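To make the rearrangement concrete, here is a minimal functional sketch of the same view/permute/view sequence used by the SpaceToDepth module above. The function name `space_to_depth` and the 224x224 input size are assumptions chosen for illustration; they are not taken from the repository.

```python
import torch


def space_to_depth(x: torch.Tensor, block_size: int = 4) -> torch.Tensor:
    """Rearrange (N, C, H, W) into (N, C*bs^2, H//bs, W//bs)."""
    n, c, h, w = x.shape
    if h % block_size or w % block_size:
        raise ValueError("H and W must be divisible by block_size")
    x = x.view(n, c, h // block_size, block_size, w // block_size, block_size)
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
    return x.view(n, c * block_size ** 2, h // block_size, w // block_size)


images = torch.randn(2, 3, 224, 224)
stem_input = space_to_depth(images)
print(stem_input.shape)  # torch.Size([2, 48, 56, 56])

# Output channel k = (dy * bs + dx) * C + c holds pixel (dy, dx)
# of every 4x4 patch of input channel c.
dy, dx, c = 1, 2, 0
k = (dy * 4 + dx) * 3 + c
same = torch.equal(stem_input[:, k], images[:, c, dy::4, dx::4])
print(same)  # True
```

No pixel is discarded: the resolution drops by 4x in each spatial dimension while the channel count grows by 16x, so the first convolution after this layer sees all of the input information.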

2-2 Anti-Alias Downsampling (AA)

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class AADownsample(nn.Module):
        """Blur with a fixed 3x3 binomial filter, then subsample by `stride`."""

        def __init__(self, filt_size=3, stride=2, channels=None):
            super(AADownsample, self).__init__()
            assert channels is not None, "channels must be given to build the filter"
            self.filt_size = filt_size
            self.stride = stride
            self.channels = channels

            assert self.filt_size == 3
            a = torch.tensor([1., 2., 1.])

            # Outer product gives the 3x3 kernel; normalize so that it sums to 1.
            filt = (a[:, None] * a[None, :])
            filt = filt / torch.sum(filt)

            # One kernel copy per channel (depthwise conv); a buffer follows .to()/.cuda().
            self.register_buffer('filt', filt[None, None, :, :].repeat((self.channels, 1, 1, 1)))

        def forward(self, input):
            input_pad = F.pad(input, (1, 1, 1, 1), 'reflect')
            return F.conv2d(input_pad, self.filt, stride=self.stride, padding=0, groups=input.shape[1])

Reference: Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019.
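The blur-then-subsample step can also be sketched as a standalone function, which makes the filter easy to inspect. The function name `blur_downsample` and the tensor sizes are assumptions for illustration; the kernel, reflect padding, and depthwise convolution mirror the AADownsample module above.

```python
import torch
import torch.nn.functional as F


def blur_downsample(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Depthwise 3x3 binomial blur followed by strided subsampling."""
    channels = x.shape[1]
    a = torch.tensor([1.0, 2.0, 1.0], dtype=x.dtype, device=x.device)
    kernel = a[:, None] * a[None, :]
    kernel = kernel / kernel.sum()  # [[1,2,1],[2,4,2],[1,2,1]] / 16
    weight = kernel[None, None].repeat(channels, 1, 1, 1)
    x = F.pad(x, (1, 1, 1, 1), mode="reflect")
    return F.conv2d(x, weight, stride=stride, groups=channels)


x = torch.ones(1, 8, 56, 56)
y = blur_downsample(x)
print(y.shape)  # torch.Size([1, 8, 28, 28])
```

Because the kernel sums to 1, a constant input stays constant after the blur; only high-frequency content is attenuated before the stride-2 subsampling.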

2-3 In-Place Activated BatchNorm (Inplace-ABN)

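Replacing the conventional BatchNorm+ReLU layers with Inplace-ABN layers significantly reduces GPU memory consumption.
In addition, using Leaky-ReLU instead of ReLU as the activation function improves model performance
while keeping the computational overhead low.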

https://github.com/mapillary/inplace_abn

Reference: Samuel Rota Bulo, Lorenzo Porzi, and Peter Kontschieder. In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
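As a point of comparison, the layer pair that Inplace-ABN replaces can be written in plain PyTorch. This is only a functional stand-in: it computes BatchNorm followed by Leaky-ReLU but does not reproduce the memory savings of the inplace_abn package. The helper name `bn_leaky_relu` and the slope value 0.01 are assumptions, not values taken from the article.

```python
import torch
import torch.nn as nn


def bn_leaky_relu(channels: int, negative_slope: float = 0.01) -> nn.Sequential:
    """BatchNorm followed by Leaky-ReLU (plain PyTorch, not memory-optimized)."""
    return nn.Sequential(
        nn.BatchNorm2d(channels),
        # negative_slope is an illustrative assumption.
        nn.LeakyReLU(negative_slope=negative_slope, inplace=True),
    )


layer = bn_leaky_relu(64)
out = layer(torch.randn(4, 64, 56, 56))
print(out.shape)  # torch.Size([4, 64, 56, 56])
```

Unlike ReLU, Leaky-ReLU passes a scaled-down copy of negative inputs, so the output still contains negative values.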

2-4 Blocks Selection

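The figure shows, on the left, the basic residual block (BasicBlock) used by ResNet34 and, on the right, the bottleneck residual block (Bottleneck) used by ResNet50. Bottleneck layers have higher GPU usage than BasicBlock layers but usually give better accuracy, while BasicBlock layers have a larger receptive field, which makes them better suited to the early stages of a network. TResNet therefore uses BasicBlock layers in the first two stages and Bottleneck layers in the last two stages.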
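The stage layout described here can be sketched as follows. This is a simplified illustration, not the TResNet implementation: it uses plain BatchNorm and ReLU rather than Inplace-ABN, omits SE and anti-alias layers, and the widths, depths, and the names `conv_bn`, `ResidualBlock`, and `build_stages` are all assumptions.

```python
import torch
import torch.nn as nn


def conv_bn(in_ch, out_ch, kernel_size, stride=1):
    """Convolution followed by BatchNorm (no activation)."""
    padding = kernel_size // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False),
        nn.BatchNorm2d(out_ch),
    )


class ResidualBlock(nn.Module):
    """Shared residual wiring; subclasses only define the main branch."""
    expansion = 1

    def __init__(self, in_ch, planes, stride=1):
        super().__init__()
        out_ch = planes * self.expansion
        self.body = self.make_body(in_ch, planes, stride)
        # Project the shortcut only when the shape changes.
        if stride != 1 or in_ch != out_ch:
            self.shortcut = conv_bn(in_ch, out_ch, 1, stride)
        else:
            self.shortcut = nn.Identity()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))


class BasicBlock(ResidualBlock):
    """Two 3x3 convolutions (ResNet34 style)."""
    expansion = 1

    def make_body(self, in_ch, planes, stride):
        return nn.Sequential(
            conv_bn(in_ch, planes, 3, stride), nn.ReLU(inplace=True),
            conv_bn(planes, planes, 3),
        )


class Bottleneck(ResidualBlock):
    """1x1 reduce, 3x3, 1x1 expand (ResNet50 style)."""
    expansion = 4

    def make_body(self, in_ch, planes, stride):
        return nn.Sequential(
            conv_bn(in_ch, planes, 1), nn.ReLU(inplace=True),
            conv_bn(planes, planes, 3, stride), nn.ReLU(inplace=True),
            conv_bn(planes, planes * self.expansion, 1),
        )


def build_stages(in_ch=64, widths=(64, 128, 256, 512), depths=(2, 2, 2, 2)):
    # BasicBlock in the first two stages, Bottleneck in the last two.
    block_types = (BasicBlock, BasicBlock, Bottleneck, Bottleneck)
    layers = []
    for stage, (block, planes, depth) in enumerate(zip(block_types, widths, depths)):
        for i in range(depth):
            stride = 2 if (i == 0 and stage > 0) else 1
            layers.append(block(in_ch, planes, stride))
            in_ch = planes * block.expansion
    return nn.Sequential(*layers), in_ch


stages, out_channels = build_stages()
stages.eval()
with torch.no_grad():
    features = stages(torch.randn(1, 64, 56, 56))
print(features.shape)  # torch.Size([1, 2048, 7, 7])
```

The `expansion` attribute is what lets the same loop chain both block types: each block reports how many channels it emits, and the next block takes that as its input width.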

2-5 SE Layers

SE layers are added in the first three stages. The placement of the SE layers within the blocks is as follows:
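For readers unfamiliar with squeeze-and-excitation, a generic SE layer can be sketched as below. This is a textbook-style sketch rather than the TResNet code: the class name `SEModule` and the reduction ratio of 4 are assumptions, and the sketch does not show where the layer sits inside each block.

```python
import torch
import torch.nn as nn


class SEModule(nn.Module):
    """Squeeze-and-excitation: reweight channels using global context."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)  # reduction ratio is an assumption
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        # Squeeze: one value per channel via global average pooling.
        s = x.mean(dim=(2, 3), keepdim=True)
        # Excite: per-channel gate in (0, 1).
        s = self.gate(self.fc2(self.act(self.fc1(s))))
        return x * s


se = SEModule(128)
x = torch.randn(2, 128, 28, 28)
out = se(x)
print(out.shape)  # torch.Size([2, 128, 28, 28])
```

The gate only rescales each channel, so the output keeps the input shape and can be dropped into a residual branch without other changes.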

The proposed architecture is as follows:

3 Code Optimizations

3-1 JIT Compilation

JIT accelerated SpaceToDepth module

    import torch


    @torch.jit.script
    class SpaceToDepthJit(object):
        def __call__(self, x: torch.Tensor):
            # assuming hard-coded that block_size==4 for acceleration
            N, C, H, W = x.size()
            x = x.view(N, C, H // 4, 4, W // 4, 4)  # (N, C, H//bs, bs, W//bs, bs)
            x = x.permute(0, 3, 5, 1, 2, 4).contiguous()  # (N, bs, bs, C, H//bs, W//bs)
            x = x.view(N, C * 16, H // 4, W // 4)  # (N, C*bs^2, H//bs, W//bs)
            return x

JIT accelerated AA downsampling module

    import torch
    import torch.nn.functional as F


    @torch.jit.script
    class AADownsampleJIT(object):
        def __init__(self, filt_size: int = 3, stride: int = 2, channels: int = 0):
            self.stride = stride
            self.filt_size = filt_size
            self.channels = channels

            assert self.filt_size == 3
            assert stride == 2
            a = torch.tensor([1., 2., 1.])

            filt = (a[:, None] * a[None, :]).clone().detach()
            filt = filt / torch.sum(filt)
            # The filter is created on the GPU in half precision, so CUDA is required.
            self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half()

        def __call__(self, input: torch.Tensor):
            # Fall back to a float32 filter when the input is not half precision.
            if input.dtype != self.filt.dtype:
                self.filt = self.filt.float()
            input_pad = F.pad(input, (1, 1, 1, 1), 'reflect')
            return F.conv2d(input_pad, self.filt, stride=2, padding=0, groups=input.shape[1])

3-2 Fixed Global Average Pooling

AvgPool2d is faster than adaptive average pooling. An implementation that combines view and mean is faster still, reportedly up to five times faster than AvgPool2d.

    import torch.nn as nn


    class FastGlobalAvgPool2d(nn.Module):
        """Global average pooling implemented with view + mean."""

        def __init__(self, flatten=False):
            super(FastGlobalAvgPool2d, self).__init__()
            self.flatten = flatten

        def forward(self, x):
            if self.flatten:
                # Return (N, C).
                in_size = x.size()
                return x.view((in_size[0], in_size[1], -1)).mean(dim=2)
            else:
                # Return (N, C, 1, 1), matching nn.AdaptiveAvgPool2d(1).
                return x.view(x.size(0), x.size(1), -1).mean(-1).view(x.size(0), x.size(1), 1, 1)
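A quick way to confirm that the view + mean formulation is a drop-in replacement is to compare it numerically with `nn.AdaptiveAvgPool2d(1)`. The sketch below does only that; the function name `fast_global_avg_pool` and the tensor sizes are assumptions, and no timing is shown because the speedup depends on hardware and PyTorch version and should be measured on the target GPU.

```python
import torch
import torch.nn as nn


def fast_global_avg_pool(x: torch.Tensor, flatten: bool = False) -> torch.Tensor:
    """Global average pooling via reshape + mean."""
    n, c = x.shape[:2]
    pooled = x.reshape(n, c, -1).mean(dim=-1)
    return pooled if flatten else pooled.view(n, c, 1, 1)


x = torch.randn(8, 2048, 7, 7)
reference = nn.AdaptiveAvgPool2d(1)(x)
fast = fast_global_avg_pool(x)
flat = fast_global_avg_pool(x, flatten=True)

print(fast.shape, flat.shape)
print(torch.allclose(fast, reference, atol=1e-5))
```

The two results agree up to floating-point rounding, so the choice between them is purely a matter of speed.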

3-3 Inplace Operations

Inplace operations are used wherever applicable, for example in residual connections, SE layers, and the blocks' final activations.
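As an illustration of what an inplace residual connection and final activation look like, here is a minimal sketch. The function names are hypothetical and the example is not taken from the TResNet code; it only contrasts the out-of-place and inplace forms of the same computation.

```python
import torch
import torch.nn.functional as F


def residual_out_of_place(out: torch.Tensor, shortcut: torch.Tensor) -> torch.Tensor:
    # Allocates one tensor for the sum and another for the activation.
    return F.relu(out + shortcut)


def residual_in_place(out: torch.Tensor, shortcut: torch.Tensor) -> torch.Tensor:
    # Reuses the memory of `out` for both the sum and the activation.
    out += shortcut
    return F.relu(out, inplace=True)


a = torch.randn(2, 16, 8, 8)
b = torch.randn(2, 16, 8, 8)

expected = residual_out_of_place(a, b)
buf = a.clone()
result = residual_in_place(buf, b)

print(torch.equal(result, expected))         # True
print(result.data_ptr() == buf.data_ptr())   # True
```

Both forms give identical values; the inplace form simply avoids the intermediate allocations. During training, an inplace operation is only safe when autograd does not need the overwritten values for the backward pass, which is why such operations are applied at selected positions rather than everywhere.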

4 Experimental Results

4-1 Basic

4-2 Ablation
