YOLOv11 Improvements | Backbone | RepViT: Revisiting CNNs for Object Detection from a Vision Transformer (ViT) Perspective (compatible with the full YOLOv11 series)


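1. Introduction

This article presents RepViT as a replacement for the entire YOLOv11 backbone. Released this year as one of the notable new backbones, RepViT applies the architectural choices of lightweight vision transformers (ViTs) to a conventional lightweight convolutional neural network (CNN). I use it as the Backbone of YOLOv11, and the change brings a substantial reduction in parameter count. For the experiments I trained the lightest variant of the modified model on a dataset of 1,000 images containing large, medium, and small targets, and every target category improved to some degree. A figure comparing the baseline and the modified model illustrates the effect.

(The backbone in this article can additionally be scaled to match the YOLOv11 N, S, M, L, and X variants, which makes the model lighter still.)

Column review: YOLOv11 Improvement Series, a column that continuously reproduces top-conference work; essential for research.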

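1. Introduction

2. How RepViT Works

3. RepViT Core Code

4. Step by Step: Adding the RepViT Network

4.1 Modification 1

4.2 Modification 2

4.3 Modification 3

4.4 Modification 4

4.5 Modification 5

4.6 Modification 6

4.7 Modification 7

4.8 Modification 8

Note: An Additional Modification

Fixing the FLOPs Printing Problem

Notes

5. RepViT yaml File

5.1 RepViT yaml File, Version 1

5.2 Training Script

6. Successful Run Record

7. Summary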

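2. How RepViT Works

Official paper: RepViT: Revisiting Mobile CNN From ViT Perspective

Official code: https://github.com/THU-MIG/RepViT/blob/main/model/repvit.py

The paper revisits lightweight CNNs such as MobileNetV3 from the perspective of vision transformers. It starts from two observations about models for mobile devices: existing lightweight CNNs are limited in performance, and existing lightweight ViTs, although they learn global features effectively, still have shortcomings of their own. The authors therefore take MobileNetV3 as a starting point and improve it step by step by adopting the efficient architectural choices of lightweight ViTs. The result is a new family of pure CNN architectures named RepViT. The experiments show that RepViT models outperform existing state-of-the-art lightweight ViTs on several vision tasks.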

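Its main mechanisms are:

Structural re-parameterization (SR): a multi-branch topology is used during training to improve learning, and the branches are merged into a single convolution for inference so that they add no inference cost.

Expansion ratio adjustment: lowering the expansion ratio of the channel mixer (the code in Section 3 fixes it at 2) reduces computation and latency, and the network is widened in exchange to recover accuracy.

Macro design: the overall architecture is tuned, including an early-convolution stem, deeper downsampling layers, a simpler classifier, and adjusted ratios between stages.

Micro design: fine-grained choices, namely the convolution kernel size and the placement of the squeeze-and-excitation (SE) layers.

Together these mechanisms improve both the accuracy and the efficiency of lightweight CNNs, which makes them better suited to mobile devices. The structure diagram from the official paper gives a first overview.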
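To make the re-parameterization idea concrete, here is a minimal pure-Python sketch. It assumes a single channel, no BatchNorm, and arbitrary illustrative values, and the helper names `conv3x3` and `merge_branches` are hypothetical. It checks that a 3x3 depthwise branch, a 1x1 depthwise branch, and an identity shortcut collapse into one 3x3 kernel, which is the kind of merge that `RepVGGDW.fuse_self` performs in Section 3.

```python
def conv3x3(img, k):
    """3x3 cross-correlation with zero padding 1 on a 2D list (one channel)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    yy, xx = y + dy - 1, x + dx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += k[dy][dx] * img[yy][xx]
            out[y][x] = acc
    return out


def merge_branches(k3, w1):
    """Fold the 1x1 weight and the identity shortcut into the kernel centre."""
    merged = [row[:] for row in k3]
    merged[1][1] += w1 + 1.0
    return merged


# Illustrative values only (not taken from a trained model).
img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.0, -0.1], [0.2, 0.5, -0.2], [0.1, 0.0, -0.1]]
w1 = 0.3

# Training-time view: three parallel branches summed together.
branch3 = conv3x3(img, k3)
multi = [[branch3[y][x] + w1 * img[y][x] + img[y][x] for x in range(3)]
         for y in range(3)]

# Inference-time view: a single merged 3x3 kernel.
single = conv3x3(img, merge_branches(k3, w1))

max_diff = max(abs(multi[y][x] - single[y][x])
               for y in range(3) for x in range(3))
print(max_diff < 1e-9)
```

The two views agree up to floating-point rounding, which is why the extra branches cost nothing once the model is fused.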

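The diagram is Figure 3 of the paper and shows the overall design. The network consists of a stem followed by four stages. The stem reduces the input to 1/4 of its resolution, and each later stage begins with a downsampling block, so the four stages operate at 1/4, 1/8, 1/16, and 1/32 of the input resolution.

Ci denotes the channel dimension of stage i, and B denotes the batch size.

Stem: the module that first processes the input image.

Stage1-4: each stage stacks several RepViTBlocks and, optionally, RepViTSEBlocks (3x3 depthwise convolution, 1x1 convolution, squeeze-and-excitation module, and feed-forward network); the spatial size shrinks stage by stage through downsampling.
Pooling: global average pooling that collapses the spatial dimensions of the feature map.
FC: a linear layer that outputs the classification result.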
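As a quick check of the stage layout, the sketch below computes the feature-map side length after the stem and after each later stage. It assumes a square input and relies on the strides used by the code in Section 3 (two stride-2 convolutions in the stem, one stride-2 block at the start of stages 2 to 4); the function name `stage_resolutions` is hypothetical.

```python
def stage_resolutions(size, stem_stride=4, num_stages=4):
    """Return the feature-map side length of each of the four stages."""
    sizes = [size // stem_stride]        # stem: two stride-2 convolutions
    for _ in range(num_stages - 1):
        sizes.append(sizes[-1] // 2)     # stages 2-4 start with a stride-2 block
    return sizes


print(stage_resolutions(640))  # [160, 80, 40, 20]
```

For a 640x640 input the last three values correspond to the P3, P4, and P5 levels (strides 8, 16, and 32) that the detection head consumes.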

Summary: RepViT can be viewed as an improved member of the MobileNet family.

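3. RepViT Core Code

The code below is the core RepViT implementation. It provides several variants that differ in parameter count and computational cost; see Section 4 for how to use it.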

import torch
import torch.nn as nn
# Newer timm releases also expose this layer as timm.layers.SqueezeExcite.
from timm.models.layers import SqueezeExcite

__all__ = ['repvit_m0_6', 'repvit_m0_9', 'repvit_m1_0', 'repvit_m1_1', 'repvit_m1_5', 'repvit_m2_3']


def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v: requested channel count
    :param divisor: value the result must be divisible by
    :param min_value: lower bound for the result (defaults to divisor)
    :return: adjusted channel count
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


class Conv2d_BN(torch.nn.Sequential):
    """Convolution followed by BatchNorm; fuse_self() folds both into one conv."""

    def __init__(self, a, b, ks=1, stride=1, pad=0, dilation=1,
                 groups=1, bn_weight_init=1, resolution=-10000):
        super().__init__()
        self.add_module('c', torch.nn.Conv2d(
            a, b, ks, stride, pad, dilation, groups, bias=False))
        self.add_module('bn', torch.nn.BatchNorm2d(b))
        torch.nn.init.constant_(self.bn.weight, bn_weight_init)
        torch.nn.init.constant_(self.bn.bias, 0)

    @torch.no_grad()
    def fuse_self(self):
        c, bn = self._modules.values()
        # Scale the conv weights by the BN scale and derive the matching bias.
        w = bn.weight / (bn.running_var + bn.eps) ** 0.5
        w = c.weight * w[:, None, None, None]
        b = bn.bias - bn.running_mean * bn.weight / \
            (bn.running_var + bn.eps) ** 0.5
        m = torch.nn.Conv2d(w.size(1) * self.c.groups, w.size(0), w.shape[2:],
                            stride=self.c.stride, padding=self.c.padding,
                            dilation=self.c.dilation, groups=self.c.groups,
                            device=c.weight.device)
        m.weight.data.copy_(w)
        m.bias.data.copy_(b)
        return m


class Residual(torch.nn.Module):
    """Residual wrapper with optional stochastic drop of the wrapped branch."""

    def __init__(self, m, drop=0.):
        super().__init__()
        self.m = m
        self.drop = drop

    def forward(self, x):
        if self.training and self.drop > 0:
            return x + self.m(x) * torch.rand(x.size(0), 1, 1, 1,
                                              device=x.device).ge_(self.drop).div(1 - self.drop).detach()
        else:
            return x + self.m(x)

    @torch.no_grad()
    def fuse_self(self):
        if isinstance(self.m, Conv2d_BN):
            m = self.m.fuse_self()
            assert (m.groups == m.in_channels)
            identity = torch.ones(m.weight.shape[0], m.weight.shape[1], 1, 1)
            identity = torch.nn.functional.pad(identity, [1, 1, 1, 1])
            m.weight += identity.to(m.weight.device)
            return m
        elif isinstance(self.m, torch.nn.Conv2d):
            m = self.m
            assert (m.groups != m.in_channels)
            identity = torch.ones(m.weight.shape[0], m.weight.shape[1], 1, 1)
            identity = torch.nn.functional.pad(identity, [1, 1, 1, 1])
            m.weight += identity.to(m.weight.device)
            return m
        else:
            return self


class RepVGGDW(torch.nn.Module):
    """Depthwise 3x3 + depthwise 1x1 + identity, followed by BatchNorm."""

    def __init__(self, ed) -> None:
        super().__init__()
        self.conv = Conv2d_BN(ed, ed, 3, 1, 1, groups=ed)
        self.conv1 = torch.nn.Conv2d(ed, ed, 1, 1, 0, groups=ed)
        self.dim = ed
        self.bn = torch.nn.BatchNorm2d(ed)

    def forward(self, x):
        return self.bn((self.conv(x) + self.conv1(x)) + x)

    @torch.no_grad()
    def fuse_self(self):
        conv = self.conv.fuse_self()
        conv1 = self.conv1

        conv_w = conv.weight
        conv_b = conv.bias
        conv1_w = conv1.weight
        conv1_b = conv1.bias

        # Pad the 1x1 kernel and the identity to 3x3 so all branches can be summed.
        conv1_w = torch.nn.functional.pad(conv1_w, [1, 1, 1, 1])
        identity = torch.nn.functional.pad(
            torch.ones(conv1_w.shape[0], conv1_w.shape[1], 1, 1, device=conv1_w.device),
            [1, 1, 1, 1])

        final_conv_w = conv_w + conv1_w + identity
        final_conv_b = conv_b + conv1_b

        conv.weight.data.copy_(final_conv_w)
        conv.bias.data.copy_(final_conv_b)

        # Fold the trailing BatchNorm into the merged convolution.
        bn = self.bn
        w = bn.weight / (bn.running_var + bn.eps) ** 0.5
        w = conv.weight * w[:, None, None, None]
        b = bn.bias + (conv.bias - bn.running_mean) * bn.weight / \
            (bn.running_var + bn.eps) ** 0.5
        conv.weight.data.copy_(w)
        conv.bias.data.copy_(b)
        return conv


class RepViTBlock(nn.Module):
    def __init__(self, inp, hidden_dim, oup, kernel_size, stride, use_se, use_hs):
        super(RepViTBlock, self).__init__()
        assert stride in [1, 2]

        self.identity = stride == 1 and inp == oup
        assert (hidden_dim == 2 * inp)

        if stride == 2:
            # Downsampling block: strided depthwise conv, then 1x1 projection.
            self.token_mixer = nn.Sequential(
                Conv2d_BN(inp, inp, kernel_size, stride, (kernel_size - 1) // 2, groups=inp),
                SqueezeExcite(inp, 0.25) if use_se else nn.Identity(),
                Conv2d_BN(inp, oup, ks=1, stride=1, pad=0)
            )
            self.channel_mixer = Residual(nn.Sequential(
                # pw
                Conv2d_BN(oup, 2 * oup, 1, 1, 0),
                nn.GELU() if use_hs else nn.GELU(),  # both settings use GELU
                # pw-linear
                Conv2d_BN(2 * oup, oup, 1, 1, 0, bn_weight_init=0),
            ))
        else:
            self.token_mixer = nn.Sequential(
                RepVGGDW(inp),
                SqueezeExcite(inp, 0.25) if use_se else nn.Identity(),
            )
            self.channel_mixer = Residual(nn.Sequential(
                # pw
                Conv2d_BN(inp, hidden_dim, 1, 1, 0),
                nn.GELU() if use_hs else nn.GELU(),  # both settings use GELU
                # pw-linear
                Conv2d_BN(hidden_dim, oup, 1, 1, 0, bn_weight_init=0),
            ))

    def forward(self, x):
        return self.channel_mixer(self.token_mixer(x))


class RepViT(nn.Module):
    def __init__(self, cfgs, factor):
        super(RepViT, self).__init__()
        # setting of inverted residual blocks; channel widths are scaled by `factor`
        cfgs = [sublist[:2] + [_make_divisible(int(sublist[2] * factor), 8)] + sublist[3:] for sublist in cfgs]
        self.cfgs = cfgs
        # building first layer
        input_channel = self.cfgs[0][2]
        patch_embed = torch.nn.Sequential(Conv2d_BN(3, input_channel // 2, 3, 2, 1), torch.nn.GELU(),
                                          Conv2d_BN(input_channel // 2, input_channel, 3, 2, 1))
        layers = [patch_embed]
        # building inverted residual blocks
        block = RepViTBlock
        for k, t, c, use_se, use_hs, s in self.cfgs:
            output_channel = _make_divisible(c, 8)
            exp_size = _make_divisible(input_channel * t, 8)
            layers.append(block(input_channel, exp_size, output_channel, k, s, use_se, use_hs))
            input_channel = output_channel
        self.features = nn.ModuleList(layers)
        # Channel count of each returned feature map, read by parse_model.
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]

    def forward(self, x):
        # Collect the last feature map of each of the four stages.
        results = [None, None, None, None]
        temp = None
        i = None
        for index, f in enumerate(self.features):
            x = f(x)
            if index == 0:
                temp = x.size(1)
                i = 0
            elif x.size(1) == temp:
                results[i] = x
            else:
                # Channel count changed: a new stage begins.
                temp = x.size(1)
                i = i + 1
        return results

    
def repvit_m0_6(factor):
    """
    Constructs the RepViT-M0.6 backbone; `factor` scales the channel widths.
    """
    cfgs = [
        # k, t, c, SE, HS, s
        [3, 2, 40, 1, 0, 1],
        [3, 2, 40, 0, 0, 1],
        [3, 2, 80, 0, 0, 2],
        [3, 2, 80, 1, 0, 1],
        [3, 2, 80, 0, 0, 1],
        [3, 2, 160, 0, 1, 2],
        [3, 2, 160, 1, 1, 1],
        [3, 2, 160, 0, 1, 1],
        [3, 2, 160, 1, 1, 1],
        [3, 2, 160, 0, 1, 1],
        [3, 2, 160, 1, 1, 1],
        [3, 2, 160, 0, 1, 1],
        [3, 2, 160, 1, 1, 1],
        [3, 2, 160, 0, 1, 1],
        [3, 2, 160, 0, 1, 1],
        [3, 2, 320, 0, 1, 2],
        [3, 2, 320, 1, 1, 1],
    ]
    model = RepViT(cfgs, factor)
    return model


def repvit_m0_9(factor):
    """
    Constructs the RepViT-M0.9 backbone; `factor` scales the channel widths.
    """
    cfgs = [
        # k, t, c, SE, HS, s
        [3, 2, 48, 1, 0, 1],
        [3, 2, 48, 0, 0, 1],
        [3, 2, 48, 0, 0, 1],
        [3, 2, 96, 0, 0, 2],
        [3, 2, 96, 1, 0, 1],
        [3, 2, 96, 0, 0, 1],
        [3, 2, 96, 0, 0, 1],
        [3, 2, 192, 0, 1, 2],
        [3, 2, 192, 1, 1, 1],
        [3, 2, 192, 0, 1, 1],
        [3, 2, 192, 1, 1, 1],
        [3, 2, 192, 0, 1, 1],
        [3, 2, 192, 1, 1, 1],
        [3, 2, 192, 0, 1, 1],
        [3, 2, 192, 1, 1, 1],
        [3, 2, 192, 0, 1, 1],
        [3, 2, 192, 1, 1, 1],
        [3, 2, 192, 0, 1, 1],
        [3, 2, 192, 1, 1, 1],
        [3, 2, 192, 0, 1, 1],
        [3, 2, 192, 1, 1, 1],
        [3, 2, 192, 0, 1, 1],
        [3, 2, 192, 0, 1, 1],
        [3, 2, 384, 0, 1, 2],
        [3, 2, 384, 1, 1, 1],
        [3, 2, 384, 0, 1, 1]
    ]
    model = RepViT(cfgs, factor)
    return model


def repvit_m1_0(factor):
    """
    Constructs the RepViT-M1.0 backbone; `factor` scales the channel widths.
    """
    cfgs = [
        # k, t, c, SE, HS, s
        [3, 2, 56, 1, 0, 1],
        [3, 2, 56, 0, 0, 1],
        [3, 2, 56, 0, 0, 1],
        [3, 2, 112, 0, 0, 2],
        [3, 2, 112, 1, 0, 1],
        [3, 2, 112, 0, 0, 1],
        [3, 2, 112, 0, 0, 1],
        [3, 2, 224, 0, 1, 2],
        [3, 2, 224, 1, 1, 1],
        [3, 2, 224, 0, 1, 1],
        [3, 2, 224, 1, 1, 1],
        [3, 2, 224, 0, 1, 1],
        [3, 2, 224, 1, 1, 1],
        [3, 2, 224, 0, 1, 1],
        [3, 2, 224, 1, 1, 1],
        [3, 2, 224, 0, 1, 1],
        [3, 2, 224, 1, 1, 1],
        [3, 2, 224, 0, 1, 1],
        [3, 2, 224, 1, 1, 1],
        [3, 2, 224, 0, 1, 1],
        [3, 2, 224, 1, 1, 1],
        [3, 2, 224, 0, 1, 1],
        [3, 2, 224, 0, 1, 1],
        [3, 2, 448, 0, 1, 2],
        [3, 2, 448, 1, 1, 1],
        [3, 2, 448, 0, 1, 1]
    ]
    model = RepViT(cfgs, factor=factor)
    return model


def repvit_m1_1(factor):
    """
    Constructs the RepViT-M1.1 backbone; `factor` scales the channel widths.
    """
    cfgs = [
        # k, t, c, SE, HS, s
        [3, 2, 64, 1, 0, 1],
        [3, 2, 64, 0, 0, 1],
        [3, 2, 64, 0, 0, 1],
        [3, 2, 128, 0, 0, 2],
        [3, 2, 128, 1, 0, 1],
        [3, 2, 128, 0, 0, 1],
        [3, 2, 128, 0, 0, 1],
        [3, 2, 256, 0, 1, 2],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 512, 0, 1, 2],
        [3, 2, 512, 1, 1, 1],
        [3, 2, 512, 0, 1, 1]
    ]
    model = RepViT(cfgs, factor=factor)
    return model


def repvit_m1_5(factor):
    """
    Constructs the RepViT-M1.5 backbone; `factor` scales the channel widths.
    """
    cfgs = [
        # k, t, c, SE, HS, s
        [3, 2, 64, 1, 0, 1],
        [3, 2, 64, 0, 0, 1],
        [3, 2, 64, 1, 0, 1],
        [3, 2, 64, 0, 0, 1],
        [3, 2, 64, 0, 0, 1],
        [3, 2, 128, 0, 0, 2],
        [3, 2, 128, 1, 0, 1],
        [3, 2, 128, 0, 0, 1],
        [3, 2, 128, 1, 0, 1],
        [3, 2, 128, 0, 0, 1],
        [3, 2, 128, 0, 0, 1],
        [3, 2, 256, 0, 1, 2],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 1, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 256, 0, 1, 1],
        [3, 2, 512, 0, 1, 2],
        [3, 2, 512, 1, 1, 1],
        [3, 2, 512, 0, 1, 1],
        [3, 2, 512, 1, 1, 1],
        [3, 2, 512, 0, 1, 1]
    ]
    model = RepViT(cfgs, factor=factor)
    return model


def repvit_m2_3(factor):
    """
    Constructs the RepViT-M2.3 backbone; `factor` scales the channel widths.
    """
    cfgs = [
        # k, t, c, SE, HS, s
        [3, 2, 80, 1, 0, 1],
        [3, 2, 80, 0, 0, 1],
        [3, 2, 80, 1, 0, 1],
        [3, 2, 80, 0, 0, 1],
        [3, 2, 80, 1, 0, 1],
        [3, 2, 80, 0, 0, 1],
        [3, 2, 80, 0, 0, 1],
        [3, 2, 160, 0, 0, 2],
        [3, 2, 160, 1, 0, 1],
        [3, 2, 160, 0, 0, 1],
        [3, 2, 160, 1, 0, 1],
        [3, 2, 160, 0, 0, 1],
        [3, 2, 160, 1, 0, 1],
        [3, 2, 160, 0, 0, 1],
        [3, 2, 160, 0, 0, 1],
        [3, 2, 320, 0, 1, 2],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 320, 1, 1, 1],
        [3, 2, 320, 0, 1, 1],
        # [3,   2, 320, 1, 1, 1],
        # [3,   2, 320, 0, 1, 1],
        [3, 2, 320, 0, 1, 1],
        [3, 2, 640, 0, 1, 2],
        [3, 2, 640, 1, 1, 1],
        [3, 2, 640, 0, 1, 1],
        # [3,   2, 640, 1, 1, 1],
        # [3,   2, 640, 0, 1, 1]
    ]
    model = RepViT(cfgs, factor=factor)
    return model


if __name__ == '__main__':
    # Quick check: print the shape of each of the four returned feature maps.
    model = repvit_m0_6(factor=0.25)
    inputs = torch.randn((1, 3, 640, 640))
    for i in model(inputs):
        print(i.size())
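The effect of `factor` on channel widths can be verified by hand. The sketch below re-implements the rounding rule of `_make_divisible` from the listing above in plain Python (no PyTorch needed) and applies it to the four stage widths of `repvit_m0_6`; the helper name `scaled_widths` is hypothetical.

```python
def make_divisible(v, divisor=8, min_value=None):
    """Same rounding rule as `_make_divisible` in the listing above."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:          # never round down by more than 10%
        new_v += divisor
    return new_v


def scaled_widths(base_channels, factor):
    """Stage widths after scaling, mirroring the first line of RepViT.__init__."""
    return [make_divisible(int(c * factor), 8) for c in base_channels]


base = [40, 80, 160, 320]        # stage widths of repvit_m0_6
print(scaled_widths(base, 0.25))  # [16, 24, 40, 80]
print(scaled_widths(base, 0.5))   # [24, 40, 80, 160]
print(scaled_widths(base, 1.0))   # [40, 80, 160, 320]
```

These lists are what `width_list` holds for the corresponding factor, so they are also the channel counts that the modified `parse_model` in Section 4 receives from the backbone.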

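4. Step by Step: Adding the RepViT Network

4.1 Modification 1

First, create a directory named 'Addmodules' under ultralytics/nn (skip this if you are using the files shared in the group, where it already exists). Then create a new .py file inside it and paste in the core code from Section 3.

4.2 Modification 2

Next, create an __init__.py file in that directory (again, skip this if you are using the group files) and import the new module in it, as shown in the figure.

4.3 Modification 3

Third, open 'ultralytics/nn/tasks.py' to import and register the module (if you are using the group files, the import is already in place and you can go straight to step 4).

From this article onward, all tutorials in the series follow this same format, on the assumption that readers start from the provided files.

4.4 Modification 4

Add the following two lines of code, as shown in the figure.

4.5 Modification 5

Around line 700 of tasks.py, follow the screenshot and add the branch below. Note that the set contains bare function names, without parentheses.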

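    elif m in {repvit_m0_6, repvit_m0_9, repvit_m1_0, repvit_m1_1, repvit_m1_5, repvit_m2_3}:  # list your own backbone builders here; the body stays the same
        m = m(*args)
        c2 = m.width_list  # list of output channels, one per stage
        backbone = True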

4.6 Modification 6

Both parts marked by red boxes in the figure need to be changed.

    if isinstance(c2, list):
        # Backbone case: m is already an instantiated module.
        m_ = m
        m_.backbone = True
    else:
        m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
        t = str(m)[8:-2].replace('__main__.', '')  # module type

    m.np = sum(x.numel() for x in m_.parameters())  # number params
    m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type

4.7 Modification 7

The following part also needs to be changed; follow it exactly.

Replace the original code with the code below.

    if verbose:
        LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f}  {t:<45}{str(args):<30}')  # print

    save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
    layers.append(m_)
    if i == 0:
        ch = []
    if isinstance(c2, list):
        ch.extend(c2)
        if len(c2) != 5:
            ch.insert(0, 0)  # pad so the four stage outputs sit at indices 1-4
    else:
        ch.append(c2)
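The index and channel bookkeeping introduced in Modifications 5 to 7 can be traced without Ultralytics. The following is a simplified stand-in for that part of `parse_model`, not the real function: `assign_indices` is a hypothetical name, and the widths `[24, 40, 80, 160]` and the 256 output channels are illustrative values only.

```python
def assign_indices(entries, backbone_width_list):
    """Mimic the layer-index and channel-list bookkeeping of Modifications 5-7.

    entries: list of (name, is_backbone, out_channels) in yaml order.
    """
    ch, indices, backbone = [], {}, False
    for i, (name, is_backbone, c2) in enumerate(entries):
        if is_backbone:
            c2 = list(backbone_width_list)   # the backbone reports a channel list
            backbone = True
        # Once a backbone was seen, every layer index is shifted by 4.
        indices[name] = i + 4 if backbone else i
        if isinstance(c2, list):
            ch.extend(c2)
            if len(c2) != 5:
                ch.insert(0, 0)              # pad: stage outputs land at 1-4
        else:
            ch.append(c2)
    return indices, ch


entries = [("repvit_m0_6", True, None), ("SPPF", False, 256), ("C2PSA", False, 256)]
indices, ch = assign_indices(entries, [24, 40, 80, 160])
print(indices)  # {'repvit_m0_6': 4, 'SPPF': 5, 'C2PSA': 6}
print(ch)       # [0, 24, 40, 80, 160, 256, 256]
```

This is why a single backbone entry in the yaml occupies layer indices 0 to 4, and why the head can refer to layers 2 and 3 to fetch the P3 and P4 feature maps.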

4.8 Modification 8

This change differs from the previous ones: it modifies the forward pass and is no longer inside the parse_model method.

The line numbers are visible in the screenshot, and everything is still in the same tasks.py file. Several forward-pass methods in this file look very similar, so make sure you edit the right one, which sits at roughly line 70 or so. The complete code is given below and can be pasted directly; I plan to record a video on this part later.

The code is as follows:

    def _predict_once(self, x, profile=False, visualize=False, embed=None):
        """
        Perform a forward pass through the network.

        Args:
            x (torch.Tensor): The input tensor to the model.
            profile (bool):  Print the computation time of each layer if True, defaults to False.
            visualize (bool): Save the feature maps of the model if True, defaults to False.
            embed (list, optional): A list of feature vectors/embeddings to return.

        Returns:
            (torch.Tensor): The last output of the model.
        """
        y, dt, embeddings = [], [], []  # outputs
        for m in self.model:
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            if profile:
                self._profile_one_layer(m, x, dt)
            if hasattr(m, 'backbone'):
                x = m(x)
                if len(x) != 5:  # 0 - 5
                    x.insert(0, None)
                for index, i in enumerate(x):
                    if index in self.save:
                        y.append(i)
                    else:
                        y.append(None)
                x = x[-1]  # pass the last output on to the next layer
            else:
                x = m(x)  # run
                y.append(x if m.i in self.save else None)  # save output
            if visualize:
                feature_visualization(x, m.type, m.i, save_dir=visualize)
            if embed and m.i in embed:
                embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
                if m.i == max(embed):
                    return torch.unbind(torch.cat(embeddings, 1), dim=0)
        return x
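The backbone branch of `_predict_once` is easier to follow on toy data. The sketch below mimics that branch with strings standing in for tensors. `run_backbone_step` is a hypothetical helper, and the `save` set is an assumption derived from the layer indices that the head of the yaml in Section 5.1 refers to.

```python
def run_backbone_step(features, save):
    """Mimic the backbone branch of the modified _predict_once.

    features: list returned by the backbone (four stage outputs).
    save: set of layer indices whose outputs later layers need.
    """
    x = list(features)
    if len(x) != 5:
        x.insert(0, None)            # pad so stage outputs sit at indices 1-4
    # Keep only the outputs that some later layer will read.
    y = [feat if idx in save else None for idx, feat in enumerate(x)]
    return x[-1], y                  # the last stage feeds the next layer


x, y = run_backbone_step(["s1", "s2", "s3", "s4"], save={2, 3, 6, 9, 12, 15, 18})
print(x)  # s4
print(y)  # [None, None, 's2', 's3', None]
```

Stage 2 and stage 3 outputs are kept at indices 2 and 3 because the head concatenates them as P3 and P4, while the stage 4 output is passed straight on to SPPF.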

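That completes the modifications, with nothing left out. Many details are involved, and mistakes here are hard to debug, so follow every step carefully.

Note: An Additional Modification

Most readers will already know this, since the steps are largely the same across these tutorials. This network needs one additional modification: locate the parameter s and set it to 640.

Fixing the FLOPs Printing Problem

Open 'ultralytics/utils/torch_utils.py' and modify it as shown in the figure. Without this change, the computational cost (GFLOPs) may fail to print.

Notes

If you hit an image-size mismatch error during validation, make sure the validation images are not loaded in rectangular mode. The steps are:

In ultralytics/models/yolo/detect/train.py, the build_dataset function of the DetectionTrainer class passes the argument rect=mode == 'val'; change it to rect=False.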

5. RepViT yaml File

Copy the yaml file below to run the model.

5.1 RepViT yaml File, Version 1

Training info for this version: YOLO11-RepViT summary: 559 layers, 2,118,115 parameters, 2,118,099 gradients, 5.4 GFLOPs

Usage note: in an entry such as [-1, 1, repvit_m0_6, [0.25]], the value 0.25 is the channel scaling factor. Use 0.25 for YOLOv11n, 0.5 for YOLOv11s, 1 for YOLOv11m, 1 for YOLOv11l, and 1.5 for YOLOv11x, matching the YOLO scale you actually train.

# Supported variants: __all__ = ['repvit_m0_6','repvit_m0_9', 'repvit_m1_0', 'repvit_m1_1', 'repvit_m1_5', 'repvit_m2_3']
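As a small convenience, the scale-to-factor mapping described in this section can be kept in one place. This is only a sketch: `SCALE_TO_FACTOR` and `backbone_entry` are hypothetical names, and the string it builds is the first backbone line of the yaml, not an Ultralytics API.

```python
# Channel factor per YOLOv11 scale, as listed in this section.
SCALE_TO_FACTOR = {"n": 0.25, "s": 0.5, "m": 1.0, "l": 1.0, "x": 1.5}


def backbone_entry(variant, scale):
    """Build the first backbone line of the yaml for a variant and a scale."""
    return f"- [-1, 1, {variant}, [{SCALE_TO_FACTOR[scale]}]]"


print(backbone_entry("repvit_m0_6", "n"))  # - [-1, 1, repvit_m0_6, [0.25]]
```

Generating the line this way makes it harder to train one scale with the factor intended for another.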

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# In an entry such as [-1, 1, repvit_m0_6, [0.25]], 0.25 is the channel scaling factor:
# 0.25 for YOLOv11n, 0.5 for YOLOv11s, 1 for YOLOv11m, 1 for YOLOv11l, 1.5 for YOLOv11x. Set it to match the YOLO scale you train.
# Supported variants: __all__ = ['repvit_m0_6','repvit_m0_9', 'repvit_m1_0', 'repvit_m1_1', 'repvit_m1_5', 'repvit_m2_3']
# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, repvit_m0_6, [0.5]] # 0-4 P1/2 this single entry covers four layers; do not let the yaml layout limit your thinking
  - [-1, 1, SPPF, [1024, 5]] # 5
  - [-1, 2, C2PSA, [1024]] # 6

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 3], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 9

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 12 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 15 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 6], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 18 (P5/32-large)

  - [[12, 15, 18], 1, Detect, [nc]] # Detect(P3, P4, P5)

5.2 Training Script

import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    # Placeholder path: point this at the RepViT yaml file from Section 5.1.
    model = YOLO('path/to/repvit-model.yaml')
    # model.load('yolov8n.pt') # loading pretrain weights
    model.train(data=r'path/to/dataset.yaml',  # placeholder: replace with your dataset yaml path
                # for other tasks, edit `task` in 'ultralytics/cfg/default.yaml' (detect, segment, classify, pose)
                cache=False,
                imgsz=640,
                epochs=150,
                single_cls=False,  # whether to train as single-class detection
                batch=4,
                close_mosaic=10,
                workers=0,
                device='0',
                optimizer='SGD',  # using SGD
                # resume='',  # to resume training, set this to the path of last.pt
                amp=False,  # disable AMP if the training loss becomes NaN
                project='runs/train',
                name='exp',
                )

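6. Successful Run Record

Below is a screenshot of a successful run. One epoch of training had completed; the image was too large to capture the second epoch. After the modifications the printed output has a few minor issues, which do not affect functionality and will be fixed later.

7. Summary

That concludes the article. I recommend my YOLOv11 improvement column to readers who want to go further. The column currently averages a quality score of about 98, and I will keep reproducing papers from the latest top conferences and adding classic improvement mechanisms. The column is free to read for the time being, so follow it soon. If this article helped you, please subscribe to the column for more updates.

Column review: YOLOv11 Improvement Series, a continuously updated column covering top-conference work; essential reading for object detection researchers.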
