MobileViT Network Test
MobileViT is a lightweight vision model that combines CNNs and Transformers. It can run in real time on device and is applicable to classification, detection, and segmentation. Its core idea is to pair a CNN's strength at extracting local features with a Transformer's ability to model long-range dependencies. On ImageNet-1k it performs strongly: across its XXS, XS, and S sizes it offers better accuracy per parameter than MobileNet- and ViT-family models. For example, MobileNetV2 with 3.5M parameters reaches 73.3% Top-1 on ImageNet-1k, while MobileViT-XS reaches 74.8% with only 2.3M parameters. MobileViT also does well on detection and segmentation; on COCO object detection it is about 5.7 points more accurate than MobileNetv3 at a similar parameter count. Its inference latency is still higher than that of comparable lightweight CNNs, but its small size and accuracy make it an attractive choice for mobile deployment.
mobilevit-pytorch/mobilevit/mobilevit.py at master · chinhsuanwu/mobilevit-pytorch · GitHub
MobileFormer is a different design from the one above:
https://github.com/slwang9353/MobileFormer
MobileViT: a lightweight, real-time Transformer-style model
This paper is the first to report an on-device, real-time CNN+Transformer hybrid, and it is not limited to classification: detection and segmentation are covered as well, making it one of the better recent works. Its inference speed, however, still trails traditional CNNs: MobileNetV2 with 3.5M parameters runs in 0.92 ms at 73.3% accuracy, while MobileViT with 2.3M parameters takes 7.28 ms for a slightly higher 74.8%. Unfortunately, the official MobileViT source code had not been released, so its accuracy and speed could not be tried firsthand.
Experiments show that MobileViT clearly outperforms CNN-based and ViT-based networks across a range of tasks and datasets.
On the ImageNet-1k dataset, MobileViT reaches 78.4% Top-1 accuracy with roughly 6 million parameters, about 3.2% higher than MobileNetv3 and 6.2% higher than DeiT at a comparable parameter count.
With pretrained weights:
GitHub - wilile26811249/MobileViT: An unofficial PyTorch implementation of MobileViT, developed based on the paper "MobileViT: Lightweight, versatile, and mobile-friendly Vision Transformer".
Model file is 43 MB:
| Model | Dataset | Learning rate | LR schedule | Optimizer | Weight decay | Top-1 Acc. | Top-5 Acc. |
|---|---|---|---|---|---|---|---|
| MobileViT | ImageNet-1k | 0.05 | Cosine LR | SGDM | 1e-5 | 61.918% | 83.05% |
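A minimal sketch of how a checkpoint like that could be loaded into the MobileViT implementation below; the filename mobilevit_s_imagenet.pth is hypothetical, and this assumes the file stores a plain state_dict whose keys match this code:

import torch

model = MobileViT_S(img_size=256, num_classes=1000)                      # constructor defined later in this post
state_dict = torch.load("mobilevit_s_imagenet.pth", map_location="cpu")  # hypothetical local checkpoint path
model.load_state_dict(state_dict)
model.eval()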
A community implementation:
https://github.com/Xingyu-Romantic/MobileViT_Openpose
On MS-COCO object detection, MobileViT is about 5.7 percentage points more accurate than MobileNetv3 at a similar parameter count.
Model file sizes:
MobileViT-S: 19 MB
MobileViT-XS: 9.21 MB
MobileViT-XXS: 5.19 MB
Apple releases MobileViT: a new vision Transformer that is both lightweight and mobile-friendly
import torch
import torch.nn as nn
from typing import Callable, Any, Optional, List
from einops import rearrange
class ConvNormAct(nn.Module):
def __init__(self,
in_channels: int,
out_channels: int,
kernel_size: int = 3,
stride = 1,
padding: Optional[int] = None,
groups: int = 1,
norm_layer: Optional[Callable[..., torch.nn.Module]] = torch.nn.BatchNorm2d,
activation_layer: Optional[Callable[..., torch.nn.Module]] = torch.nn.SiLU,
dilation: int = 1
):
super(ConvNormAct, self).__init__()
if padding is None:
padding = (kernel_size - 1) // 2 * dilation
self.conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding,
dilation = dilation, groups = groups, bias = norm_layer is None)
        self.norm_layer = norm_layer(out_channels) if norm_layer is not None else None
        self.act = activation_layer() if activation_layer is not None else None
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.conv(x)
if self.norm_layer is not None:
x = self.norm_layer(x)
if self.act is not None:
x = self.act(x)
return x
class PreNorm(nn.Module):
def __init__(self, dim, fn):
super().__init__()
self.norm = nn.LayerNorm(dim)
self.fn = fn
def forward(self, x, **kwargs):
return self.fn(self.norm(x), **kwargs)
class FFN(nn.Module):
def __init__(self, dim, hidden_dim, dropout=0.):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, hidden_dim),
nn.SiLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, dim),
nn.Dropout(dropout)
)
def forward(self, x):
return self.net(x)
class MultiHeadSelfAttention(nn.Module):
"""
    Multi-head self-attention layer, implemented with einops-style tensor rearrangement.
Paper: https://arxiv.org/abs/1706.03762
Blog: https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
Parameters
----------
dim:
Token's dimension, EX: word embedding vector size
num_heads:
The number of distinct representations to learn
dim_head:
        The dimension of each head
"""
def __init__(self, dim, num_heads = 8, dim_head = None):
super(MultiHeadSelfAttention, self).__init__()
self.num_heads = num_heads
self.dim_head = int(dim / num_heads) if dim_head is None else dim_head
_weight_dim = self.num_heads * self.dim_head
self.to_qvk = nn.Linear(dim, _weight_dim * 3, bias = False)
        self.scale_factor = dim ** -0.5
# Weight matrix for output, Size: num_heads*dim_head X dim
# Final linear transformation layer
self.w_out = nn.Linear(_weight_dim, dim, bias = False)
def forward(self, x):
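        # x comes from MobileVitBlock as an unfolded feature map of shape
        # (batch, patch_area, num_patches, dim); attention is computed along the
        # num_patches axis independently for each pixel position within a patch.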
qkv = self.to_qvk(x).chunk(3, dim = -1)
q, k, v = map(lambda t: rearrange(t, 'b p n (h d) -> b p h n d', h = self.num_heads), qkv)
dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale_factor
attn = torch.softmax(dots, dim = -1)
out = torch.matmul(attn, v)
out = rearrange(out, 'b p h n d -> b p n (h d)')
return self.w_out(out)
class Transformer(nn.Module):
def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.1):
super().__init__()
self.layers = nn.ModuleList([])
for _ in range(depth):
self.layers.append(nn.ModuleList([
PreNorm(dim, MultiHeadSelfAttention(dim, heads, dim_head)),
PreNorm(dim, FFN(dim, mlp_dim, dropout))
]))
def forward(self, x):
for attn, ff in self.layers:
x = attn(x) + x
x = ff(x) + x
return x
class InvertedResidual(nn.Module):
"""
MobileNetv2 InvertedResidual block
"""
def __init__(self, in_channels, out_channels, stride = 1, expand_ratio = 2, act_layer = nn.SiLU):
super(InvertedResidual, self).__init__()
self.stride = stride
self.use_res_connect = self.stride == 1 and in_channels == out_channels
hidden_dim = int(round(in_channels * expand_ratio))
layers = []
if expand_ratio != 1:
layers.append(ConvNormAct(in_channels, hidden_dim, kernel_size = 1, activation_layer = None))
# Depth-wise convolution
layers.append(
ConvNormAct(hidden_dim, hidden_dim, kernel_size = 3, stride = stride,
padding = 1, groups = hidden_dim, activation_layer = act_layer)
)
# Point-wise convolution
layers.append(
nn.Conv2d(hidden_dim, out_channels, kernel_size = 1, stride = 1, bias = False)
)
layers.append(nn.BatchNorm2d(out_channels))
self.conv = nn.Sequential(*layers)
def forward(self, x):
if self.use_res_connect:
return x + self.conv(x)
else:
return self.conv(x)
class MobileVitBlock(nn.Module):
def __init__(self, in_channels, out_channels, d_model, layers, mlp_dim):
super(MobileVitBlock, self).__init__()
# Local representation
self.local_representation = nn.Sequential(
# Encode local spatial information
ConvNormAct(in_channels, in_channels, 3),
            # Project the tensor to a higher-dimensional space
ConvNormAct(in_channels, d_model, 1)
)
self.transformer = Transformer(d_model, layers, 1, 32, mlp_dim, 0.1)
# Fusion block
self.fusion_block1 = nn.Conv2d(d_model, in_channels, kernel_size = 1)
self.fusion_block2 = nn.Conv2d(in_channels * 2, out_channels, 3, padding = 1)
def forward(self, x):
local_repr = self.local_representation(x)
        # Global representation: unfold into 2x2 patches, model inter-patch
        # relationships with the Transformer, then fold back to a feature map
_, _, h, w = local_repr.shape
global_repr = rearrange(local_repr, 'b d (h ph) (w pw) -> b (ph pw) (h w) d', ph=2, pw=2)
global_repr = self.transformer(global_repr)
global_repr = rearrange(global_repr, 'b (ph pw) (h w) d -> b d (h ph) (w pw)', h=h//2, w=w//2, ph=2, pw=2)
        # Fuse the local and global features via concatenation
fuse_repr = self.fusion_block1(global_repr)
result = self.fusion_block2(torch.cat([x, fuse_repr], dim = 1))
return result
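# Per-variant hyper-parameters for the XXS / XS / S models:
#   features        - channel widths of the stem, the MV2 stages, and the final 1x1 conv
#   d               - Transformer embedding dims of the three MobileViT blocks
#   expansion_ratio - expansion factor of the MV2 inverted-residual blocks
#   layers          - Transformer depth inside each MobileViT block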
model_cfg = {
"xxs":{
"features": [16, 16, 24, 24, 48, 48, 64, 64, 80, 80, 320],
"d": [64, 80, 96],
"expansion_ratio": 2,
"layers": [2, 4, 3]
},
"xs":{
"features": [16, 32, 48, 48, 64, 64, 80, 80, 96, 96, 384],
"d": [96, 120, 144],
"expansion_ratio": 4,
"layers": [2, 4, 3]
},
"s":{
"features": [16, 32, 64, 64, 96, 96, 128, 128, 160, 160, 640],
"d": [144, 192, 240],
"expansion_ratio": 4,
"layers": [2, 4, 3]
},
}
class MobileViT(nn.Module):
def __init__(self, img_size, features_list, d_list, transformer_depth, expansion, num_classes = 1000):
super(MobileViT, self).__init__()
self.stem = nn.Sequential(
nn.Conv2d(in_channels = 3, out_channels = features_list[0], kernel_size = 3, stride = 2, padding = 1),
InvertedResidual(in_channels = features_list[0], out_channels = features_list[1], stride = 1, expand_ratio = expansion),
)
self.stage1 = nn.Sequential(
InvertedResidual(in_channels = features_list[1], out_channels = features_list[2], stride = 2, expand_ratio = expansion),
InvertedResidual(in_channels = features_list[2], out_channels = features_list[2], stride = 1, expand_ratio = expansion),
InvertedResidual(in_channels = features_list[2], out_channels = features_list[3], stride = 1, expand_ratio = expansion)
)
self.stage2 = nn.Sequential(
InvertedResidual(in_channels = features_list[3], out_channels = features_list[4], stride = 2, expand_ratio = expansion),
MobileVitBlock(in_channels = features_list[4], out_channels = features_list[5], d_model = d_list[0],
layers = transformer_depth[0], mlp_dim = d_list[0] * 2)
)
self.stage3 = nn.Sequential(
InvertedResidual(in_channels = features_list[5], out_channels = features_list[6], stride = 2, expand_ratio = expansion),
MobileVitBlock(in_channels = features_list[6], out_channels = features_list[7], d_model = d_list[1],
layers = transformer_depth[1], mlp_dim = d_list[1] * 4)
)
self.stage4 = nn.Sequential(
InvertedResidual(in_channels = features_list[7], out_channels = features_list[8], stride = 2, expand_ratio = expansion),
MobileVitBlock(in_channels = features_list[8], out_channels = features_list[9], d_model = d_list[2],
layers = transformer_depth[2], mlp_dim = d_list[2] * 4),
nn.Conv2d(in_channels = features_list[9], out_channels = features_list[10], kernel_size = 1, stride = 1, padding = 0)
)
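        # Classification head: global average pooling over the final feature map
        # (img_size // 32 after five stride-2 stages), followed by a linear classifier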
self.avgpool = nn.AvgPool2d(kernel_size = img_size // 32)
self.fc = nn.Linear(features_list[10], num_classes)
def forward(self, x):
# Stem
x = self.stem(x)
# Body
x = self.stage1(x)
x = self.stage2(x)
x = self.stage3(x)
x = self.stage4(x)
# Head
x = self.avgpool(x)
x = x.view(x.size(0), -1)
x = self.fc(x)
return x
def MobileViT_XXS(img_size = 256, num_classes = 1000):
cfg_xxs = model_cfg["xxs"]
model_xxs = MobileViT(img_size, cfg_xxs["features"], cfg_xxs["d"], cfg_xxs["layers"], cfg_xxs["expansion_ratio"], num_classes)
return model_xxs
def MobileViT_XS(img_size = 256, num_classes = 1000):
cfg_xs = model_cfg["xs"]
model_xs = MobileViT(img_size, cfg_xs["features"], cfg_xs["d"], cfg_xs["layers"], cfg_xs["expansion_ratio"], num_classes)
return model_xs
def MobileViT_S(img_size = 256, num_classes = 1000):
cfg_s = model_cfg["s"]
model_s = MobileViT(img_size, cfg_s["features"], cfg_s["d"], cfg_s["layers"], cfg_s["expansion_ratio"], num_classes)
return model_s
if __name__ == "__main__":
img = torch.randn(1, 3, 256, 256)
cfg_xxs = model_cfg["xxs"]
model_xxs = MobileViT(256, cfg_xxs["features"], cfg_xxs["d"], cfg_xxs["layers"], cfg_xxs["expansion_ratio"])
cfg_xs = model_cfg["xs"]
model_xs = MobileViT(256, cfg_xs["features"], cfg_xs["d"], cfg_xs["layers"], cfg_xs["expansion_ratio"])
cfg_s = model_cfg["s"]
model_s = MobileViT(256, cfg_s["features"], cfg_s["d"], cfg_s["layers"], cfg_s["expansion_ratio"])
model_s.eval()
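    # Added sanity check: push the random input defined above through MobileViT-S
    # and confirm the head produces one logit per ImageNet-1k class.
    with torch.no_grad():
        out = model_s(img)
    print("output shape:", out.shape)  # expected: torch.Size([1, 1000])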
model_path = "dicenet.pth"
torch.save(model_s.state_dict(), model_path)
import os
import time
fsize = os.path.getsize(model_path)
fsize = fsize / float(1024 * 1024)
print(f"model size {round(fsize, 2)} m")
    # Parameter counts reported in the paper: XXS 1.3M, XS 2.3M, S 5.6M
    print("XXS params (M):", round(sum(p.numel() for p in model_xxs.parameters()) / 1e6, 2))
    print(" XS params (M):", round(sum(p.numel() for p in model_xs.parameters()) / 1e6, 2))
    print("  S params (M):", round(sum(p.numel() for p in model_s.parameters()) / 1e6, 2))
