Paper Reading: Vision Transformer (ViT)

Table of contents

1 Abstract
- 1.1 Core idea
2 Model architecture
- 2.1 Overview
- 2.2 CV-specific modifications and related notes
3 Code
4 Summary

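1 Abstract

1.1 Core idea

Split the image into patches, encode each patch with a linear layer to obtain token features, and use the encoder part of the Transformer architecture to classify the image.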

2 Model architecture

2.1 Overview

2.2 CV-specific modifications and related notes

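Problems addressed:

Transformer input length is limited: self-attention compares every token with every other token, so its cost grows as O(n²) in the sequence length n. Treating every pixel as a token is therefore impractical (a 224x224 image would already give 50,176 tokens), whereas NLP models typically work with sequences of a few hundred tokens, e.g. 512 in BERT.
Possible solutions: a) split the image into small patches and encode each patch; b) feed feature maps instead of raw pixels; c) split the input into slices and encode them block by block.

Transformers carry no built-in prior knowledge (inductive bias): convolution has translation equivariance, often loosely called translation invariance (the same feature under the same kernel gives the same output), and locality (neighbouring features are similar and so give similar outputs).
A Transformer has no notion of convolution and relies on its overall encoder-decoder structure instead, so these properties have to be learned from scratch, which usually takes a large amount of training data.
Specialised self-attention variants are usually cumbersome to implement:
Solution: use the standard Transformer block as-is, with no elaborate modification or tuning.
Classification head:
Solution: reuse the Transformer's cls token directly.
Position encoding:
Solution: 1D encoding.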
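To make the O(n²) argument concrete, here is a minimal sketch that counts the entries of one self-attention map when tokens are pixels versus patches. It assumes the 224x224 input and 16x16 patch setting used in this post; `attention_matrix_entries` is a hypothetical helper named only for this illustration, and it counts pairwise scores only, not full FLOPs.

```python
def attention_matrix_entries(num_tokens: int) -> int:
    """Number of pairwise scores in one self-attention map (n x n)."""
    return num_tokens * num_tokens

image_size = 224   # assumed input resolution
patch_size = 16    # assumed patch side length

pixel_tokens = image_size * image_size            # one token per pixel
patch_tokens = (image_size // patch_size) ** 2    # one token per patch

pixel_cost = attention_matrix_entries(pixel_tokens)
patch_cost = attention_matrix_entries(patch_tokens)

print(pixel_tokens, patch_tokens)   # 50176 196
print(pixel_cost // patch_cost)     # 65536
```

Cutting the sequence length by a factor of 256 cuts the attention map by a factor of 256² = 65,536, which is why patch-level tokens make a plain Transformer encoder affordable.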

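pipeline: a 224x224 input is split into 16x16 patches, and each flattened patch passes through a linear projection to become a token. A single [CLS] token is prepended to the sequence (not attached to every token), and position embeddings are added. The resulting tokens are fed into the Encoder, which stacks L Transformer blocks (self-attention plus MLP). The Encoder output at the [CLS] position is passed to the classification head to predict the class.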
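The pipeline can be traced as plain shape arithmetic. The sketch below is only an illustration: `vit_token_shapes` is a hypothetical helper, shapes omit the batch dimension, and `dim=768` is an assumed embedding width (the value used by ViT-Base), not something fixed by the description above.

```python
def vit_token_shapes(image_size=224, patch_size=16, channels=3, dim=768):
    """Trace tensor shapes (without batch) through the ViT input pipeline."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    grid = image_size // patch_size                    # patches per side
    num_patches = grid * grid
    patch_dim = channels * patch_size * patch_size     # length of one flattened patch
    return {
        "flattened_patches": (num_patches, patch_dim),
        "patch_tokens": (num_patches, dim),            # after the linear projection
        "with_cls_token": (num_patches + 1, dim),      # one [CLS] token prepended
        "pos_embedding": (num_patches + 1, dim),       # added element-wise
        "cls_output": (dim,),                          # encoder output at [CLS]
    }

shapes = vit_token_shapes()
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```

With the defaults, the encoder sees 197 tokens: 14 x 14 = 196 patches plus the single [CLS] token.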

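3 Code

Once you understand the Transformer itself, this architecture is very easy to follow. The paper is a pioneering work and the architecture is fairly basic, so the implementation details are worth a closer look.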

    import torch
    from torch import nn

    from einops import rearrange, repeat
    from einops.layers.torch import Rearrange

    # Shape comments below assume batch size 1, a 256x256 RGB image, 32x32 patches,
    # dim = 1024, heads = 16 and dim_head = 64.

    # helpers

    def pair(t):
        # Accept either a single int or a (height, width) tuple.
        return t if isinstance(t, tuple) else (t, t)

    # classes

    class FeedForward(nn.Module):
        def __init__(self, dim, hidden_dim, dropout = 0.):
            super().__init__()
            self.net = nn.Sequential(
                nn.LayerNorm(dim),
                nn.Linear(dim, hidden_dim),
                nn.GELU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, dim),
                nn.Dropout(dropout)
            )

        def forward(self, x):
            return self.net(x)

    class Attention(nn.Module):
        def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
            super().__init__()
            inner_dim = dim_head * heads
            project_out = not (heads == 1 and dim_head == dim)

            self.heads = heads
            self.scale = dim_head ** -0.5

            self.norm = nn.LayerNorm(dim)

            self.attend = nn.Softmax(dim = -1)
            self.dropout = nn.Dropout(dropout)

            # A single projection produces q, k and v together: Linear(1024, 3072)
            self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

            self.to_out = nn.Sequential(
                nn.Linear(inner_dim, dim),
                nn.Dropout(dropout)
            ) if project_out else nn.Identity()

        def forward(self, x):
            # x: [1, 65, 1024]
            x = self.norm(x)
            # to_qkv(x): [1, 65, 3072]; chunk returns a tuple of 3 tensors, each [1, 65, 1024]
            qkv = self.to_qkv(x).chunk(3, dim = -1)
            # q, k, v: [1, 65, 1024] -> [1, 16, 65, 64]
            # The 65 tokens of width 1024 are split into `heads` groups of 65 tokens with
            # d = 64 dims each, so every head works on its own slice of the features and
            # can learn to attend to different things.
            q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
            # [1, 16, 65, 64] @ [1, 16, 64, 65] -> [1, 16, 65, 65]
            # The scale keeps the values from growing too large before the softmax.
            dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

            # softmax over the last dim: [1, 16, 65, 65]
            attn = self.attend(dots)
            # dropout: [1, 16, 65, 65]
            attn = self.dropout(attn)

            # weighted sum of values: [1, 16, 65, 64]
            out = torch.matmul(attn, v)
            # merge the heads back: [1, 65, 1024]
            out = rearrange(out, 'b h n d -> b n (h d)')
            # output projection: [1, 65, 1024]
            return self.to_out(out)

    class Transformer(nn.Module):
        def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.layers = nn.ModuleList([])
            for _ in range(depth):
                self.layers.append(nn.ModuleList([
                    Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout),
                    FeedForward(dim, mlp_dim, dropout = dropout)
                ]))

        def forward(self, x):
            # x stays [1, 65, 1024] throughout; each sub-layer is wrapped in a residual.
            for attn, ff in self.layers:
                x = attn(x) + x
                x = ff(x) + x

            # Final LayerNorm; the shape does not change.
            return self.norm(x)

    class ViT(nn.Module):
        def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
            super().__init__()
            image_height, image_width = pair(image_size)
            patch_height, patch_width = pair(patch_size)

            assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

            num_patches = (image_height // patch_height) * (image_width // patch_width)
            patch_dim = channels * patch_height * patch_width
            assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

            # num_patches   64
            # patch_dim     3072
            # dim           1024
            self.to_patch_embedding = nn.Sequential(
                # Rearrange is a layer from einops, a library for flexible, readable and
                # reliable tensor operations (supports numpy, pytorch, tensorflow, etc.).
                # Here it views the image (c, H, W) as (c, (h p1), (w p2)), i.e. H = h * p1
                # and W = w * p2, then reshapes to b (h w) (p1 p2 c): the image is cut into
                # patches and each patch is flattened, ready for the linear layer.
                # An alternative is a conv layer with kernel 16x16 and stride 16 (for 16x16
                # patches) that extracts each patch, followed by a flatten.
                Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
                nn.LayerNorm(patch_dim),
                nn.Linear(patch_dim, dim),
                nn.LayerNorm(dim),
            )

            self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
            self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
            self.dropout = nn.Dropout(emb_dropout)

            self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

            self.pool = pool
            self.to_latent = nn.Identity()

            self.mlp_head = nn.Linear(dim, num_classes)

        def forward(self, img):
            # 1. input img:                          [1, 3, 256, 256]
            # 2. patch embedding:                    [1, 64, 1024]
            x = self.to_patch_embedding(img)
            b, n, _ = x.shape
            # 3. cls token repeated over the batch:  [1, 1, 1024]
            cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
            # 4. cat [cls_tokens, x]:                [1, 65, 1024]
            x = torch.cat((cls_tokens, x), dim=1)
            # 5. add the position embedding:         [1, 65, 1024]
            x += self.pos_embedding[:, :(n + 1)]
            # 6. dropout:                            [1, 65, 1024]
            x = self.dropout(x)
            # 7. depth x transformer blocks:         [1, 65, 1024]
            x = self.transformer(x)
            # 8. pool (mean over tokens, or take the cls token): [1, 1024]
            x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
            # 9. nn.Identity() is a placeholder that leaves x unchanged: [1, 1024]
            x = self.to_latent(x)
            # 10. classification head:               [1, num_classes]
            return self.mlp_head(x)
    
    
    
    
![](https://ad.itadn.com/c/weblog/blog-img/images/2025-08-19/Gja4xVWwNHEhC8beRdYDyB1Afoc3.png)
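To see what the `Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)')` pattern in `to_patch_embedding` does to the pixel order, here is a dependency-free sketch for a single image. `patchify` is a hypothetical helper written only for illustration; it mimics the index order of the einops pattern with plain Python loops and is not how einops is implemented.

```python
def patchify(image, patch_size):
    """image: nested list indexed [c][y][x]; returns a list of flattened patches.

    Mirrors the einops pattern 'c (h p1) (w p2) -> (h w) (p1 p2 c)' for one image.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for h in range(height // patch_size):          # patch row
        for w in range(width // patch_size):       # patch column
            flat = []
            for p1 in range(patch_size):           # row inside the patch
                for p2 in range(patch_size):       # column inside the patch
                    for c in range(channels):      # channel varies fastest
                        flat.append(image[c][h * patch_size + p1][w * patch_size + p2])
            patches.append(flat)
    return patches

# Toy 1-channel 4x4 image whose pixel values are 0..15 in row-major order.
image = [[[y * 4 + x for x in range(4)] for y in range(4)]]
patches = patchify(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 4
print(patches[0])                     # [0, 1, 4, 5]
print(patches[1])                     # [2, 3, 6, 7]
```

Each output row is one patch, read left to right and top to bottom, which is exactly the token sequence the linear layer receives.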

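4 Summary

Reading the code corrected a misunderstanding I had about multi-head attention.
I used to think that Q, K and V were duplicated into N identical copies, each of which then went through its own linear layers.
In the code, Q, K and V are produced together by a single linear projection and then split into N independent heads with rearrange (a reshape plus permute); after the attention f(Q, K) is applied within each head, the heads are merged back into one tensor.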
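The split-and-merge step can be shown without any tensor library. The following is a minimal sketch using nested lists; `split_heads` and `merge_heads` are hypothetical names that mimic the `'n (h d) -> h n d'` and `'h n d -> n (h d)'` patterns from the `Attention` class (batch dimension omitted).

```python
def split_heads(tokens, heads):
    """tokens: [n][heads * d] -> [heads][n][d], like 'n (h d) -> h n d'."""
    width = len(tokens[0])
    assert width % heads == 0, "feature width must be divisible by heads"
    d = width // heads
    # Head h receives the contiguous slice [h*d, (h+1)*d) of every token.
    return [[tok[h * d:(h + 1) * d] for tok in tokens] for h in range(heads)]

def merge_heads(per_head):
    """[heads][n][d] -> [n][heads * d], like 'h n d -> n (h d)'."""
    n = len(per_head[0])
    return [[v for head in per_head for v in head[i]] for i in range(n)]

# Two tokens of width 6, split across 3 heads of width 2.
q = [[0, 1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10, 11]]
q_heads = split_heads(q, heads=3)
print(q_heads[0])   # [[0, 1], [6, 7]]
print(q_heads[2])   # [[4, 5], [10, 11]]
assert merge_heads(q_heads) == q
```

No values are copied N times: each head only sees a different slice of the same projected features, and merging the heads restores the original layout.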
