Paper Reading: Vision Transformer (ViT)
Contents
- 1 Abstract
  - 1.1 Core idea
- 2 Model architecture
  - 2.1 Overview
  - 2.2 Changes specific to CV and how to read them
- 3 Code
- 4 Summary
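1 Abstract
1.1 Core idea
The image is split into patches, each patch is encoded by a linear layer into a token feature, and the encoder of the transformer architecture is then used for image classification.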
2 Model architecture
2.1 Overview

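2.2 Changes specific to CV and how to read them
Problems addressed:
- Transformer input length is limited: self-attention costs O(n²) in the sequence length n, so an image cannot simply be flattened into a sequence of pixels. A 224×224 image would already give 50,176 tokens, while NLP transformers typically handle sequences of a few hundred tokens (512 in BERT).
  Possible solutions: a) split the image into patches and encode each patch; b) encode CNN feature maps instead of raw pixels; c) slice the image and encode it block by block.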
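- The transformer lacks built-in priors (inductive biases): convolution brings translation equivariance, often loosely called translation invariance (the same feature under the same kernel gives the same output wherever it appears), and locality (neighbouring features are similar and so give similar outputs). The transformer has no convolution and relies on global self-attention instead, so these properties have to be learned from scratch, which usually takes a large amount of training data.
- Specialised self-attention variants tend to be cumbersome to implement.
  Solution: use the standard transformer block as is, without elaborate modification or tuning.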
- Classification head:
  Solution: reuse the transformer's [CLS] token directly.
- Position encoding:
  Solution: 1D position embeddings.
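Pipeline: a 224×224 input is split into 16×16 patches, giving 14×14 = 196 patches. Each patch is flattened and linearly projected into a token, a single [CLS] token is prepended to the sequence, and position embeddings are added. The resulting 197 tokens are fed into the encoder, which stacks L self-attention blocks. The output at the [CLS] position is passed to the classification head to produce the predicted class.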
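The token counts in the pipeline above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `vit_token_shapes` is made up for illustration and is not part of the post's code, and it assumes square images and square patches.

```python
def vit_token_shapes(image_size: int, patch_size: int, channels: int = 3):
    """Return (num_patches, patch_dim, seq_len) for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = channels * patch_size * patch_size  # length of one flattened patch
    seq_len = num_patches + 1                       # one extra [CLS] token
    return num_patches, patch_dim, seq_len


num_patches, patch_dim, seq_len = vit_token_shapes(224, 16)
pixel_tokens = 224 * 224  # sequence length if every pixel were a token

print(num_patches, patch_dim, seq_len)  # 196 768 197
print(pixel_tokens ** 2)                # attention matrix entries per head with pixel tokens
print(seq_len ** 2)                     # attention matrix entries per head with patch tokens
```

With pixels as tokens the attention matrix has about 2.5 billion entries per head; with 16×16 patches it has 38,809, which is why patching makes the O(n²) cost manageable.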
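3 Code
Once you understand how the transformer works, this architecture is very easy to follow. The paper is regarded as pioneering work, and because the architecture is fairly basic, the implementation details are worth walking through.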
import torch
from torch import nn
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    # Shape comments below assume dim = 1024, heads = 16, dim_head = 64 and 65 tokens.
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.norm = nn.LayerNorm(dim)
        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)

        # Linear(1024, 3072): one projection produces q, k and v together
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        # x: [1, 65, 1024]
        x = self.norm(x)
        # self.to_qkv(x): [1, 65, 3072]
        # .chunk(3, dim = -1): tuple of 3 tensors, each [1, 65, 1024]
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        # q, k, v: [1, 65, 1024] -> [1, 16, 65, 64]
        # The 65 features of size 1024 are split into `heads` groups of 65 features of size d.
        # Each head works on its own slice of the hidden dimension, so different heads can learn different features.
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        # [1, 16, 65, 64] x [1, 16, 64, 65] -> [1, 16, 65, 65]
        # The scale keeps the values from growing too large before the softmax.
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        # softmax: [1, 16, 65, 65]
        attn = self.attend(dots)
        # dropout: [1, 16, 65, 65]
        attn = self.dropout(attn)

        # out: [1, 16, 65, 64]
        out = torch.matmul(attn, v)
        # merge heads: [1, 65, 1024]
        out = rearrange(out, 'b h n d -> b n (h d)')
        # output projection: [1, 65, 1024]
        return self.to_out(out)

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout),
                FeedForward(dim, mlp_dim, dropout = dropout)
            ]))

    def forward(self, x):
        # x: [1, 65, 1024]; the shape stays the same through every layer
        for attn, ff in self.layers:
            # attention with residual connection: [1, 65, 1024]
            x = attn(x) + x
            # feed-forward with residual connection: [1, 65, 1024]
            x = ff(x) + x
        # final LayerNorm: [1, 65, 1024]
        return self.norm(x)

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        # Values for the shape comments below (256x256 image, 32x32 patches):
        # num_patches 64
        # patch_dim 3072
        # dim 1024
        self.to_patch_embedding = nn.Sequential(
            # Rearrange is a layer from einops, a library for flexible, readable and reliable
            # tensor operations that supports numpy, pytorch, tensorflow and others.
            # Here it reads an input image such as (3, 224, 224) as (3, (h p1), (w p2)),
            # i.e. 224 = h * p1 and 224 = w * p2, and reshapes it to b (h w) (p1 p2 c).
            # This cuts the image into patches and flattens each patch, ready for the linear layer that follows.
            # An alternative is to extract each patch with a convolution with a 16*16 kernel and stride 16,
            # then flatten and feed the result to the linear layer.
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
            nn.LayerNorm(dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Linear(dim, num_classes)

    def forward(self, img):
        # 1. input image: [1, 3, 256, 256]
        x = self.to_patch_embedding(img)
        # 2. patch embeddings: [1, 64, 1024]
        b, n, _ = x.shape
        # 3. expand the cls token over the batch: [1, 1, 1024]
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
        # 4. prepend the cls token, cat([cls_tokens, x]): [1, 65, 1024]
        x = torch.cat((cls_tokens, x), dim=1)
        # 5. add the position embedding: [1, 65, 1024]
        x += self.pos_embedding[:, :(n + 1)]
        # 6. dropout: [1, 65, 1024]
        x = self.dropout(x)
        # 7. depth x transformer layers: [1, 65, 1024]
        x = self.transformer(x)
        # 8. take the cls token output, or the mean over tokens when pool == 'mean': [1, 1024]
        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
        # 9. nn.Identity() is a placeholder layer that leaves its input unchanged: [1, 1024]
        x = self.to_latent(x)
        # 10. classification head: [1, num_classes]
        return self.mlp_head(x)
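
The comment in the listing mentions a second way to build patch tokens: a convolution whose kernel size and stride both equal the patch size. The sketch below is not part of the original code; it uses small illustrative sizes, leaves out the two LayerNorm layers, and lets the convolution also play the role of the linear projection. Under those assumptions it checks that the two routes give the same tokens once they share weights.

```python
import torch
from torch import nn

torch.manual_seed(0)

patch, channels, dim = 8, 3, 16         # small illustrative sizes, not the post's settings
img = torch.randn(2, channels, 32, 32)  # [batch, channels, height, width]

# Route 1: cut the image into patches, flatten each one, then apply a linear layer.
b, c, height, width = img.shape
h, w = height // patch, width // patch
patches = img.reshape(b, c, h, patch, w, patch)
patches = patches.permute(0, 2, 4, 3, 5, 1)             # [b, h, w, p1, p2, c]
patches = patches.reshape(b, h * w, patch * patch * c)  # [2, 16, 192]
linear = nn.Linear(patch * patch * c, dim)

# Route 2: a convolution whose kernel size and stride both equal the patch size.
conv = nn.Conv2d(c, dim, kernel_size=patch, stride=patch)

# Copy the convolution weights into the linear layer so both routes compute the same map.
# The permute matches the (p1 p2 c) flattening order used for the patches above.
with torch.no_grad():
    linear.weight.copy_(conv.weight.permute(0, 2, 3, 1).reshape(dim, -1))
    linear.bias.copy_(conv.bias)

tokens_linear = linear(patches)                     # [2, 16, 16]
tokens_conv = conv(img).flatten(2).transpose(1, 2)  # [2, 16, 16]

print(tokens_linear.shape, tokens_conv.shape)
print(torch.allclose(tokens_linear, tokens_conv, atol=1e-4))
```

The reshape and permute in route 1 do by hand what the einops pattern 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)' does in the listing.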

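4 Summary
My earlier misunderstanding of multi-head attention has been corrected.
I used to think Q, K and V were duplicated into N identical copies, with the subsequent linear operations applied to each copy.
In the code, Q, K and V are instead produced together by a single linear layer, then split with chunk and rearrange into N independent heads; after the attention f(Q, K) is computed per head, the heads are merged back into one tensor.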
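
The point about heads being slices rather than copies can be checked directly. The sketch below is a standalone illustration, not the post's code: `split_heads` and `merge_heads` are hypothetical helpers that do with plain reshape and transpose what the einops patterns in the listing do, and the sizes follow the shape comments in section 3.

```python
import torch
from torch import nn

torch.manual_seed(0)

batch, tokens, heads, dim_head = 1, 65, 16, 64
inner_dim = heads * dim_head  # 1024
x = torch.randn(batch, tokens, inner_dim)

# A single linear layer produces q, k and v in one pass.
to_qkv = nn.Linear(inner_dim, inner_dim * 3, bias=False)
q, k, v = to_qkv(x).chunk(3, dim=-1)  # three tensors of [1, 65, 1024]

def split_heads(t):
    # [b, n, h*d] -> [b, h, n, d]: head i receives features i*d .. (i+1)*d - 1
    b, n, _ = t.shape
    return t.reshape(b, n, heads, dim_head).transpose(1, 2)

def merge_heads(t):
    # [b, h, n, d] -> [b, n, h*d]
    b, h, n, d = t.shape
    return t.transpose(1, 2).reshape(b, n, h * d)

qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
attn = torch.softmax(qh @ kh.transpose(-1, -2) * dim_head ** -0.5, dim=-1)  # [1, 16, 65, 65]
out = merge_heads(attn @ vh)                                               # [1, 65, 1024]

print(qh.shape, attn.shape, out.shape)
print(torch.equal(qh[0, 3], q[0, :, 3 * dim_head:4 * dim_head]))  # head 3 is a slice of q
print(torch.equal(merge_heads(qh), q))                            # split then merge round-trips
```

Each head sees a different 64-dimensional slice of the same projection, so the heads add no copies of Q, K or V and no extra parameters beyond the single to_qkv layer.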
