[Paper Notes] DeconvNet for Semantic Segmentation
Noh et al., 2015, "Learning Deconvolution Network for Semantic Segmentation" (Cite=4488)
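Background and introduction
The encoder is a pretrained VGG-16 convolutional network that performs feature extraction; the decoder upsamples with deconvolution (transposed convolution) and unpooling operations to recover the detail of the original image.
Candidate regions (object proposals generated with EdgeBox) are fed into the trained network one at a time; each region is processed independently, and the per-region results are combined into the segmentation of the whole image. This eases the difficulty caused by objects of very different sizes and improves on existing FCN-based methods.
This approach has a drawback: relying on region proposals makes the pipeline less automated. For that reason I would advise against using it.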
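The proposal-wise inference described above can be sketched as follows. This is a minimal illustration, not the authors' code: the boxes, the `aggregate_proposals` helper, and the constant score maps are hypothetical stand-ins for EdgeBox proposals and network outputs, and the sketch assumes the per-proposal score maps are non-negative and are merged onto a zero-initialized canvas with a pixel-wise maximum.

```python
import torch


def aggregate_proposals(score_maps, boxes, num_classes, height, width):
    """Paste per-proposal class score maps onto an image-sized canvas.

    score_maps: list of tensors, each of shape (num_classes, h_i, w_i)
    boxes:      list of (top, left, h_i, w_i) tuples, one per proposal
    Overlapping proposals are merged with a pixel-wise maximum.
    Assumes non-negative scores (e.g. class probabilities).
    """
    canvas = torch.zeros(num_classes, height, width)
    for scores, (top, left, h, w) in zip(score_maps, boxes):
        region = canvas[:, top:top + h, left:left + w]
        canvas[:, top:top + h, left:left + w] = torch.maximum(region, scores)
    return canvas


# Two hypothetical overlapping proposals on a 6 x 8 image with 3 classes.
boxes = [(0, 0, 4, 4), (2, 2, 4, 4)]
score_maps = [torch.full((3, 4, 4), 0.2), torch.full((3, 4, 4), 0.7)]

canvas = aggregate_proposals(score_maps, boxes, num_classes=3, height=6, width=8)
labels = canvas.argmax(dim=0)  # per-pixel class index, shape (6, 8)
```

In the overlap of the two boxes the larger score wins, which is why a pixel-wise maximum lets a tightly fitting proposal override a loose one.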
Critical limitations of FCN
- (1) The receptive field has a predefined, fixed size, so large objects can be segmented incorrectly. When the receptive field cannot cover the whole object, the network sees only part of it, and label prediction is done with only local information for large objects. Pixels that belong to the same object may then receive inconsistent labels, so a single object is split into several pieces (splitting). As the figure shows, a receptive field that is too small to capture the global appearance of a large object leads to multiple small regions being incorrectly labeled as separate objects. During training (train by cycle person), every pixel is assigned its corresponding label, so in that setting even small regions may still be recognized correctly.

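- (2) Objects that are too small are easily ignored and misclassified as background. Because the receptive field is large relative to such objects, each prediction is dominated by the surrounding context, and the small object itself is hard to recognize. In the figure, for example, the people are so small that FCN treats them as texture of the background region and fails to label them.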
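Both limitations come from the receptive field being fixed by the architecture. As a rough check, the arithmetic below computes the receptive field of the last pooling layer of a VGG-16 style stack (3x3 convolutions with stride 1, 2x2 max-pools with stride 2). The layer list and the `receptive_field` helper are written for this note and ignore padding effects at the image border.

```python
# (kernel_size, stride) for every layer of a VGG-16 style feature extractor.
CONV, POOL = (3, 1), (2, 2)
vgg16_layers = [CONV] * 2 + [POOL] + [CONV] * 2 + [POOL] + ([CONV] * 3 + [POOL]) * 3


def receptive_field(layers):
    """Return (receptive field size, cumulative stride), both in input pixels."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * current stride
        jump *= stride             # distance in input pixels between adjacent outputs
    return rf, jump


rf, stride = receptive_field(vgg16_layers)
print(rf, stride)  # 212 32
```

Every output unit therefore sees a 212 x 212 window regardless of object size: an object much larger than that is judged from local evidence only, and an object much smaller than that occupies a tiny fraction of the window.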

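After repeated downsampling, the feature map becomes coarse: its resolution drops sharply and its spatial size shrinks, and the detailed structure of objects is easily lost along the way. Feeding this coarse feature map into a deconvolution layer for upsampling still cannot recover a fine segmentation.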
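The loss of detail can be seen in a toy experiment. The sketch below is only an illustration under assumed sizes: a 4 x 4 object on a 352 x 480 mask is max-pooled five times and then upsampled back in one step, with nearest-neighbour interpolation standing in for a single coarse upsampling layer.

```python
import torch
import torch.nn.functional as F

# A 4 x 4 object on an otherwise empty 352 x 480 mask (sizes chosen for illustration).
mask = torch.zeros(1, 1, 352, 480)
mask[:, :, 100:104, 200:204] = 1.0

# Five 2 x 2 max-pools, as in a VGG-16 encoder, shrink the map by a factor of 32.
coarse = mask
for _ in range(5):
    coarse = F.max_pool2d(coarse, kernel_size=2, stride=2)

# One-shot upsampling back to the input resolution.
restored = F.interpolate(coarse, scale_factor=32, mode="nearest")

print(coarse.shape)                          # torch.Size([1, 1, 11, 15])
print(int(mask.sum()), int(restored.sum()))  # 16 1024
```

The object survives as a single coarse cell, but its shape is gone: after upsampling it covers a 32 x 32 block instead of the original 4 x 4 pixels.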
DeconvNet model architecture

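Both the encoder and the decoder are modeled on VGG-16.
Encoder (called the convolution network in the paper): two "conv-conv-pool" blocks followed by three "conv-conv-conv-pool" blocks.
Two fully connected layers sit in the middle. I do not think they are necessary: first, they sharply increase the number of parameters; second, flattening a 2D feature map into a 1D vector adds no real value and does little for model performance.
Decoder (called the deconvolution network in the paper): three "unpool-deconv-deconv-deconv" blocks followed by two "unpool-deconv-deconv" blocks, mirroring the encoder.
A final softmax classification layer.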
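To put a number on the parameter concern about the two fully connected layers, the arithmetic below counts their weights and biases. It assumes the layer sizes used in the implementation later in these notes (a 512 x 11 x 15 bottleneck for a 352 x 480 input and a 4096-unit hidden layer); other input sizes give different totals.

```python
# Flattened size of the pool5 output for a 3 x 352 x 480 input: 512 channels, 11 x 15 spatial.
flat = 512 * 11 * 15
hidden = 4096

# Each nn.Linear(in_features, out_features) holds in * out weights plus out biases.
fc_down = flat * hidden + hidden  # Linear(flat, hidden)
fc_up = hidden * flat + flat      # Linear(hidden, flat)
fc_total = fc_down + fc_up

print(flat)      # 84480
print(fc_total)  # 692148736
```

That is roughly 692 million parameters in the two linear layers alone, compared with about 14.7 million in all thirteen VGG-16 convolution layers.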
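The role of pooling:
Pooling in a convolution network is designed to filter noisy activations from the lower layer by abstracting the activations in a receptive field with a single representative value.
However, pooling also weakens the network's ability to capture positional information from the input image: spatial information within a receptive field is lost during pooling.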
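Unpooling is how the decoder gets that positional information back: the pooling layer records where each maximum came from, and the unpooling layer writes the value back to that location. A minimal sketch with PyTorch's `nn.MaxPool2d(return_indices=True)` and `nn.MaxUnpool2d`, using a toy 4 x 4 input chosen for this note:

```python
import torch
from torch import nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

# Toy 4 x 4 feature map holding the values 1..16 in row-major order.
x = torch.arange(1.0, 17.0).reshape(1, 1, 4, 4)

pooled, indices = pool(x)  # keeps one maximum per 2 x 2 window and its location
restored = unpool(pooled, indices, output_size=x.size())

print(pooled.squeeze())    # the four window maxima: 6, 8, 14, 16
print(restored.squeeze())  # maxima back at their original positions, zeros elsewhere
```

The unpooled map is sparse, which is why each unpooling layer in the decoder is followed by deconvolution layers that densify it.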
Code implementation
Table of parameters and sizes for each layer

import torch
import torchvision.models as models
from torch import nn

# VGG-16 backbone, built here without pretrained weights (weights=None).
# On torchvision < 0.13 use models.vgg16(pretrained=False) instead.
vgg16_pretrained = models.vgg16(weights=None)


def decoder(input_channel, output_channel, num=3):
    """Stack `num` (2 or 3) 3x3 transposed convolutions; only the last changes the channel count."""
    if num == 3:
        decoder_body = nn.Sequential(
            nn.ConvTranspose2d(input_channel, input_channel, 3, padding=1),
            nn.ConvTranspose2d(input_channel, input_channel, 3, padding=1),
            nn.ConvTranspose2d(input_channel, output_channel, 3, padding=1))
    elif num == 2:
        decoder_body = nn.Sequential(
            nn.ConvTranspose2d(input_channel, input_channel, 3, padding=1),
            nn.ConvTranspose2d(input_channel, output_channel, 3, padding=1))
    else:
        raise ValueError("num must be 2 or 3")
    return decoder_body


class VGG16_deconv(nn.Module):
    def __init__(self):
        super().__init__()

        # Positions of the five max-pool layers in vgg16.features; make them
        # return the argmax indices so the decoder can unpool.
        pool_list = [4, 9, 16, 23, 30]
        for index in pool_list:
            vgg16_pretrained.features[index].return_indices = True

        self.encoder1 = vgg16_pretrained.features[:4]
        self.pool1 = vgg16_pretrained.features[4]
        self.encoder2 = vgg16_pretrained.features[5:9]
        self.pool2 = vgg16_pretrained.features[9]
        self.encoder3 = vgg16_pretrained.features[10:16]
        self.pool3 = vgg16_pretrained.features[16]
        self.encoder4 = vgg16_pretrained.features[17:23]
        self.pool4 = vgg16_pretrained.features[23]
        self.encoder5 = vgg16_pretrained.features[24:30]
        self.pool5 = vgg16_pretrained.features[30]

        # Two fully connected layers, sized for a 3 x 352 x 480 input.
        self.classifier = nn.Sequential(
            nn.Linear(512 * 11 * 15, 4096),
            nn.ReLU(),
            nn.Linear(4096, 512 * 11 * 15),
            nn.ReLU(),
        )

        self.decoder5 = decoder(512, 512)
        self.unpool5 = nn.MaxUnpool2d(2, 2)
        self.decoder4 = decoder(512, 256)
        self.unpool4 = nn.MaxUnpool2d(2, 2)
        self.decoder3 = decoder(256, 128)
        self.unpool3 = nn.MaxUnpool2d(2, 2)
        self.decoder2 = decoder(128, 64, 2)
        self.unpool2 = nn.MaxUnpool2d(2, 2)
        self.decoder1 = decoder(64, 12, 2)  # 12 output classes
        self.unpool1 = nn.MaxUnpool2d(2, 2)

    def forward(self, x):                                   # 3, 352, 480
        encoder1 = self.encoder1(x)                         # 64, 352, 480
        output_size1 = encoder1.size()                      # 64, 352, 480
        pool1, indices1 = self.pool1(encoder1)              # 64, 176, 240

        encoder2 = self.encoder2(pool1)                     # 128, 176, 240
        output_size2 = encoder2.size()                      # 128, 176, 240
        pool2, indices2 = self.pool2(encoder2)              # 128, 88, 120

        encoder3 = self.encoder3(pool2)                     # 256, 88, 120
        output_size3 = encoder3.size()                      # 256, 88, 120
        pool3, indices3 = self.pool3(encoder3)              # 256, 44, 60

        encoder4 = self.encoder4(pool3)                     # 512, 44, 60
        output_size4 = encoder4.size()                      # 512, 44, 60
        pool4, indices4 = self.pool4(encoder4)              # 512, 22, 30

        encoder5 = self.encoder5(pool4)                     # 512, 22, 30
        output_size5 = encoder5.size()                      # 512, 22, 30
        pool5, indices5 = self.pool5(encoder5)              # 512, 11, 15

        # Flatten, pass through the fully connected layers, then restore the
        # (batch, 512, 11, 15) shape. The original hard-coded a batch size of 1.
        pool5_size = pool5.size()
        fc = self.classifier(pool5.view(pool5_size[0], -1))
        fc = fc.reshape(pool5_size)

        unpool5 = self.unpool5(input=fc, indices=indices5, output_size=output_size5)        # 512, 22, 30
        decoder5 = self.decoder5(unpool5)                                                   # 512, 22, 30
        unpool4 = self.unpool4(input=decoder5, indices=indices4, output_size=output_size4)  # 512, 44, 60
        decoder4 = self.decoder4(unpool4)                                                   # 256, 44, 60
        unpool3 = self.unpool3(input=decoder4, indices=indices3, output_size=output_size3)  # 256, 88, 120
        decoder3 = self.decoder3(unpool3)                                                   # 128, 88, 120
        unpool2 = self.unpool2(input=decoder3, indices=indices2, output_size=output_size2)  # 128, 176, 240
        decoder2 = self.decoder2(unpool2)                                                   # 64, 176, 240
        unpool1 = self.unpool1(input=decoder2, indices=indices1, output_size=output_size1)  # 64, 352, 480
        decoder1 = self.decoder1(unpool1)                                                   # 12, 352, 480
        return decoder1


if __name__ == "__main__":
    rgb = torch.randn(1, 3, 352, 480)
    net = VGG16_deconv()
    out = net(rgb)
    print(out.shape)  # torch.Size([1, 12, 352, 480])
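The network above returns raw per-class scores and leaves out the softmax layer listed in the architecture. A minimal sketch of how such an output could be turned into a label map and a training loss is shown below. The random `logits` stand in for the network output so the example runs without building the full model, and `target` is a hypothetical ground-truth label map.

```python
import torch
import torch.nn.functional as F

# Random logits stand in for the network output: batch 1, 12 classes, 352 x 480.
logits = torch.randn(1, 12, 352, 480)
target = torch.randint(0, 12, (1, 352, 480))  # hypothetical ground-truth labels

probs = F.softmax(logits, dim=1)  # per-pixel class probabilities
pred = probs.argmax(dim=1)        # per-pixel predicted class, shape (1, 352, 480)

# cross_entropy applies log-softmax internally, so it takes the raw logits.
loss = F.cross_entropy(logits, target)
```

Keeping the softmax out of the model and inside the loss is a common PyTorch convention; at inference time an `argmax` over the class dimension of the logits gives the same label map as an `argmax` over the probabilities.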

