[Paper Notes] VisualBERT: A Simple and Performant Baseline for Vision and Language

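VisualBERT: applicable to many kinds of tasks, with a simple architecture.

Differences from VL-BERT:

Token embeddings and visual feature embeddings together form a single embedding layer.
Position embeddings are used to align words with image regions.
VL-BERT is the paper completed before this one.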
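
To make the shared embedding layer more concrete, here is a minimal sketch in plain Python. It assumes each text position is the sum of a token, segment, and position embedding, and each image region is the sum of a projected region feature, a segment embedding, and the summed position embeddings of any aligned words. The table contents, the toy hidden size, the `project_region` stand-in, and the zero vector used when no alignment is given are all assumptions of this sketch, not the authors' implementation.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size (assumption; real models use far more dimensions)

def rand_vec(dim=HIDDEN):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]

# Hypothetical lookup tables standing in for learned parameters.
token_table = {tok: rand_vec() for tok in ["[CLS]", "a", "dog", "runs", "[SEP]"]}
segment_table = {"text": rand_vec(), "image": rand_vec()}
position_table = [rand_vec() for _ in range(16)]

def project_region(feature):
    """Stand-in for a learned projection of a detector feature to HIDDEN dims."""
    return [feature[i % len(feature)] for i in range(HIDDEN)]

def embed_text(tokens):
    return [add(token_table[t], segment_table["text"], position_table[i])
            for i, t in enumerate(tokens)]

def embed_regions(region_features, aligned_positions=None):
    """Embed image regions. If word alignments are given, reuse the summed
    position embeddings of the aligned words; otherwise use a zero vector
    (an assumption of this sketch)."""
    out = []
    for r, feat in enumerate(region_features):
        if aligned_positions and aligned_positions[r]:
            pos = add(*[position_table[p] for p in aligned_positions[r]])
        else:
            pos = [0.0] * HIDDEN
        out.append(add(project_region(feat), segment_table["image"], pos))
    return out

def build_input(tokens, region_features, aligned_positions=None):
    # Text and image embeddings are concatenated into one input sequence.
    return embed_text(tokens) + embed_regions(region_features, aligned_positions)

tokens = ["[CLS]", "a", "dog", "runs", "[SEP]"]
regions = [[0.1, 0.2], [0.3, 0.4, 0.5]]
# Region 0 is aligned with the word at position 2 ("dog"); region 1 has no alignment.
sequence = build_input(tokens, regions, aligned_positions=[[2], []])
print(len(sequence), len(sequence[0]))
```

The resulting sequence has one vector per text token followed by one vector per region, so a single Transformer stack can attend over both modalities.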

Three training stages

Task-Agnostic Pre-Training: Here we train VisualBERT on COCO using two visually-grounded language model objectives.
(1) Masked language modeling with the image. Some elements of the text input are masked and must be predicted, but vectors corresponding to image regions are not masked.
(2) Sentence-image prediction. For COCO, where there are multiple captions corresponding to one image, we provide a text segment consisting of two captions. One of the captions describes the image, while the other has a 50% chance to be another corresponding caption and a 50% chance to be a randomly drawn caption. The model is trained to distinguish these two situations.
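
The two objectives above can be sketched as data-preparation functions. This is only an illustration under stated assumptions: the function names, the 15% masking rate (borrowed from BERT, not stated in the quoted text), the `[MASK]` string, and the toy caption dictionary are all hypothetical.

```python
import random

MASK = "[MASK]"

def mask_text_only(tokens, num_regions, mask_prob=0.15, rng=random):
    """Mask some text tokens; image regions are never masked.

    Returns (masked_tokens, labels, region_flags). labels holds the original
    token at masked positions and None elsewhere; region_flags is all False.
    """
    masked, labels = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SEP]") and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    region_flags = [False] * num_regions
    return masked, labels, region_flags

def sample_caption_pair(image_id, captions_by_image, rng=random):
    """Build a two-caption text segment for sentence-image prediction.

    The first caption describes the image. The second is, with 50% chance
    each, another caption of the same image (label 1) or a caption drawn
    from a different image (label 0).
    """
    own = captions_by_image[image_id]
    first = rng.choice(own)
    if rng.random() < 0.5:
        others = [c for c in own if c != first] or own
        return first, rng.choice(others), 1
    other_ids = [i for i in captions_by_image if i != image_id]
    other_image = rng.choice(other_ids)
    return first, rng.choice(captions_by_image[other_image]), 0

rng = random.Random(0)
captions = {"img1": ["a dog runs", "a puppy on grass"], "img2": ["two cats sleep"]}
masked, labels, region_flags = mask_text_only(
    ["[CLS]", "a", "dog", "runs", "[SEP]"], num_regions=3, rng=rng)
first, second, label = sample_caption_pair("img1", captions, rng=rng)
print(masked, region_flags, (first, second, label))
```

The model would then be trained to recover the masked tokens from the text plus the unmasked regions, and to predict `label` from the caption pair and the image.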
Task-Specific Pre-Training: Before fine-tuning VisualBERT to a downstream task, we find it beneficial to train the model using the data of the task with the masked language modeling with the image objective. This step allows the model to adapt to the new target domain.

Fine-Tuning: This step mirrors BERT fine-tuning, where a task-specific input, output, and objective are introduced, and the Transformer is trained to maximize performance on the task.

This is somewhat like a trick from past competitions: inserting an extra round of pre-training between pre-training and fine-tuning.
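
The ordering of the three stages can be written down as a small schedule. The sketch below is only a bookkeeping illustration: `Stage`, `build_schedule`, `run_schedule`, and the string used as a placeholder for model weights are hypothetical names, and `train_fn` stands in for a real training loop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    objectives: tuple

def build_schedule(task_name):
    """Return the three stages in the order described in the notes."""
    return [
        Stage("task-agnostic pre-training", "COCO captions",
              ("masked LM with image", "sentence-image prediction")),
        Stage("task-specific pre-training", f"{task_name} data",
              ("masked LM with image",)),
        Stage("fine-tuning", f"{task_name} data",
              (f"{task_name} objective",)),
    ]

def run_schedule(schedule, train_fn, initial_weights="BERT initialization"):
    """Run stages in order, passing the result of one stage to the next."""
    weights = initial_weights  # placeholder for real parameters
    log = []
    for stage in schedule:
        weights = train_fn(weights, stage)
        log.append(stage.name)
    return weights, log

# A dummy train_fn that only records which stages were applied.
final, log = run_schedule(build_schedule("VQA"),
                          lambda w, stage: f"{w} -> {stage.name}")
print(final)
```

The middle stage reuses the masked language modeling objective on the downstream task's own data, which is the extra pre-training step the note refers to.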

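Four experimental tasks

The evaluation covers:

VQA 2.0: the visual question answering benchmark.
VCR: commonsense reasoning in question-answering form, with two subtasks, Q→A (answer the question) and QA→R (choose the rationale that justifies the answer).
NLVR2: a benchmark for deciding whether a text statement matches the given images.
Flickr30K: a benchmark linking images with their descriptions, used here to ground caption phrases in image regions.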

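Ablation study

C1: Skip task-agnostic pre-training, or pre-train on text only

Performance drops, which shows that pre-training the model on paired text and images is necessary.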

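C2: Restrict the scope of self-attention so that the two modalities stay separate in the early layers and only interact deeply at the top layers

Performance drops, which shows that the two modalities should not be isolated from each other; fusing and aligning them early is important.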
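
One possible way to express this restriction is with per-layer attention masks: block-diagonal masks in the early layers and full masks in the top layers. The sketch below is an assumption-laden illustration of that idea; the function names and the layer split are hypothetical, and the ablation in the paper may be implemented differently.

```python
def build_attention_mask(num_text, num_image, allow_cross_modal):
    """Return a square 0/1 matrix; mask[i][j] == 1 means position i may attend to j.

    Positions 0..num_text-1 are text tokens, the remaining ones are image regions.
    """
    total = num_text + num_image

    def is_text(index):
        return index < num_text

    return [[1 if allow_cross_modal or is_text(i) == is_text(j) else 0
             for j in range(total)]
            for i in range(total)]

def layer_masks(num_layers, num_separate, num_text, num_image):
    """The first num_separate layers keep the modalities apart;
    the remaining layers allow full text-image interaction."""
    return [build_attention_mask(num_text, num_image,
                                 allow_cross_modal=(layer >= num_separate))
            for layer in range(num_layers)]

masks = layer_masks(num_layers=3, num_separate=2, num_text=2, num_image=1)
for layer, mask in enumerate(masks):
    print(layer, mask)
```

With the full model, every layer would use the all-ones mask, so text and image positions can attend to each other from the first layer on.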

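C3: BERT initialization

The model is not very sensitive to the text-only BERT initialization; performance drops only slightly without it.

C4: Remove the sentence-image prediction (matching) pre-training objective

The experiments show that this pre-training objective is useful, but its effect is smaller than that of the other components.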
