Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
Tasks
Visual Description Generation
Image Description Generation
Standard Image Description Generation

Dense Image Description Generation: generate descriptions for localized regions or objects in an image.
Image Paragraph Generation: generate a paragraph-length description of an image.
Spoken Language Image Description Generation: produce spoken rather than written descriptions.
Stylistic Image Description Generation: add a linguistic style to the description, e.g., humor.
Unseen Objects Image Description Generation: describe objects that do not appear in the paired training data.
Diverse Image Description Generation: generate multiple, diverse descriptions for the same image.
Controllable Image Description Generation: generate descriptions controlled by user-specified objects or regions in the image.
Video Description Generation
Global Video Description Generation: generate a single description for an entire video.

Dense Video Description Generation: analogous to Dense Image Description Generation, generating descriptions for temporally localized events in a video.
Movie Description Generation: movie clips are used as input
Visual Storytelling
Image Storytelling:

Video Storytelling:

Visual Question Answering
Image Question Answering

Video Question Answering

Visual Dialog
Image Dialog

Video Dialog

Visual Reasoning
Image Reasoning

Video Reasoning

Visual Referring Expression
Image Referring Expression

Video Referring Expression

Visual Entailment
Image Entailment

Language-to-Vision Generation
Language-to-Image Generation
Language-to-Image Generation synthesizes an image from a text description, typically conditioned on a single sentence.

Image Manipulation: edit an image according to a textual instruction while keeping text-irrelevant regions unchanged; other variants modify image content interactively or through conversational exchanges.
Fine-grained Image Generation:
Sequential Image Generation: given a multi-sentence text, generate a corresponding sequence of images, effectively visualizing the story; this can be seen as the reverse of visual storytelling.
Language-to-Video Generation
Requires a stronger conditional generator than image generation because the temporal dimension must also be modeled.

Vision-and-Language Navigation
Image and Language Navigation

Multimodal Machine Translation
Image Machine Translation: translate a source-language sentence that describes an image into the target language, using the image as additional context.

Multisource MMT: differs in that the same image is described by sentences in multiple source languages simultaneously.

Machine Translation with Video

Dataset
Image Description Generation
- Flickr8K


- Flickr30K-Entities:http://bryanplummer.com/Flickr30kEntities/


- MSCOCO:http://cocodataset.org/#home

- MSCOCO-Entities:https://github.com/aimagelab/show-control-and-tell

- STAIR(Japanese captions):http://captions.stair.center

- Multi30K-CLID(German captions)
https://www.statmt.org/wmt16/multimodal-task.html

https://www.statmt.org/wmt17/multimodal-task.html

http://www.statmt.org/wmt18/multimodal-task.html

- Conceptual Captions (large-scale): https://ai.google.com/research/ConceptualCaptions/download

Video Description Generation
- Microsoft Video Description (MSVD, with captions in Chinese, English, German, and other languages): http://www.cs.utexas.edu/users/ml/clamp/videoDescription

- MPII Cooking (65 cooking activities): https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/human-activity-recognition/mpii-cooking-activities-dataset


- YouCook II:http://youcook2.eecs.umich.edu

- Textually Annotated Cooking Scenes (TACoS):http://www.coli.uni-saarland.de/projects/smile/page.php?id=tacos


- MPII Movie Description: https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/mpii-movie-description-dataset
- Montreal Video Annotation: https://mila.quebec/en/publications/public-datasets/m-vad
- MSR Video to Text: http://ms-multimedia-challenge.com/2017/dataset
- Videos Titles in the Wild: http://aliensunmin.github.io/project/video-language/index.html#VTW
- ActivityNet Captions: http://activity-net.org/challenges/2017/captioning.html
- ActivityNet Entities: https://github.com/facebookresearch/ActivityNet-Entities
Image Storytelling
- New York City Storytelling (NYC-Storytelling): split into 80% training, 10% validation, and 10% test; available at https://github.com/cesc-park/CRCN

- Disneyland Storytelling: split 0.8/0.1/0.1 into train/validation/test

- SIND: large-scale dataset, split 0.8/0.1/0.1 into train/validation/test

- VIST: the second version of SIND, http://visionandlanguage.net/VIST
Video Storytelling
- VideoStory
- VideoStory-NUS
Neither dataset is publicly available.
Image Question Answering
- VQA v1.0: answers are either open-ended (typically a single word) or chosen from predefined multiple-choice options; https://visualqa.org

- VQA v2.0

- OK-VQA:https://okvqa.allenai.org

Video Question Answering
- MovieQA: http://movieqa.cs.toronto.edu/home
- Television Question Answering (TVQA): http://tvqa.cs.unc.edu
- TVQA+ (an augmented version of TVQA): http://tvqa.cs.unc.edu/download_tvqa_plus.html
Image Dialog
- VisDial: a visual dialog dataset, https://visualdialog.org/data
- CLEVR-Dialog: released by Satwik Kottur et al., https://github.com/satwikkottur/clevr-dialog
Video Dialog
- Audio Visual Scene-Aware Dialog (AVSD): https://video-dialog.com
Image Reasoning
- Compositional Language and Elementary Visual Reasoning (CLEVR):https://cs.stanford.edu/people/jcjohns/clevr
- CLEVR-CoGenT:https://cs.stanford.edu/people/jcjohns/clevr
- GQA:https://cs.stanford.edu/people/dorarad/gqa
- Relational and Analogical Visual rEasoNing (RAVEN):http://wellyzhang.github.io/project/raven.html
Video Reasoning
Image Referring Expression
Real Images
- Reference COCO (RefCOCO): http://tamaraberg.com/referitgame
- Reference COCO+ (RefCOCO+)
- Reference COCOg (RefCOCOg): https://github.com/lichengunc/refer
- Reference Clef (RefClef): http://tamaraberg.com/referitgame
- Guess What:https://guesswhat.ai
Synthetic Images
- CLEVR-Ref+:https://cs.jhu.edu/~cxliu/2019/clevr-ref+
Video Referring Expression
Image Entailment
- V-SNLI
- SNLI-VE:https://github.com/necla-ml/SNLI-VE
Image Generation
- Oxford-102:http://www.robots.ox.ac.uk/~vgg/data/flowers/102
- Caltech-UCSD Birds (CUB):http://www.vision.caltech.edu/visipedia/CUB-200-2011.html
- MSCOCO-Gen:
Video Generation
No publicly available datasets.
- Text2Video
Image-and-Language Navigation
- Room-to-Room (R2R):https://bringmeaspoon.org
- AskNav:https://github.com/debadeepta/vnla
- Touchdown:https://github.com/lil-lab/touchdown
Machine Translation with Image
- Multi30K-MMT:https://www.statmt.org/wmt18/multimodal-task.html
Machine Translation with Video
Miscellaneous
These datasets are not designed for a specific task, but indirectly support the tasks above.
Visual Genome
- Visual Genome annotates objects and the relationships and interactions between them in an image.
- How2 is a large-scale dataset of instructional videos paired with English subtitles and Portuguese translations, accessible at https://srvk.github.io/how2-dataset.
- Berkeley Deep Drive eXplanation (BDD-X) contains driving videos annotated with textual descriptions and explanations of the vehicle's actions, available at https://github.com/JinkyuKimUCB/BDD-X-dataset.
Representation
Vision
Image Representation
- global feature representation: global features are typically extracted with AlexNet, VGG, GoogLeNet, Inception-v3, Residual Nets (ResNet), and DenseNets; however, directly reusing these pretrained features is often not ideal for tasks that combine language and vision (see the extraction sketch after this list).
- local feature representation: object-detection models such as R-CNN and its variants are widely used to extract region-level features.
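As a rough illustration of global feature extraction, the sketch below pools a pretrained ResNet-50 from torchvision into a single 2048-d vector; the model choice and the image path are illustrative assumptions, and region-level (local) features would instead come from a detector such as Faster R-CNN.

```python
# A minimal sketch of global image feature extraction, assuming torchvision's
# pretrained ResNet-50; the image path "example.jpg" is a placeholder.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()   # drop the classifier, keep the 2048-d pooled feature
resnet.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    global_feat = resnet(preprocess(image).unsqueeze(0))  # shape: [1, 2048]
print(global_feat.shape)
```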
Video Representation
Language
RNN, LSTM, BiLSTM, GRU, BiGRU, and Transformer encoders are commonly used; a minimal encoder sketch follows.
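A minimal sketch of one such encoder, a single-layer BiLSTM in PyTorch; the vocabulary size, dimensions, and dummy token ids are illustrative assumptions rather than values from the survey.

```python
# A minimal BiLSTM sentence encoder sketch; all sizes are illustrative.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)                  # [batch, seq_len, embed_dim]
        outputs, (h_n, _) = self.lstm(x)           # outputs: [batch, seq_len, 2*hidden_dim]
        # Concatenate final forward and backward hidden states as the sentence vector.
        sent_vec = torch.cat([h_n[0], h_n[1]], dim=-1)
        return outputs, sent_vec

encoder = SentenceEncoder()
tokens = torch.randint(0, 10000, (2, 12))          # dummy batch of 2 tokenized sentences
outputs, sent_vec = encoder(tokens)
print(outputs.shape, sent_vec.shape)               # [2, 12, 1024] [2, 1024]
```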
Vision and Language
Visual Storytelling


Visual Dialog


Visual Reasoning

Visual Referring Expression


Visual Entailment

Language-to-Vision Generation


Vision-and-Language Navigation

Multimodal Machine Translation


Evaluation Measures
Common Measures
Language Metrics
These metrics measure the degree of word overlap between machine-generated text and reference text.
Bilingual Evaluation Understudy (BLEU): a core machine translation metric that scores machine-generated text against reference translations; it is widely used for visual description generation, visual storytelling, visual dialog, and multimodal machine translation (a computation sketch follows this list). https://leimao.github.io/blog/BLEU-Score/, https://en.wikipedia.org/wiki/BLEU
Metric for Evaluation of Translation with Explicit ORdering (METEOR): a machine translation metric that also plays an important role in visual description generation, visual storytelling, and visual dialog. https://en.wikipedia.org/wiki/METEOR
Recall-Oriented Understudy for Gisting Evaluation (ROUGE): originally designed for summarization evaluation, it is frequently used for visual description generation. http://www.ccs.neu.edu/home/vip/teach/DMcourse/5_topicmodel_summ/notes_slides/What-is-ROUGE.pdf
Consensus-based Image Description Evaluation (CIDEr): an evaluation measure designed for image description generation; it is also applied to video description and visual storytelling.
Semantic Propositional Image Captioning Evaluation (SPICE): evaluates captions by matching scene-graph tuples (objects, attributes, and relations) against the references.
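A minimal sketch of sentence-level BLEU using NLTK; the candidate and reference captions are made-up examples, and smoothing is applied as is common when higher-order n-gram matches are sparse.

```python
# Sentence-level BLEU-4 with NLTK; captions are illustrative examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "is", "riding", "a", "horse"]]   # one tokenized reference
candidate = ["a", "man", "rides", "a", "horse"]             # tokenized system output

score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),                       # uniform 1- to 4-gram weights
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```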
Retrieval Metrics
- Recall at k (R@k) (see the sketch after this list)
- Median rank (MedRank)
- Mean reciprocal rank (MRR)
- Mean rank (Mean)
- Normalized discounted cumulative gain (NDCG)
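A minimal sketch of the simpler retrieval metrics above (R@k, MedRank, MRR, mean rank), assuming the 1-based rank of the ground-truth item is already known for each query; the rank values are illustrative.

```python
# Retrieval metrics from per-query ranks of the ground-truth item (1-based).
import statistics

def recall_at_k(ranks, k):
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 2, 10, 1, 5]                       # illustrative ranks for 6 queries
print("R@1:", recall_at_k(ranks, 1))
print("R@5:", recall_at_k(ranks, 5))
print("MedRank:", statistics.median(ranks))
print("MRR:", round(mean_reciprocal_rank(ranks), 3))
print("Mean rank:", statistics.mean(ranks))
```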
Task-specific Metrics
Image Reasoning
- Querying attribute (QA)
- Compare Attribute (CA)
- Compare Numbers (CN)
- Count
- Exist
Video Reasoning
- Pointing
- Yes/No
- Conditional (Condit)
- Attribute-related (Atts)
Language-to-Vision Generation
- Inception Score (IS)
- Fréchet Inception Distance (FID) (see the sketch below)
- R-precision
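As a rough illustration of FID, the sketch below computes the Fréchet distance between two feature sets; in practice these would be Inception-v3 activations of real and generated images, whereas the random arrays here are stand-ins.

```python
# Fréchet distance between two Gaussian-fitted feature sets (FID-style).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):                  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))

real = np.random.randn(500, 64)                   # stand-ins for Inception activations
fake = np.random.randn(500, 64) + 0.5
print("FID:", frechet_distance(real, fake))
```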
Vision-and-Language Navigation
- Path Length (PL)
- Navigation Error (NE)
- Success Rate (SR)
- Oracle Success Rate (OSR)
- Success weighted by Path Length (SPL) (see the sketch after this list)
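A minimal sketch of SPL following its standard definition (success weighted by the ratio of shortest-path length to the length of the path actually taken); the episode values are illustrative.

```python
# SPL: success weighted by (shortest path length / max(path taken, shortest path)).
def spl(episodes):
    """episodes: list of (success, shortest_path_len, agent_path_len) tuples."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

episodes = [
    (True, 10.0, 12.0),   # success with a small detour
    (True, 8.0, 8.0),     # success along the shortest path
    (False, 9.0, 20.0),   # failure contributes zero
]
print("SPL:", round(spl(episodes), 3))            # (10/12 + 1 + 0) / 3 ≈ 0.611
```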
Human Evaluation
State-of-the-Art Results
Visual Storytelling Results
Image Storytelling


Video Storytelling

Visual Dialog Results
Image Dialog


Video Dialog

Visual Reasoning Results
Image Reasoning

Video Reasoning

Visual Referring Expression Results
Image Referring Expression

Video Referring Expression

Visual Entailment Results
Image Entailment


Language-to-Vision Generation Results
Language-to-Image Generation



Language-to-Video Generation

Vision-and-Language Navigation Results
Image-and-Language Navigation

Multimodal Machine Translation Results
Machine Translation with Image


Machine Translation with Video

Future Directions
- Leveraging External Knowledge: exploit external knowledge effectively to improve performance.
- Solving Challenges in Large-scale Data Handling: address the challenges and opportunities of handling large-scale data.
- Integrating Multiple Tasks: integrate multiple tasks to achieve broader performance gains.
- Innovative Neural Architectures for Representation: introduce new neural architectures to improve representation learning.
- Image and Video Differences: images and videos differ significantly, and the combination of video and language still needs deeper study.
- Automated Evaluation Mechanisms: develop automated evaluation mechanisms to improve efficiency.
