Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Tasks

Visual Description Generation

Image Description Generation

Standard Image Description Generation

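Dense Image Description Generation: generates descriptions for local objects or regions in an image

Image Paragraph Generation: generates a paragraph-length description

Spoken Language Image Description Generation: produces spoken rather than written descriptions

Stylistic Image Description Generation: adds a linguistic style, such as humor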

Unseen Objects Image Description Generation:

Diverse Image Description Generation:

Controllable Image Description Generation: generates descriptions that are controlled by manipulating items within an image.

Video Description Generation

Global Video Description Generation:

Dense Video Description Generation: analogous to Dense Image Description Generation

Movie Description Generation: movie clips are used as input

Visual Storytelling

Image Storytelling:

Video Storytelling:

Visual Question Answering

Image Question Answering

Video Question Answering

Visual Dialog

Image Dialog

Video Dialog

Visual Reasoning

Image Reasoning

Video Reasoning

Visual Referring Expression

Image Referring Expression

Video Referring Expression

Visual Entailment

Image Entailment

Language-to-Vision Generation

Language-to-Image Generation
Sentence-level Language-to-Image Generation: generates an image from a sentence-level text description.

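Image Manipulation: text guides the editing of an image while content unrelated to the text stays unchanged. Variants modify the image interactively or through dialog.

Fine-grained Image Generation:

Sequential Image Generation: given a text made up of several sentences, generates a sequence of images that visualizes it. This amounts to visualizing a story, in contrast to visual storytelling, which goes from images to text.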

Language-to-Video Generation

Requires a stronger conditional generator, because the temporal dimension must also be modeled.

Image-and-Language Navigation

Multimodal Machine Translation

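Image Machine Translation: a source-language sentence describes an image and is translated into the target language

Multisource MMT: differs in that the same image is described in several languages at once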

Machine Translation with Video

Dataset

Image Description Generation

  • Flickr8K
  • Multi30K-CLID (German captions)

https://www.statmt.org/wmt16/multimodal-task.html

https://www.statmt.org/wmt17/multimodal-task.html

http://www.statmt.org/wmt18/multimodal-task.html

Conceptual Captions (large-scale dataset): https://ai.google.com/research/ConceptualCaptions/download

Video Description Generation

Microsoft Video Description (MSVD; descriptions in Chinese, English, German, and other languages): http://www.cs.utexas.edu/users/ml/clamp/videoDescription

MPII Cooking (65 cooking activities): https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/human-activity-recognition/mpii-cooking-activities-dataset

TACoS Multi-Level corpus: https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus

Image Storytelling

  • New York City Storytelling (NYC-Storytelling): split into 80% training, 10% validation, and 10% test; available on GitHub at https://github.com/cesc-park/CRCN
  • Disneyland Storytelling: split 0.8 / 0.1 / 0.1
  • SIND: large-scale dataset, split 0.8 / 0.1 / 0.1
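
The 0.8 / 0.1 / 0.1 partition used by these datasets can be sketched as follows. This is only an illustration of the ratio: the function name `split_dataset` and the fixed seed are assumptions, and the datasets themselves may ship with their own predefined splits, which should be preferred when available.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle items and split them 80% / 10% / 10% (train / val / test)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    n_train = n * 8 // 10               # integer arithmetic avoids float rounding
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test
```

With a collection of 100 stories this yields 80, 10, and 10 items; for sizes not divisible by 10 the leftover items fall into the test portion.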

Video Storytelling

  • VideoStory
  • VideoStory-NUS

Neither dataset is publicly available.

Image Question Answering

  • VQA v1.0: answers are either open-ended (typically a single word or short phrase) or selected from predefined options: https://visualqa.org
  • VQA v2.0

Video Question Answering

Image Dialog

Video Dialog

Image Reasoning

Video Reasoning

Image Referring Expression

Real Images

Synthetic Images

Video Referring Expression

Image Entailment

Image Generation

Video Generation

No publicly available datasets.

  • Text2Video

Machine Translation with Image

Machine Translation with Video

Miscellaneous

Not designed for a specific task, but indirectly helpful for the tasks above.

  • Visual Genome: captures the interactions and relationships between objects observed in an image.
  • How2: a collection of instructional (how-to) videos for multimodal language understanding, available at https://srvk.github.io/how2-dataset
  • Berkeley Deep Drive eXplanation (BDD-X): driving videos paired with textual explanations, aimed at autonomous driving; available at https://github.com/JinkyuKimUCB/BDD-X-dataset

Representation

Vision

Image Representation

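  • global feature representation: global features are mainly extracted with AlexNet, VGG, GoogLeNet, Inception-v3, Residual Nets (ResNet), and DenseNets. However, these pretrained features are not well suited to direct use in tasks that combine language and vision.
  • local feature representation: extracted with models widely used in object detection, such as R-CNN.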

Video Representation

Language

Commonly used: RNN, LSTM, BiLSTM, GRU, BiGRU, Transformer

Vision and Language

Visual Storytelling

Visual Dialog

Visual Reasoning

Visual Referring Expression

Visual Entailment

Language-to-Vision Generation

Vision-and-Language Navigation

Multimodal Machine Translation

Evaluation Measures

Common Measures

Language Metrics

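These metrics measure the word overlap between machine-generated text and reference text.

  • Bilingual Evaluation Understudy (BLEU): a core machine translation metric that scores generated text against one or more reference translations. It is widely used in visual description generation, visual storytelling, visual dialog, and multimodal machine translation. https://leimao.github.io/blog/BLEU-Score/, https://en.wikipedia.org/wiki/BLEU
  • Metric for Evaluation of Translation with Explicit Ordering (METEOR): a machine translation metric that takes explicit word order into account. It is used in visual description generation, visual storytelling, and visual dialog. https://en.wikipedia.org/wiki/METEOR
  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE): originally designed for evaluating automatic summarization, and used in visual description generation. http://www.ccs.neu.edu/home/vip/teach/DMcourse/5_topicmodel_summ/notes_slides/What-is-ROUGE.pdf
  • Consensus-based Image Description Evaluation (CIDEr): designed specifically for evaluating image descriptions, and also applied in visual description generation and visual storytelling.
  • Semantic Propositional Image Caption Evaluation (SPICE):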
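
To make the word-overlap idea concrete, here is a minimal sketch of clipped n-gram precision with a brevity penalty, the two ingredients of BLEU. It is a simplified, single-reference, sentence-level illustration and not a replacement for a standard implementation: multiple references, corpus-level aggregation, and smoothing are omitted, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def sentence_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions (n = 1..max_n) times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_avg)
```

The clipping step is what stops a degenerate candidate such as "the the the the" from scoring a perfect unigram precision against a reference that contains "the" only twice.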

Retrieval Metrics

  • Recall at k (R@k)
  • Median rank (MedRank)
  • Mean reciprocal rank (MRR)
  • Mean rank (Mean)
  • Normalized discounted cumulative gain (NDCG)
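
The retrieval metrics listed above can be sketched in a few lines. The sketch assumes each query has exactly one ground-truth item whose 1-based rank in the returned list is known, and that NDCG receives graded relevance scores in ranked order. Function names are illustrative, and individual benchmarks may define the details differently.

```python
import math
from statistics import mean, median

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth item appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    return median(ranks)

def mean_rank(ranks):
    return mean(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return mean([1.0 / r for r in ranks])

def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    def dcg(scores):
        return sum(s / math.log2(pos + 1) for pos, s in enumerate(scores, start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For R@k, MRR, and NDCG higher is better, while for the median and mean rank lower is better.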

Task-specific Metrics

Image Reasoning

  • Query Attribute (QA)
  • Compare Attribute (CA)
  • Compare Numbers (CN)
  • Count
  • Exist

Video Reasoning

  • Pointing
  • Yes/No
  • Conditional (Condit)
  • Attribute-related (Atts)

Language-to-Vision Generation

  • Inception Score (IS)
  • Fréchet Inception distance (FID)
  • R-precision
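
As an illustration of how a generation metric is computed, the Inception Score is the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal p(y). The sketch below assumes the class probabilities have already been produced by a classifier (an Inception network in the usual setup) and takes them as plain lists. The function name is hypothetical, and the common practice of averaging over several data splits is omitted.

```python
import math

def inception_score(probs):
    """probs: list of per-image class-probability vectors, each summing to 1."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(num_classes)]
    kl_total = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_total += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )
    return math.exp(kl_total / n)
```

Confident and diverse predictions give a high score (two images assigned with certainty to two different classes yield 2.0), while identical uniform predictions yield 1.0.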

Vision-and-Language Navigation

  • Path Length (PL)
  • Navigation Error (NE)
  • Success Rate (SR)
  • Oracle Success Rate (OSR)
  • Success weighted by Path Length (SPL)
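
A minimal sketch of how these navigation metrics relate to one another is given below. It assumes every episode records the length of the path taken, the shortest-path length to the goal (assumed positive), the final distance to the goal, and the closest distance reached along the way. The `Episode` fields, the function name, and the 3.0 success threshold are illustrative assumptions. Benchmarks fix their own threshold.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    path_length: float      # length of the trajectory actually taken
    shortest_length: float  # shortest-path length from start to goal (> 0)
    final_distance: float   # distance to the goal at the stopping point
    min_distance: float     # closest distance to the goal along the trajectory

def navigation_metrics(episodes, threshold=3.0):
    n = len(episodes)
    success = [e.final_distance <= threshold for e in episodes]
    oracle = [e.min_distance <= threshold for e in episodes]
    # SPL: success weighted by the ratio of shortest to actual path length.
    spl = sum(
        s * e.shortest_length / max(e.path_length, e.shortest_length)
        for s, e in zip(success, episodes)
    ) / n
    return {
        "PL": sum(e.path_length for e in episodes) / n,
        "NE": sum(e.final_distance for e in episodes) / n,
        "SR": sum(success) / n,
        "OSR": sum(oracle) / n,
        "SPL": spl,
    }
```

SPL penalizes successful episodes that take detours, so it can never exceed SR. OSR is never lower than SR, because an agent that stops within the threshold was also within it at some point on its path.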

Human Evaluation

State-of-the-Art Results

Visual Storytelling Results

Image Storytelling

Video Storytelling

Visual Dialog Results

Image Dialog

Video Dialog

Visual Reasoning Results

Image Reasoning

Video Reasoning

Visual Referring Expression Results

Image Referring Expression

Video Referring Expression

Visual Entailment Results

Image Entailment

Language-to-Vision Generation Results

Language-to-Image Generation

Language-to-Video Generation

Image-and-Language Navigation

Multimodal Machine Translation Results

Machine Translation with Image

Machine Translation with Video

Future Directions

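  • Leveraging External Knowledge: improve performance through effective use of external knowledge
  • Solving Challenges in Large-scale Data Handling: address the challenges and opportunities that large-scale data brings
  • Integrating Multiple Tasks: combine multiple tasks to improve overall performance
  • Innovative Neural Architectures for Representation: introduce new neural architectures to improve representations
  • Image and Video Differences: images and video differ substantially, and the combination of video and language still needs deeper study
  • Automated Evaluation Mechanisms: develop automatic evaluation measures to make evaluation more efficient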
