
【Transformer Paper Review】TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

一、Background

With its strong ability to model global dependencies, the Transformer and its variants have become the dominant architecture for many vision-and-language tasks. However, in tasks such as visual question answering (VQA) and referring expression comprehension (REC), the multimodal prediction often requires visual information ranging from the macro to the micro level. How to dynamically schedule global and local dependency modeling in the Transformer therefore becomes a pressing problem.

二、Motivation

Within various V&L tasks, including visual question answering (VQA) and directed expressive comprehension (REC), multimodal reasoning typically demands visual attention from diverse receptive fields. It is not only essential for the model to grasp the overarching semantics but also critical to capturing intricate local connections to ensure accurate responses.

[Figure: example VQA/REC questions that require different visual attention spans.]

三、Model

(一)The framework of TRAR

[Figure: the overall framework of TRAR.]

(二)Routing Process

A straightforward way to handle flexible routing across different examples is to build a multi-branch network, where each layer contains several candidate modules with distinct configurations. Specifically, given the features from the last inference step X ∈ R^{n×d} and the routing space F = [F_0, …, F_N], where each F_i denotes a candidate module, the output is computed as follows:

$$X' = \sum_{i=0}^{N} \alpha_i F_i(X), \qquad \text{s.t.}\ \sum_{i=0}^{N} \alpha_i = 1,$$

where $\alpha = [\alpha_0, \ldots, \alpha_N]$ denotes the path probabilities over the candidate modules.
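To make the formula concrete, here is a minimal PyTorch sketch of this naive multi-branch routing, assuming each candidate branch is an arbitrary nn.Module that maps (batch, n, d) features to the same shape and that the path probabilities alpha are predicted elsewhere; the class name NaiveSoftRouting and its calling convention are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class NaiveSoftRouting(nn.Module):
    """Naive multi-branch routing: run every candidate module F_i and
    blend the outputs with the path probabilities alpha (a sketch, not
    the paper's implementation)."""

    def __init__(self, branches: nn.ModuleList):
        super().__init__()
        self.branches = branches  # routing space F = [F_0, ..., F_N]

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d); alpha: (batch, N + 1), each row sums to 1
        outputs = torch.stack([f(x) for f in self.branches])  # (N + 1, batch, n, d)
        return torch.einsum("bk,kbnd->bnd", alpha, outputs)
```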

Nevertheless, as the above equation shows, such a routing scheme makes the network considerably larger and substantially increases the training cost, because every candidate module must be executed for every input.

The goal is therefore to cut this overhead by redefining the routing space. To do so, the authors revisit the standard self-attention (SA) mechanism, defined as:

$$\mathrm{SA}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad Q = XW_q,\; K = XW_k,\; V = XW_v.$$
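For reference, below is a minimal single-head PyTorch sketch of this formula; multi-head attention and the rest of the Transformer block are omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention over X in R^{n x d}."""

    def __init__(self, d: int):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)  # (batch, n, n)
        attn = scores.softmax(dim=-1)  # weighted adjacency matrix A
        return attn @ v
```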

From this point of view, SA acts as a feature-update mechanism on a fully connected graph, where A = softmax(QK^⊤/√d) ∈ R^{n×n} plays the role of a weighted adjacency matrix. Capturing different attention spans then amounts to limiting the graph connections of each input element, which is realized by multiplying the scaled dot-product scores element-wise with an adjacency mask D ∈ R^{n×n}, as shown below.

$$\mathrm{SA}(X, D) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} \odot D\right)V,$$

where D restricts which elements each position can attend to.
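A small sketch of this masked attention follows. It applies the mask multiplicatively to the score matrix to mirror the equation above; an additive −∞ mask before the softmax is a common alternative in practice, and the function name masked_attention is illustrative.

```python
import math
import torch

def masked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     d_mask: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention restricted by an adjacency mask.
    q, k, v: (batch, n, d); d_mask: (n, n) with 1 = keep edge, 0 = drop edge."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, n, n)
    scores = scores * d_mask  # limit the graph connections of each element
    return scores.softmax(dim=-1) @ v
```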

Based on the above equations, a routing layer for SA is then defined as:

$$\mathrm{RoutSA}(X) = \sum_{i=0}^{N} \alpha_i\, \mathrm{SA}(X, D_i),$$

where $D_i$ is the adjacency mask of the i-th candidate attention span.

This formula is still computationally intensive, since it requires N + 1 separate attention computations. Because the candidate branches differ only in their masks, the authors reduce the cost by recasting module selection as the selection of the adjacency mask D:

$$\mathrm{RoutSA}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} \odot \bar{D}\right)V, \qquad \bar{D} = \sum_{i=0}^{N} \alpha_i D_i,$$

so that only a single attention pass is needed regardless of the size of the routing space.
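Below is a minimal sketch of this mask-routing idea, assuming the candidate masks are pre-built and stacked into a tensor of shape (N+1, n, n); the class name RoutedSelfAttention and the per-example alpha are illustrative choices, not the official TRAR implementation.

```python
import math
import torch
import torch.nn as nn

class RoutedSelfAttention(nn.Module):
    """Self-attention whose span is routed by blending candidate adjacency
    masks, so only one attention pass is needed (a sketch of the idea,
    not the official TRAR code)."""

    def __init__(self, d: int, masks: torch.Tensor):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.register_buffer("masks", masks)  # (N + 1, n, n) candidate masks D_i
        self.d = d

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d); alpha: (batch, N + 1) path probabilities per example
        d_bar = torch.einsum("bk,knm->bnm", alpha, self.masks)  # routed mask D_bar
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)
        return (scores * d_bar).softmax(dim=-1) @ v
```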

Within TRAR, each self-attention (SA) layer is equipped with a path controller, i.e., the module responsible for path selection, which predicts the probabilities of the routing paths. Given the input feature X ∈ R^{n×d}, the path probability distribution α ∈ R^{N+1} is formulated as:

$$\alpha = \mathrm{softmax}\big(f_{r}(X)\big),$$

where $f_{r}(\cdot)$ is a lightweight routing network that pools X and maps it to one logit per routing path.
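As an illustration, here is one plausible path-controller sketch, assuming mean pooling over the n input elements followed by a two-layer MLP; the pooling choice, hidden size, and class name PathController are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class PathController(nn.Module):
    """Predicts the routing-path probabilities alpha from the layer input
    (pooling choice and hidden size are assumptions, not paper details)."""

    def __init__(self, d: int, num_paths: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_paths),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d) -> pooled (batch, d) -> alpha (batch, num_paths)
        pooled = x.mean(dim=1)
        return self.mlp(pooled).softmax(dim=-1)
```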

(三)Optimization

With the softmax function, routing-path selection becomes a continuous and differentiable operation, so the router and the rest of the network can be optimized end-to-end with the task-specific objective, $\arg\min_{w,z} L_{\mathrm{train}}(w, z)$ (with network weights w and routing parameters z). At test time, the model dynamically blends feature representations from different attention spans. Since soft routing adds no further parameters or constraints, training remains relatively straightforward.

Hard routing, in contrast, aims at binary path selection, which further allows dedicated CUDA kernels to accelerate inference. However, discrete selection makes the router's routing weights non-differentiable, and simply binarizing the soft routing results would create a gap between the features seen at training and at test time. To address this, the authors adopt the Gumbel-Softmax trick to make path routing differentiable:

$$\alpha_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j=0}^{N} \exp\big((\log \pi_j + g_j)/\tau\big)}, \qquad g_i \sim \mathrm{Gumbel}(0, 1),$$

where $\pi$ are the path probabilities predicted by the router and $\tau$ is the temperature.
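In PyTorch this sampling is available out of the box; the snippet below is a small usage sketch in which router_logits stands in for the pre-softmax scores of a path controller (placeholder shapes and values).

```python
import torch
import torch.nn.functional as F

# Hard-routing sketch: sample a near-one-hot path selection from router
# logits with the Gumbel-Softmax trick. `router_logits` is a placeholder
# for the pre-softmax scores of a path controller.
router_logits = torch.randn(4, 3)  # (batch, num_paths), dummy values

# hard=True returns one-hot samples in the forward pass, while gradients
# flow through the soft (relaxed) samples via the straight-through trick.
alpha_hard = F.gumbel_softmax(router_logits, tau=1.0, hard=True, dim=-1)
print(alpha_hard)  # each row selects exactly one attention span
```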

四、Experiment

  • To validate TRAR, the authors apply it to visual question answering (VQA) and referring expression comprehension (REC). Extensive experiments were performed on five benchmark datasets: VQA2.0, CLEVR, RefCOCO, RefCOCO+, and RefCOCOg.

(一) Ablations

[Tables: ablation studies.]

(二) Comparison with SOTA

[Tables: comparison with state-of-the-art methods on the VQA and REC benchmarks.]

(三) Qualitative Analysis

[Figures: qualitative results.]

五、Conclusion

  • In this paper, the authors investigate dependency modeling within Transformers for two vision-language tasks: Visual Question Answering (VQA) and Referring Expression Comprehension (REC). These tasks typically demand visual attention at different spatial scales, which conventional Transformers cannot address well.
  • To address these challenges, the authors introduce a lightweight routing mechanism termed Transformer Routing (TRAR), which lets each sample dynamically adjust its attention span. TRAR recasts module selection as the selection of attention masks, thereby keeping the additional computation and memory overhead minimal.
  • To validate TRAR's efficacy, extensive experiments were conducted across five benchmark datasets. The empirical results demonstrate that TRAR significantly outperforms existing methods.
