
【Transformer Paper Review】TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

一、Background

With its strong ability to model global dependencies, the Transformer and its variants have become the dominant architecture for many vision-and-language tasks. However, in tasks such as visual question answering (VQA) and referring expression comprehension (REC), the multimodal prediction often requires visual information ranging from the macro to the micro level. How to dynamically schedule global and local dependency modeling in the Transformer therefore becomes a pressing problem.

二、Motivation

Within various V&L tasks, including visual question answering (VQA) and directed expressive comprehension (REC), multimodal reasoning typically demands visual attention from diverse receptive fields. It is not only essential for the model to grasp the overarching semantics but also critical to capturing intricate local connections to ensure accurate responses.

[Figure: example VQA/REC questions that require different visual attention spans.]

三、Model

(一)The framework of TRAR

[Figure: the overall framework of TRAR.]

(二)Routing Process

A straightforward way to handle flexible routing across different examples is to build a multi-branch network, where each layer contains several candidate modules with distinct configurations. Specifically, given the features from the last inference step X ∈ R^{n×d} and the routing space F = [F_0, …, F_N], where each F_i denotes a candidate module, the output is computed as follows:

$$X' = \sum_{i=0}^{N} \alpha_i F_i(X), \qquad \text{s.t.}\ \sum_{i=0}^{N} \alpha_i = 1,$$

where $\alpha = [\alpha_0, \ldots, \alpha_N]$ denotes the path probabilities over the candidate modules.
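To make the formula concrete, here is a minimal PyTorch sketch of this naive multi-branch routing, assuming each candidate branch is an arbitrary nn.Module that maps (batch, n, d) features to the same shape and that the path probabilities alpha are predicted elsewhere; the class name NaiveSoftRouting and its calling convention are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class NaiveSoftRouting(nn.Module):
    """Naive multi-branch routing: run every candidate module F_i and
    blend the outputs with the path probabilities alpha (a sketch, not
    the paper's implementation)."""

    def __init__(self, branches: nn.ModuleList):
        super().__init__()
        self.branches = branches  # routing space F = [F_0, ..., F_N]

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d); alpha: (batch, N + 1), each row sums to 1
        outputs = torch.stack([f(x) for f in self.branches])  # (N + 1, batch, n, d)
        return torch.einsum("bk,kbnd->bnd", alpha, outputs)
```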

Nevertheless, as the above equation shows, such a routing scheme makes the network considerably larger and substantially increases the training cost, because every candidate module must be executed for every input.

The goal is therefore to cut this overhead by redefining the routing space. To do so, the authors revisit the standard self-attention (SA) mechanism, defined as:

$$\mathrm{SA}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad Q = XW_q,\; K = XW_k,\; V = XW_v.$$
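For reference, below is a minimal single-head PyTorch sketch of this formula; multi-head attention and the rest of the Transformer block are omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention over X in R^{n x d}."""

    def __init__(self, d: int):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)  # (batch, n, n)
        attn = scores.softmax(dim=-1)  # weighted adjacency matrix A
        return attn @ v
```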

From this point of view, SA acts as a feature-update mechanism on a fully connected graph, where A = softmax(QK^⊤/√d) ∈ R^{n×n} plays the role of a weighted adjacency matrix. Capturing different attention spans then amounts to limiting the graph connections of each input element, which is realized by multiplying the scaled dot-product scores element-wise with an adjacency mask D ∈ R^{n×n}, as shown below.

$$\mathrm{SA}(X, D) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} \odot D\right)V,$$

where D restricts which elements each position can attend to.
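A small sketch of this masked attention follows. It applies the mask multiplicatively to the score matrix to mirror the equation above; an additive −∞ mask before the softmax is a common alternative in practice, and the function name masked_attention is illustrative.

```python
import math
import torch

def masked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     d_mask: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention restricted by an adjacency mask.
    q, k, v: (batch, n, d); d_mask: (n, n) with 1 = keep edge, 0 = drop edge."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, n, n)
    scores = scores * d_mask  # limit the graph connections of each element
    return scores.softmax(dim=-1) @ v
```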

Based on the above equations, a routing layer for SA is then defined as:

$$\mathrm{RoutSA}(X) = \sum_{i=0}^{N} \alpha_i\, \mathrm{SA}(X, D_i),$$

where $D_i$ is the adjacency mask of the i-th candidate attention span.

This formula is still computationally intensive, since it requires N + 1 separate attention computations. Because the candidate branches differ only in their masks, the authors reduce the cost by recasting module selection as the selection of the adjacency mask D:

$$\mathrm{RoutSA}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} \odot \bar{D}\right)V, \qquad \bar{D} = \sum_{i=0}^{N} \alpha_i D_i,$$

so that only a single attention pass is needed regardless of the size of the routing space.
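Below is a minimal sketch of this mask-routing idea, assuming the candidate masks are pre-built and stacked into a tensor of shape (N+1, n, n); the class name RoutedSelfAttention and the per-example alpha are illustrative choices, not the official TRAR implementation.

```python
import math
import torch
import torch.nn as nn

class RoutedSelfAttention(nn.Module):
    """Self-attention whose span is routed by blending candidate adjacency
    masks, so only one attention pass is needed (a sketch of the idea,
    not the official TRAR code)."""

    def __init__(self, d: int, masks: torch.Tensor):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.register_buffer("masks", masks)  # (N + 1, n, n) candidate masks D_i
        self.d = d

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d); alpha: (batch, N + 1) path probabilities per example
        d_bar = torch.einsum("bk,knm->bnm", alpha, self.masks)  # routed mask D_bar
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)
        return (scores * d_bar).softmax(dim=-1) @ v
```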

Within TRAR, each self-attention (SA) layer is equipped with a path controller, i.e., the module responsible for path selection, which predicts the probabilities of the routing paths. Given the input feature X ∈ R^{n×d}, the path probability distribution α ∈ R^{N+1} is formulated as:

$$\alpha = \mathrm{softmax}\big(f_{r}(X)\big),$$

where $f_{r}(\cdot)$ is a lightweight routing network that pools X and maps it to one logit per routing path.
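As an illustration, here is one plausible path-controller sketch, assuming mean pooling over the n input elements followed by a two-layer MLP; the pooling choice, hidden size, and class name PathController are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class PathController(nn.Module):
    """Predicts the routing-path probabilities alpha from the layer input
    (pooling choice and hidden size are assumptions, not paper details)."""

    def __init__(self, d: int, num_paths: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_paths),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d) -> pooled (batch, d) -> alpha (batch, num_paths)
        pooled = x.mean(dim=1)
        return self.mlp(pooled).softmax(dim=-1)
```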

(三)Optimization

With the softmax function, routing-path selection becomes a continuous and differentiable operation, so the router and the rest of the network can be optimized end-to-end with the task-specific objective, $\arg\min_{w,z} L_{\mathrm{train}}(w, z)$ (with network weights w and routing parameters z). At test time, the model dynamically blends feature representations from different attention spans. Since soft routing adds no further parameters or constraints, training remains relatively straightforward.

Hard routing, in contrast, aims at binary path selection, which further allows dedicated CUDA kernels to accelerate inference. However, discrete selection makes the router's routing weights non-differentiable, and simply binarizing the soft routing results would create a gap between the features seen at training and at test time. To address this, the authors adopt the Gumbel-Softmax trick to make path routing differentiable:

$$\alpha_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j=0}^{N} \exp\big((\log \pi_j + g_j)/\tau\big)}, \qquad g_i \sim \mathrm{Gumbel}(0, 1),$$

where $\pi$ are the path probabilities predicted by the router and $\tau$ is the temperature.
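In PyTorch this sampling is available out of the box; the snippet below is a small usage sketch in which router_logits stands in for the pre-softmax scores of a path controller (placeholder shapes and values).

```python
import torch
import torch.nn.functional as F

# Hard-routing sketch: sample a near-one-hot path selection from router
# logits with the Gumbel-Softmax trick. `router_logits` is a placeholder
# for the pre-softmax scores of a path controller.
router_logits = torch.randn(4, 3)  # (batch, num_paths), dummy values

# hard=True returns one-hot samples in the forward pass, while gradients
# flow through the soft (relaxed) samples via the straight-through trick.
alpha_hard = F.gumbel_softmax(router_logits, tau=1.0, hard=True, dim=-1)
print(alpha_hard)  # each row selects exactly one attention span
```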

四、Experiment

  • To validate TRAR, the authors apply it to visual question answering (VQA) and referring expression comprehension (REC). Extensive experiments were performed on five benchmark datasets: VQA2.0, CLEVR, RefCOCO, RefCOCO+, and RefCOCOg.

(一) Ablations

[Tables: ablation studies.]

(二) Comparison with SOTA

[Tables: comparison with state-of-the-art methods on the VQA and REC benchmarks.]

(三) Qualitative Analysis

[Figures: qualitative results.]

五、Conclusion

  • In this paper, the authors investigate dependency modeling within Transformers for two vision-language tasks: Visual Question Answering (VQA) and Referring Expression Comprehension (REC). These tasks typically demand visual attention at different spatial scales, which conventional Transformers cannot address well.
  • To address these challenges, the authors introduce a lightweight routing mechanism termed Transformer Routing (TRAR), which lets each sample dynamically adjust its attention span. TRAR recasts module selection as the selection of attention masks, thereby keeping the additional computation and memory overhead minimal.
  • To validate TRAR's efficacy, extensive experiments were conducted across five benchmark datasets. The empirical results demonstrate that TRAR significantly outperforms existing methods.
