
Linformer: Self-Attention with Linear Complexity



FAIR, 2020


Abstract

The standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to the sequence length $n$. This paper demonstrates that the self-attention matrix can be approximated by a low-rank matrix, and proposes a new self-attention mechanism that introduces two projection matrices, reducing the overall complexity to $O(n)$ while keeping performance close to that of standard self-attention.

Introduction

Main problem: the $O(n^2)$ complexity of self-attention in the standard Transformer.

Goal: find a new attention operation that avoids the quadratic cost.

Prior work:
[Figure: overview of prior work on efficient attention and the corresponding complexities]

Background

Denote the matrix $P$ as follows:
$$P = \mathrm{softmax}\!\left(\frac{Q W_i^Q \left(K W_i^K\right)^{\top}}{\sqrt{d_k}}\right) \in \mathbb{R}^{n \times n}$$

The Transformer uses $P$ to capture the input context for a given token; in other words, $P$ is the attention weight matrix. Computing $P$, however, costs quadratic time and space in the sequence length.
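To make the quadratic cost concrete, here is a minimal single-head sketch in PyTorch (the values of $n$ and $d$ are illustrative, not from the paper): materializing the $n \times n$ matrix $P$ is exactly the $O(n^2)$ bottleneck.

```python
import torch
import torch.nn.functional as F

n, d = 512, 64                 # illustrative sequence length and head dimension
Q = torch.randn(n, d)          # Q W_i^Q
K = torch.randn(n, d)          # K W_i^K
V = torch.randn(n, d)          # V W_i^V

# P is an n x n matrix: O(n^2) time and memory in the sequence length.
P = F.softmax(Q @ K.T / d ** 0.5, dim=-1)
context = P @ V                # n x d context embeddings
print(P.shape, context.shape)  # torch.Size([512, 512]) torch.Size([512, 64])
```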

Self-Attention is Low Rank

This paper demonstrates, both theoretically and empirically, that self-attention is low-rank, i.e., the attention weight matrix $P$ is low-rank. The experimental result is shown below:
[Figure: spectrum analysis of $P$ — left: cumulative normalized singular values of attention matrices; right: per-layer/per-head heatmap of the cumulative value at the 128th singular value]

In the left plot above, we can see a clear long-tail spectrum distribution. This implies that most of the information in $P$ can be recovered from the first few largest singular values. The right plot makes the phenomenon even more apparent: in higher layers, more information is concentrated in the largest singular values, so the rank of $P$ is lower.
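A rough sketch of how such a spectrum analysis could be reproduced for a single attention matrix; random inputs stand in here for the activations of a pretrained model, so the actual spectrum shape will differ from the paper's plots.

```python
import torch
import torch.nn.functional as F

n, d, k = 512, 64, 128
Q, K = torch.randn(n, d), torch.randn(n, d)
P = F.softmax(Q @ K.T / d ** 0.5, dim=-1)

# Normalized cumulative singular value spectrum of P.
sigma = torch.linalg.svdvals(P)                        # singular values, descending
cumulative = torch.cumsum(sigma, dim=0) / sigma.sum()
print(f"fraction of the spectrum captured by the top {k} singular values: "
      f"{cumulative[k - 1].item():.3f}")
```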

Given this property of the attention weight matrix $P$, we could directly use singular value decomposition (SVD) to approximate $P$ with a low-rank matrix $P_{\mathrm{low}}$, as follows:
$$P \approx P_{\mathrm{low}} = \sum_{i=1}^{k} \sigma_i u_i v_i^{\top},$$

where $\sigma_i$, $u_i$, $v_i$ are the $k$ largest singular values and the corresponding singular vectors.

Using the first $k$ singular values and the corresponding singular vectors approximates $P$ within a small error $\epsilon$. However, this approach requires performing an SVD in every attention computation, which adds extra complexity. The paper therefore proposes another approach that uses learned projection matrices to avoid this overhead.
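For comparison, a sketch of the direct SVD route: truncate $P$ to its top $k$ singular components and check the approximation error. The decomposition itself is the added cost (roughly cubic in $n$ for a full SVD) that the learned projections are meant to avoid.

```python
import torch
import torch.nn.functional as F

n, d, k = 512, 64, 128
P = F.softmax(torch.randn(n, d) @ torch.randn(d, n) / d ** 0.5, dim=-1)

# Rank-k truncation of P via SVD -- accurate, but the SVD must be recomputed
# for every attention matrix, which is what Linformer avoids.
U, S, Vh = torch.linalg.svd(P)
P_low = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

rel_err = torch.linalg.norm(P - P_low) / torch.linalg.norm(P)
print(f"relative Frobenius error of the rank-{k} approximation: {rel_err.item():.4f}")
```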

Model

[Figure: Linformer architecture (left) and the dimension changes introduced by the projections $E_i$, $F_i$ (bottom right)]

The main idea is to add two linear projection matrices $E_i$, $F_i$ ($n \times k$) when computing the keys and values.

The bottom right of the figure above shows how the dimensions change under the new mechanism. The context embedding for each $\mathrm{head}_i$ is computed as follows:
$$\overline{\mathrm{head}_i} = \mathrm{Attention}\!\left(Q W_i^Q,\ E_i K W_i^K,\ F_i V W_i^V\right) = \underbrace{\mathrm{softmax}\!\left(\frac{Q W_i^Q \left(E_i K W_i^K\right)^{\top}}{\sqrt{d_k}}\right)}_{\bar{P}:\ n \times k} \cdot \underbrace{F_i V W_i^V}_{k \times d}$$

The dimension of the resulting context embedding is still $n \times d$, but the time and space cost is now linear in $n$. The paper also gives mathematical arguments showing that this mechanism reliably approximates standard self-attention.
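A minimal single-head sketch of the projected attention above. Here the projections are stored with shape $k \times n$ so they can left-multiply the $n \times d$ key and value layers, matching the formula; the dimensions and the random initialization are my own illustrative choices.

```python
import torch
import torch.nn.functional as F

n, d, k = 512, 64, 128
Q = torch.randn(n, d)                  # Q W_i^Q
K = torch.randn(n, d)                  # K W_i^K
V = torch.randn(n, d)                  # V W_i^V
E = torch.randn(k, n) / n ** 0.5       # key projection (random here for illustration)
Fproj = torch.randn(k, n) / n ** 0.5   # value projection

K_bar = E @ K                          # k x d projected keys
V_bar = Fproj @ V                      # k x d projected values

P_bar = F.softmax(Q @ K_bar.T / d ** 0.5, dim=-1)  # n x k instead of n x n
context = P_bar @ V_bar                            # still n x d
print(P_bar.shape, context.shape)      # torch.Size([512, 128]) torch.Size([512, 64])
```

For a fixed $k$, both the $n \times k$ score matrix and the $k \times d$ projected values grow only linearly with $n$.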

The paper suggests that several additional techniques can be introduced on top of Linformer to further optimize both performance and efficiency:

Parameter sharing between projections:

Headwise sharing (share $E_i = E$ and $F_i = F$ across all heads in a layer), key-value sharing (additionally set $E_i = F_i = E$), and layerwise sharing (use a single projection $E$ across all layers, heads, keys, and values); a minimal sketch of these sharing options follows this list.

Nonuniform projected dimension:

​ We can choose a smaller projected dimension k for higher layers.

General projections:

One can also choose other kinds of low-dimensional projection methods instead of a simple linear projection, e.g., mean/max pooling, or a convolution with kernel size and stride of length $n/k$.
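A rough sketch (module and names are my own, not from the paper's released code) of how the sharing options could be wired: with key-value plus layerwise sharing, a single $k \times n$ projection is created once and reused for keys and values in every head and layer.

```python
import torch
import torch.nn as nn

class SharedProjection(nn.Module):
    """A single k x n projection reused for keys and values across all heads
    and layers (key-value + layerwise sharing). Headwise-only sharing would
    instead create one instance per layer; no sharing, one per head."""

    def __init__(self, n: int, k: int):
        super().__init__()
        self.proj = nn.Parameter(torch.randn(k, n) / n ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n, d) key or value layer -> (k, d) projected layer
        return self.proj @ x

n, k, d = 512, 128, 64
shared = SharedProjection(n, k)
K, V = torch.randn(n, d), torch.randn(n, d)
print(shared(K).shape, shared(V).shape)  # torch.Size([128, 64]) torch.Size([128, 64])
```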

Experiments

Pretraining

 * Compared with RoBERTa, which is based on the standard Transformer.
 * Dataset: BookCorpus plus English Wikipedia
 * Task: masked language modeling
 * Measure: perplexity
 * Device: 64 Tesla V100 GPUs with 250k updates
[Figure: pretraining validation perplexity — (a) n = 512 with varying k, (b) n = 1024 with varying k, (c) projection-sharing variants, (d) varying sequence length n with fixed k]

The projected dimension $k$ is effective: performance is already close to the original Transformer with $k = 128$ for $n = 512$ and $k = 256$ for $n = 1024$ ((a) and (b)).

We can reduce the number of additional parameters, because the performance of the shared-projection models almost matches that of the non-shared model ((c)).

As the sequence length increases, even though the projected dimension is fixed, the final perplexities after convergence remain about the same ((d)).

Downstream Results

[Table: downstream task results of Linformer compared with RoBERTa]

The performance of Linformer is close to that of the original Transformer, and it is mainly determined by the projected dimension $k$ rather than the ratio $n/k$.

Inference-time Efficiency Results

[Table: inference-time speed-up and memory savings of Linformer over the Transformer for varying $n$ and $k$]

As the sequence length increases, the inference-time speed-up and memory savings become even more dramatic.
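A rough way to observe this trend yourself (CPU timings on random data; absolute numbers will vary by hardware, but the gap should widen with $n$):

```python
import time
import torch
import torch.nn.functional as F

def standard_attention(Q, K, V):
    # n x n attention weights: quadratic in sequence length
    return F.softmax(Q @ K.T / Q.shape[-1] ** 0.5, dim=-1) @ V

def linformer_attention(Q, K, V, E, Fp):
    # n x k attention weights: linear in sequence length for fixed k
    return F.softmax(Q @ (E @ K).T / Q.shape[-1] ** 0.5, dim=-1) @ (Fp @ V)

d, k = 64, 256
for n in (1024, 4096, 8192):
    Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
    E = torch.randn(k, n) / n ** 0.5
    Fp = torch.randn(k, n) / n ** 0.5
    t0 = time.time()
    standard_attention(Q, K, V)
    t1 = time.time()
    linformer_attention(Q, K, V, E, Fp)
    t2 = time.time()
    print(f"n={n}: standard {t1 - t0:.3f}s, linformer {t2 - t1:.3f}s")
```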
