A Survey of Large Language Models (LLMs)
Contents
Language Modeling
GPT: Generative Pre-training of Transformer Models on Large-Scale Corpora
The Four Stages in the Development of Language Modeling
1. Statistical language models:
2. Neural network language models:
3. Deep neural network language models:
4. Large language models:
AI Understanding of the Paper: KEY POINTS
RLHF: Reinforcement Learning from Human Feedback
In-Context Learning vs. Chain-of-Thought Prompting

Language Modeling
Language modeling has been studied extensively over the past two decades as a principal approach to language intelligence, and large-scale language models trained on massive corpora have recently been developed. These models not only deliver significant performance gains but also exhibit special abilities that smaller models lack; to mark this difference in scale, the research community has coined the term large language model (LLM). Building AI algorithms that can understand and generate language remains a major challenge, and over this period language modeling has evolved from statistical language models to neural language models.
GPT: Generative Pre-training of Transformer Models on Large-Scale Corpora
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models on large-scale corpora, and they show strong capabilities in solving various natural language processing (NLP) tasks.
Researchers found that scaling up the model improves performance, and they have further studied the scaling effect by increasing the model size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve significant performance gains but also exhibit special abilities that are absent in small-scale language models. To distinguish models by parameter scale, the research community has coined the term large language model (LLM) for PLMs of significant size.


Recently, research on LLMs has advanced rapidly in both academia and industry, and the launch of ChatGPT has attracted widespread public attention. The technical evolution of LLMs is having an important impact on the entire AI community and will change how we develop and use AI algorithms. This survey reviews recent progress on LLMs, introducing the background, key findings, and mainstream techniques. In particular, it focuses on four major aspects of LLMs: pre-training, adaptation tuning, utilization, and capability evaluation. It also summarizes available resources for developing LLMs and discusses remaining issues and future directions.
The Four Stages in the Development of Language Modeling

In natural language processing, language modeling (LM) is essential for handling text data and for speech recognition. It serves as a foundation for training algorithms in speech recognition, natural language processing, machine translation, and other areas, and its development can be divided into the following four stages:
1. Statistical language models:
From the 1980s to the 1990s, the earliest stage of LM was based on traditional n-gram models, trained with frequency-based statistical methods such as Katz backoff and Kneser-Ney smoothing.
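To make the Markov assumption behind n-gram models concrete, the factorization below is a standard textbook formulation rather than notation taken from the survey; smoothing methods such as Katz backoff and Kneser-Ney adjust the maximum-likelihood estimate when counts are sparse or zero.

```latex
% Chain rule over a word sequence, then the n-gram (Markov) approximation
P(w_1, \ldots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
               \;\approx\; \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})

% Maximum-likelihood estimate for a bigram model (n = 2);
% this is the quantity that smoothing strategies correct for unseen pairs
P_{\mathrm{MLE}}(w_t \mid w_{t-1}) \;=\; \frac{\mathrm{count}(w_{t-1}, w_t)}{\mathrm{count}(w_{t-1})}
```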
2. Neural network language models:
From the 2000s through the 2010s, LM shifted from traditional n-gram models to neural network-based language models. Representative models include the Neural Probabilistic Language Model (NPLM) based on Bengio's model and the recurrent neural network language model (RNNLM).
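As a minimal, hypothetical sketch (not the Bengio NPLM or any specific RNNLM from the literature), the PyTorch snippet below shows the basic shape of a neural language model: an embedding layer, a recurrent layer, and a linear layer producing next-word logits over the vocabulary. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRNNLM(nn.Module):
    """A toy recurrent neural language model: embed -> GRU -> vocabulary logits."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq, embed_dim)
        h, _ = self.rnn(x)          # (batch, seq, hidden_dim)
        return self.out(h)          # (batch, seq, vocab_size) logits

# Train by predicting the next token at every position of a (fake) batch.
model = TinyRNNLM()
tokens = torch.randint(0, 1000, (2, 16))          # random token ids as stand-in data
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```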
3. Deep neural network language models:
In the late 2010s, with the spread of deep learning, attention turned to deeper neural networks, and hierarchical architectures (for example, deep recurrent networks) became widely used in language modeling. Deep contextualized word representations (ELMo) and the Transformer-based Generative Pre-trained Transformer (GPT) series are representative models of this stage.
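The sketch below illustrates what "contextualized" word representations mean, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; ELMo itself uses a biLSTM rather than a Transformer, so this is only an analogous example of context-dependent embeddings.

```python
# Contextual word representations from a pre-trained Transformer encoder.
# Assumes the `transformers` library and the public `bert-base-uncased` checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised interest rates.", "They sat on the river bank."]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, seq_len, hidden_size)

# The same surface word "bank" receives a different vector in each sentence,
# unlike a static embedding table; that is what "context-aware" means here.
print(hidden.shape)
```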
4. Large language models:
From the late 2010s to the present, with growing compute power and larger datasets for deep neural networks, researchers have focused on training ever larger and more capable language models, such as the Transformer-based Generative Pre-trained Transformer 3 (GPT-3) released by OpenAI in 2020. Future directions may include lightweight language models, unsupervised-learning language models, and related approaches.
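To show how such generative models are typically used, here is a small, hedged example of prompting a causal language model. GPT-3 itself is only accessible through OpenAI's API, so the public gpt2 checkpoint (via the Hugging Face transformers pipeline) stands in purely for illustration; prompt and settings are arbitrary.

```python
# Continue a prompt with a GPT-style (causal) language model.
# `gpt2` is used only as a small public stand-in for much larger LLMs.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "A language model assigns probabilities to"
result = generator(prompt, max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"])
```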
The new models and algorithms from these stages have significantly improved the accuracy and reliability of language modeling and have brought better results to many applications such as natural language processing.


AI Understanding of the Paper: KEY POINTS
https://static.aminer.cn/upload/pdf/1707/999/404/642a43bc90e50fcafd9b1555_0.pdf
- 1. Language is a prominent human ability for expression and communication; it begins to develop in early childhood and keeps improving over a lifetime.
2. Machines cannot understand and communicate in human language the way humans do unless they are equipped with powerful artificial intelligence (AI) techniques.
3. Achieving this goal, namely enabling machines to read, write, and communicate like humans, has long been a research challenge.
4. Technically, language modeling (LM) is one of the major approaches to improving the language intelligence of machines.
5. LM aims to model the generative probability of word sequences so as to predict the probability of future (or missing) words.
6. Research on LM has received extensive attention in the literature and can be roughly divided into four major development stages.
7. • Statistical language models (SLMs). SLMs arose from statistical learning methods and were established in the 1990s.
8. The basic idea is to build a word prediction model based on the Markov assumption, e.g., predicting the next word from the most recent context.
9. SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models.
10. SLMs have been widely applied in information retrieval (IR) and natural language processing (NLP) tasks to improve task performance.
11. However, they often suffer from the curse of dimensionality: accurate estimation requires large amounts of data, because an exponential number of transition probabilities must be estimated.
12. Therefore, specially designed smoothing strategies, such as backoff estimation and Good-Turing estimation, are used to alleviate the data sparsity problem.
13. • Neural language models (NLMs). NLMs use neural networks, such as recurrent neural networks (RNNs), to model the probability of word sequences.
14. An important contribution was the concept of distributed word representations proposed in [15], which models context representations by aggregating the related distributed word vectors.
15. By extending the idea of learning effective features, general neural network solutions were developed, establishing a unified approach to various NLP tasks.
16. In addition, word2vec was proposed, building a simple shallow neural network to learn distributed word representations.
17. These studies pioneered the use of language models for representation learning (beyond word sequence modeling) and had a major impact on the field of NLP.
18. • Pre-trained language models (PLMs). As an early attempt, ELMo proposed first pre-training a bidirectional long short-term memory (biLSTM) network to capture context-sensitive word representations.
19. Compared with learning fixed word representations, this approach can be better adapted to specific downstream tasks.
20. Further, BERT, based on the Transformer architecture with self-attention, pre-trains bidirectional language models to capture contextual information from large-scale unsupervised corpora.
21. These pre-trained context-sensitive word representations are very effective as general-purpose semantic features and have significantly improved performance on a wide range of NLP tasks.
22. This work inspired many follow-up studies and established the "pre-training and fine-tuning" learning paradigm.
23. Under this paradigm, many studies have developed different pre-trained language models, such as GPT-2 and BART.
24. Under this paradigm, a pre-trained language model usually needs to be fine-tuned to adapt it to different downstream tasks (a minimal fine-tuning sketch follows this list).
25. • Large language models (LLMs). Scaling up PLMs (e.g., GPT-3) not only brings significant performance gains but also gives rise to special abilities that smaller models do not have.
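As referenced in point 24 above, the following is a minimal sketch of the "pre-training and fine-tuning" paradigm, assuming the Hugging Face transformers library; the checkpoint, toy data, and hyperparameters are illustrative placeholders, not details from the survey.

```python
# Minimal "pre-train and fine-tune" sketch: reuse pre-trained weights,
# attach a new task head, and update on a tiny labeled downstream dataset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # pre-trained encoder + fresh classifier head

# Toy labeled examples for the downstream task (sentiment classification here).
texts = ["a wonderful, moving film", "a dull and lifeless script"]
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# One fine-tuning step: only the small task dataset drives the weight updates.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```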
METHOD
- 1. Language is a prominent ability of human beings for expression and communication, which develops in early childhood and evolves over a lifetime.
2. For machines, they cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms.
3. To achieve this goal, it has been a longstanding research challenge that enables machines to read, write, and communicate like humans.
4. Technically, language modeling (LM) is one of the major approaches to advancing language intelligence of machines.
5. LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens.
6. The research of LM has received extensive research attention in the literature, which can be roughly divided into four major development stages.
7. • Statistical language models (SLM). SLMs are developed based on statistical learning methods that rose in the 1990s.
8. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context.
9. SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models.
10. SLMs have been widely applied to enhance task performance in information retrieval (IR) and natural language processing (NLP).
11. However, they often suffer from the curse of dimensionality: it is difficult to accurately estimate high-order language models since an exponential number of transition probabilities need to be estimated.
12. Therefore, specially designed smoothing strategies such as backoff estimation and Good-Turing estimation have been introduced to alleviate the data sparsity problem.
13. • Neural language models (NLM). NLMs characterize the probability of word sequences by neural networks, e.g., recurrent neural networks (RNNs).
14. As a remarkable contribution, the work in [15] introduced the concept of distributed representation of words and modeled the context representation by aggregating the related distributed word vectors.
15. By extending the idea of learning effective features for words or sentences, a general neural network approach was developed to build a unified solution for various NLP tasks.
16. Further, word2vec was proposed to build a simplified shallow neural network for learning distributed word representations.
17. These studies initiate the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.
18. • Pre-trained language models (PLM). As an early attempt, ELMo was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network
19. Instead of learning fixed word representations, the pre-trained biLSTM network is then fine-tuned according to specific downstream tasks.
20. Further, based on the highly parallelizable Transformer architecture with self-attention mechanisms, BERT was proposed by pretraining bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora.
21. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks.
22. This work has inspired a large number of follow-up work, which sets the "pre-training and fine-tuning" learning paradigm.
23. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures (e.g., GPT-2 and BART) or improved pre-training strategies.
24. In this paradigm, it often requires fine-tuning the PLM for adapting to different downstream tasks.
25. • Large language models (LLMs). Researchers find that scaling up PLMs not only leads to significant performance gains on downstream tasks but also gives rise to special abilities that are absent in smaller models.
RESULT
- The following points are summarized from the paper:
1. Language is a prominent human ability for expression and communication; it develops in early childhood and evolves over a lifetime.
2. Machines cannot naturally understand and communicate in human language without powerful AI algorithms.
3. Achieving this goal, enabling machines to read, write, and communicate like humans, has been a long-standing research challenge.
4. Language modeling (LM) is one of the major approaches to improving the language intelligence of machines. LM aims to model the generative probability of word sequences so as to predict the probability of future (or missing) words.
5. Research on LM has received extensive attention in the literature and can be roughly divided into four major development stages.
6. • Statistical language models (SLMs). SLMs arose from statistical learning methods and developed in the 1990s. The basic idea is to build a word prediction model based on the Markov assumption, e.g., predicting the next word from the most recent context.
7. SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models.
8. SLMs are widely applied in information retrieval (IR) and natural language processing (NLP) tasks.
9. However, they often face the curse of dimensionality: an exponential number of transition probabilities must be estimated, so accurately estimating high-order language models is difficult.
10. Therefore, specially designed smoothing strategies, such as backoff estimation and Good-Turing estimation, are used to alleviate the data sparsity problem.
11. • Neural language models (NLMs). NLMs use neural networks, such as recurrent neural networks (RNNs), to model the probability of word sequences.
12. As an important contribution, the work in [15] introduced the concept of distributed word representations and modeled context representations with the related distributed word vectors.
13. By extending the idea of learning effective features for words or sentences, a general neural network solution was developed, establishing a unified approach to various NLP tasks.
14. word2vec proposed a simple shallow neural network for learning distributed word representations.
15. These studies opened a new direction of using language models for representation learning (beyond word sequence modeling) and had an important impact on the NLP field.
16. • Pre-trained language models (PLMs). As an early attempt, ELMo proposed context-aware word representations: instead of learning fixed word representations, a bidirectional long short-term memory (biLSTM) network is first pre-trained and then fine-tuned for specific downstream tasks.
17. BERT is based on the highly parallelizable Transformer architecture and is pre-trained as a bidirectional language model with specially designed pre-training tasks, yielding strong text representations.
18. These pre-trained context-aware word representations are very useful and have greatly improved the performance of NLP tasks.
A Survey of Large Language Models - AMiner
RLHF: Reinforcement Learning from Human Feedback

In-Context Learning vs. Chain-of-Thought Prompting



