LLaSM: Large Language and Speech Model

Introduction

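Cascaded approaches use ASR to convert speech input into text, and this speech-to-text step loses information. This paper proposes LLaSM, a large speech-and-language model with cross-modal conversational abilities that can understand and follow both spoken and textual instructions. Following LLaVA, it builds on a pretrained speech modality encoder and a large language model. Whisper serves as the speech encoder and converts the speech signal into embeddings; a modal adaptor then learns to align these speech embeddings with the LLM's input text embeddings. The speech and text embeddings are concatenated into an interleaved sequence, which is fed to the LLM for fine-tuning.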
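The data flow described above (speech embeddings, then adaptor, then concatenation with text embeddings) can be sketched in a few lines. This is a toy illustration only: the post does not say what the adaptor looks like internally, so a single linear projection is assumed here, and `project`, `build_input_sequence`, and all the numbers are hypothetical.

```python
from typing import List

Vector = List[float]


def project(audio_embedding: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Assumed linear adaptor: map one audio embedding into the LLM embedding space."""
    return [
        sum(w * x for w, x in zip(row, audio_embedding)) + b
        for row, b in zip(weight, bias)
    ]


def build_input_sequence(
    prefix_text: List[Vector],
    audio: List[Vector],
    suffix_text: List[Vector],
    weight: List[Vector],
    bias: Vector,
) -> List[Vector]:
    """Concatenate text embeddings and adapted audio embeddings into one sequence."""
    adapted = [project(a, weight, bias) for a in audio]
    return prefix_text + adapted + suffix_text


# Toy sizes: audio embeddings have 3 dims, LLM embeddings have 2 dims.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # shape (llm_dim, audio_dim)
bias = [0.0, 0.5]
prefix = [[0.1, 0.2]]                 # stands in for embedded instruction text
suffix = [[0.3, 0.4]]                 # stands in for embedded trailing text
audio = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # stands in for speech encoder output

sequence = build_input_sequence(prefix, audio, suffix, weight, bias)
```

After projection every element of `sequence` has the LLM's embedding width, which is what lets audio and text positions sit in the same input sequence.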

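Training has two stages. In the first stage, modality adaptation pretraining uses public ASR datasets: the speech encoder and the LLM are frozen, and only the modal adaptor is trained to align speech and text embeddings, so only a small fraction of the parameters is updated. In the second stage, cross-modal instruction fine-tuning uses cross-modal instruction data: the speech encoder stays frozen while the modal adaptor and the LLM are updated. The instruction data is the LLaSM-Audio-Instructions dataset. Its dialogues are selected from GPT4-LLM, ShareGPT, and WizardLM, and text-to-speech is used to generate a large amount of conversational audio. In total it contains 199,000 conversations, with 80,000 Chinese audio samples and 428,000 English audio samples.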
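The freezing schedule of the two stages can be written down as a small lookup table. The sketch below only encodes which components are trainable per stage, as described above; the stage keys, the helper name, and the parameter counts in the usage are made-up placeholders, not figures from the paper.

```python
from typing import Dict

# Which components receive gradient updates in each stage.
TRAINABLE: Dict[str, Dict[str, bool]] = {
    "stage1_modality_adaptation": {
        "speech_encoder": False,
        "modal_adaptor": True,
        "llm": False,
    },
    "stage2_cross_modal_instruction": {
        "speech_encoder": False,
        "modal_adaptor": True,
        "llm": True,
    },
}


def trainable_parameter_count(stage: str, param_counts: Dict[str, int]) -> int:
    """Sum the parameters of every component that is unfrozen in the given stage."""
    return sum(n for name, n in param_counts.items() if TRAINABLE[stage][name])


# Placeholder sizes, chosen only to make the contrast between stages visible.
toy_counts = {"speech_encoder": 300, "modal_adaptor": 5, "llm": 7000}
stage1 = trainable_parameter_count("stage1_modality_adaptation", toy_counts)
stage2 = trainable_parameter_count("stage2_cross_modal_instruction", toy_counts)
```

In a real training script the same table would typically drive which parameters have gradients enabled; that wiring depends on the framework and is left out here.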

2. Approach

2.1 Model

Whisper encodes the audio, and Chinese-LLAMA2-7B serves as the LLM.

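Pretraining stage. In this stage the modality encoder and the LLM stay frozen. So that the LLM can understand the audio embeddings produced by the modality encoder, the modal adaptor is trained on public speech recognition data to align text and audio embeddings. Each ASR data sample (audio, text) is reformatted as (simple instruction, audio, text).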

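The pretraining multimodal sequences share a unified format: each data sample is formatted as X_sample, and the audio patch embeddings in the text sequence are then replaced with the audio embeddings from the modal adaptor. The training objective is to predict the text label of each data sample.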
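As a rough illustration of turning an ASR pair into the (simple instruction, audio, text) triple, the sketch below builds one pretraining sample. The instruction strings, the `<audio>` slot marker, the field names, and the file name are all hypothetical; the post does not give the exact template.

```python
import random
from typing import Dict, List

# Hypothetical pool of simple instructions; the real wording is not given in the post.
INSTRUCTIONS: List[str] = [
    "Transcribe the following speech into text.",
    "Write down what is said in this audio.",
]
AUDIO_SLOT = "<audio>"  # hypothetical marker for where audio embeddings are spliced in


def format_pretrain_sample(
    audio_path: str, transcript: str, rng: random.Random
) -> Dict[str, str]:
    """Turn an (audio, text) ASR pair into an (instruction, audio, text) sample."""
    instruction = rng.choice(INSTRUCTIONS)
    return {
        "input": f"{instruction} {AUDIO_SLOT}",  # slot is later replaced by audio embeddings
        "audio": audio_path,
        "label": transcript,                     # training target: the text label
    }


sample = format_pretrain_sample("clip_0001.wav", "hello world", random.Random(0))
```

Only `label` contributes to the training target; the instruction and the audio slot are context.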

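Cross-modal instruction fine-tuning. In this stage only the modality encoder is frozen, and the modal adaptor and the LLM are trained. The human questions are converted to audio with the Microsoft Azure text-to-speech API, and the training objective is to predict the chatbot's responses. One question-answer round is processed into one multimodal sequence X_sample, and multi-round exchanges are joined with the EOS token; the paper illustrates this layout with a figure.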
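One common way to realise "predict only the chatbot's responses" over EOS-joined rounds is to mask the question positions out of the loss. The sketch below shows that idea on integer token ids. The ignore value -100 is an assumption borrowed from common practice (it is the default `ignore_index` of PyTorch's cross-entropy loss), and `build_dialogue` and the token ids are hypothetical; the post does not describe the masking details.

```python
from typing import List, Tuple

IGNORE = -100  # assumed label value for positions excluded from the loss


def build_dialogue(
    turns: List[Tuple[List[int], List[int]]], eos_id: int
) -> Tuple[List[int], List[int]]:
    """Join (question, answer) rounds with EOS and mask questions out of the labels."""
    input_ids: List[int] = []
    labels: List[int] = []
    for question_ids, answer_ids in turns:
        # Question positions (which would hold the audio slot) carry no loss.
        input_ids += question_ids
        labels += [IGNORE] * len(question_ids)
        # Answer positions and the closing EOS are the prediction targets.
        input_ids += answer_ids + [eos_id]
        labels += answer_ids + [eos_id]
    return input_ids, labels


# Two rounds with toy token ids; 0 plays the role of EOS.
ids, labels = build_dialogue([([1, 2], [3]), ([4], [5, 6])], eos_id=0)
```

Here `ids` and `labels` have the same length, and every label that is not `IGNORE` belongs to a response or its terminating EOS.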

2.2 Data
