Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework
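1 Basic paper information
Researchers from the School of Information Science and Technology, Qingdao University of Science and Technology, published this paper, "Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework", in the well-known international journal Speech Communication.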

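2 Abstract

2.1 Background
Given the complexity and variability of the relationship between speech and emotion, accurately interpreting the emotion carried by speech is both essential and very difficult.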
2.2 Methods and experiments
(1) Overall architecture
A novel method that combines a self-attention mechanism with a multi-scale fusion framework is proposed for multi-modal SER using speech and text information.
(2) Speech features
A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, and the multi-head self-attention layer makes the model focus on the features most related to the emotions.
(3) Text features
A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text.
(4) Feature fusion
A multi-scale fusion strategy, including feature-level fusion and decision-level fusion, is applied to improve the overall performance.
2.3 Results
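Experiments on the benchmark dataset IEMOCAP show that, compared with state-of-the-art strategies, the method achieves absolute improvements of 1.48% in weighted accuracy (WA) and 3.00% in unweighted accuracy (UA).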
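3 Introduction
3.1 Background

Emotion recognition is a key part of affective computing. It aims to analyze emotions from collected data such as human speech, speech segments, and facial expressions. Speech is a rich source of emotional information: it carries both linguistic and paralinguistic information and conveys implicit information such as emotion. SER is widely used in (1) human customer service (Lee and Narayanan, 2005), (2) distance education (Luo and Tan, 2007), and (3) in-car driving systems (Schuller et al., 2004). Studies show that humans tend to understand emotion through multiple modalities rather than a single one (Shimojo and Shams, 2001; Peng et al., 2021; Hossain and Muhammad, 2018). This paper therefore focuses on multi-modal SER.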
3.2 Feature extraction

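Feature extraction is one of the key steps in a multi-modal SER system; its goal is to obtain effective feature representations for different emotions.
For acoustic feature extraction, a wide range of feature sets has been developed on the basis of low-level descriptors (LLDs) (Demircan and Kahramanli, 2016; Gharavian et al., 2012; Song, 1949), including:
(1) Interspeech 2010 (Kayaoglu et al., 2015),
(2) GeMAPS (Eyben et al., 2017),
(3) AVEC-2013 (Schuller et al., 2013a),
(4) ComParE (Schuller et al., 2013b).
Studies show that emotion perception usually depends on the emotional information expressed over a certain period of time.
In recent years, high-level statistical functions (HSFs) have therefore been used to convert the extracted LLDs into utterance-level feature vectors.
However, all of the features mentioned above are hand-crafted and, to some extent, cannot effectively represent the temporal dynamics of speech (Wang et al., 2020; Schuller, 2018; Liu et al., 2017).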
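The step of collapsing frame-level LLDs into one fixed-length vector with statistical functions can be sketched as follows. This is a minimal illustration, not the pipeline of any of the feature sets listed above: the function name `utterance_functionals`, the choice of four statistics, and the random input are all assumptions.

```python
import numpy as np

def utterance_functionals(llds: np.ndarray) -> np.ndarray:
    """Collapse a (frames, n_llds) matrix into one fixed-length vector.

    Each statistic is taken over the time axis, so the output length
    depends only on the number of LLDs, not on the utterance duration.
    """
    stats = [
        llds.mean(axis=0),
        llds.std(axis=0),
        llds.min(axis=0),
        llds.max(axis=0),
    ]
    return np.concatenate(stats)

# Hypothetical input: 120 frames with 3 LLDs per frame.
rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 3))
vector = utterance_functionals(frames)  # 4 statistics x 3 LLDs = 12 values
```

Because the statistics discard the order of the frames, two utterances with the same frames in a different order yield the same vector, which is one way to see why such features struggle with temporal dynamics.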
3.3 Solution

To alleviate these limitations, researchers have developed temporal modeling approaches capable of capturing sequential dependencies, including:
(1) recurrent neural networks (RNN) (Schmidhuber, 2015),
(2) long short-term memory networks (LSTM) (Sepp et al., 1997), and
(3) gated recurrent units (GRU) (Chung et al., 2014).
However, these sequential models capture temporal information in the forward direction only; they ignore backward temporal relationships, which also carry meaningful information.
To address this challenge, recent advancements in bidirectional LSTM (BLSTM) architecture enable the integration of complementary information from both past and future contexts during feature extraction processes.
At the same time, modeling context-sensitive dependencies remains an active research topic in speech emotion recognition (SER).
In this area, Poria et al. (2017) proposed a contextual LSTM (sc-LSTM) network that models the context-dependent relationships among the utterances of a conversation.
I would argue that combining bidirectional and contextual information is a promising direction for improving SER performance further.
3.4 Textual features

For textual feature extraction in SER, traditional methods (Xu et al., 2019a) generally rely on labeled sets of emotional features as the basis for semantic analysis.
3.5 Attention mechanisms

Note that the methods mentioned above seldom distinguish between emotional and non-emotional frames in speech, which introduces interference for SER.
To address this issue, attention mechanisms (Vaswani et al., 2017) have been applied to focus on the emotionally relevant parts instead of the whole utterance.
Zhao et al. (2018a,b) proposed an attention-based BLSTM with fully convolutional networks (FCN) to automatically learn the best spatio-temporal representations of speech signals for deep spectrum feature extraction in SER tasks.
However, the information selected by the attention mechanism is the expectation of all input information under the attention distribution, which relies heavily on external information. Recently, Te et al. proposed a framework that combines a multi-task 3D CNN with a self-attention mechanism for SER, in which self-attention captures longer temporal dynamics than typical RNN-based models.
The self-attention mechanism focuses on the relationships among elements internally, which reduces the dependence on external information and captures relevant information among features. In addition, self-attention can be computed in parallel, which greatly improves computational efficiency. In this paper, we adopt the self-attention mechanism to focus on the salient features.
3.6 Feature Fusion

We adopted a fusion strategy to integrate textual and acoustic features, aiming to achieve effective classification for SER.
In feature-level fusion, features extracted by different models are combined into a single, more informative representation vector that captures the essential characteristics of the data.
In decision-level fusion, several classifiers are trained and their individual predictions are then combined into a final decision.
Research has shown that both approaches are equally important and mutually reinforcing in improving SER classification performance (Farhoudi and Setayeshi, 2020; Yao et al., 2020).
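The difference between the two fusion levels can be illustrated with a short sketch. It is a generic example under stated assumptions: the vector sizes and probabilities are made up, and plain averaging is used as a stand-in decision rule, which is not the Dempster–Shafer rule the paper adopts.

```python
import numpy as np

def feature_level_fusion(speech_vec, text_vec):
    """Concatenate per-modality feature vectors into one joint vector."""
    return np.concatenate([speech_vec, text_vec])

def decision_level_average(prob_list):
    """Average class probabilities from several classifiers (stand-in rule)."""
    fused = np.mean(np.stack(prob_list), axis=0)
    return int(np.argmax(fused)), fused

# Hypothetical per-modality representations.
speech_vec = np.zeros(128)
text_vec = np.ones(128)
joint = feature_level_fusion(speech_vec, text_vec)  # one 256-dim vector

# Hypothetical class probabilities from two classifiers over four emotions.
label, fused = decision_level_average([
    np.array([0.6, 0.2, 0.1, 0.1]),
    np.array([0.3, 0.4, 0.2, 0.1]),
])
```

Feature-level fusion happens before classification and produces one input for a joint classifier; decision-level fusion happens after classification and only sees each classifier's output.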
3.7 Contributions of this paper

The main contributions of this paper are summarized as follows:
(1) A self-attentional bc-LSTM network is proposed to capture both the utterance-level bidirectional and contextual information from speech.
(2) A self-attentional MCNN is proposed to extract general features and corpus-specific thematic features from text.
(3) A multi-scale fusion framework, including feature-level fusion by concatenation and decision-level fusion based on the Dempster–Shafer (DS) strategy, is proposed to integrate the results of three classifiers for recognizing different emotional states.
(4) Experimental results on the benchmark dataset IEMOCAP demonstrate that our method gains an absolute improvement of 1.48% and 3.00% over state-of-the-art strategies in terms of WA and UA, respectively.
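Dempster-Shafer (DS) here refers to evidence theory. For background on evidence theory, see the following two blog posts:
Blog 1: Basic concepts and reasoning process of D-S evidence theory
Blog 2: Study notes on D-S evidence theory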
4 Proposed method

4.1 Speech feature extraction
Extraction is based on the IS09 feature set; I suspect that IS09 is what the authors actually used.

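X_i denotes the set of feature vectors for all utterances in a video, L_i the number of utterances in that video, and k the dimension of each feature vector; X_{i,t} denotes the t-th utterance of the i-th video.

4.1.2 Self-attentional bc-LSTM
The self-attentional bc-LSTM sub-network consists of four main parts:
(1) acoustic features obtained with the Librosa toolkit serve as the input to the BLSTM layer;
(2) the BLSTM layer extracts utterance-level bidirectional contextual information;
(3) the attention mechanism identifies the emotion-relevant parts of the features;
(4) a fully connected layer produces a higher-level speech representation for the subsequent classification.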

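BLSTM layer: to cope with the complex variation of emotion in speech, the paper uses a BLSTM layer. Using a forward and a backward LSTM, it learns from the previously extracted acoustic features and captures utterance-level contextual information as well as long-term dependencies. At time step t, the t-th utterance is fed in and the state is updated as follows: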

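Here 𝑠, 𝑓, and 𝑜 denote the outputs of the input gate, forget gate, and output gate, respectively. 𝑐~ is the candidate value of the memory cell, and 𝑐 is the updated memory cell state. 𝑦 is the output of the LSTM cell. 𝑾 and 𝑼 are weight matrices, and 𝒃 is a bias. 𝜎 and tanh are the sigmoid and hyperbolic tangent functions. ◦ denotes the Hadamard product.
The Hadamard product is an element-wise matrix operation: if A=(aij) and B=(bij) are two matrices of the same size and cij=aij×bij, then C=(cij) is the Hadamard product of A and B, also called the element-wise product.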
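A small numeric example may help separate the Hadamard product from ordinary matrix multiplication. The matrices are arbitrary values chosen for illustration.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

# Hadamard product: entry (i, j) of C equals A[i, j] * B[i, j].
C = A * B

# Ordinary matrix product, shown only for contrast.
M = A @ B
```

In NumPy the `*` operator on arrays of the same shape is the element-wise product, while `@` is the matrix product, so `C` and `M` differ.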
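The bidirectional LSTM network is updated with the following formulas:

Here the LSTM computation follows Eqs. (1) to (6), and Eqs. (7) and (8) give the outputs of the forward and backward hidden-layer LSTM units, respectively. The model then applies a multi-head attention mechanism to improve performance.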
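For reference, Eqs. (1) to (8) can be written out as below. This is a sketch that assumes the standard LSTM formulation and reuses the symbols defined above; the input symbol x_t, the gate subscripts, and the exact numbering are assumptions and may differ from the paper.

```latex
\begin{align}
s_t &= \sigma(W_s x_t + U_s y_{t-1} + b_s) \tag{1}\\
f_t &= \sigma(W_f x_t + U_f y_{t-1} + b_f) \tag{2}\\
o_t &= \sigma(W_o x_t + U_o y_{t-1} + b_o) \tag{3}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c y_{t-1} + b_c) \tag{4}\\
c_t &= f_t \circ c_{t-1} + s_t \circ \tilde{c}_t \tag{5}\\
y_t &= o_t \circ \tanh(c_t) \tag{6}\\
\overrightarrow{y}_t &= \mathrm{LSTM}(x_t, \overrightarrow{y}_{t-1}) \tag{7}\\
\overleftarrow{y}_t &= \mathrm{LSTM}(x_t, \overleftarrow{y}_{t+1}) \tag{8}
\end{align}
```

The forward unit in (7) reads the utterances in order and depends on the previous step, while the backward unit in (8) reads them in reverse and depends on the following step.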

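The size of each head is set to 𝑑ℎ𝑒𝑎𝑑=𝑑𝑙𝑠𝑡𝑚∕𝑁, where 𝑑𝑙𝑠𝑡𝑚 is the number of hidden units of the BLSTM network and 𝑁 is the number of attention heads.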
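The head-size rule can be made concrete with a minimal NumPy sketch of multi-head self-attention. It assumes scaled dot-product attention; the projection matrices are random placeholders for learned weights, the final output projection is omitted, and the function name and toy sizes are illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(h, n_heads, rng):
    """h: (seq_len, d_lstm) sequence of BLSTM outputs.

    Returns the concatenated head outputs, shape (seq_len, d_lstm), and the
    attention weights, shape (n_heads, seq_len, seq_len).
    """
    seq_len, d_lstm = h.shape
    assert d_lstm % n_heads == 0, "d_lstm must be divisible by the number of heads"
    d_head = d_lstm // n_heads  # d_head = d_lstm / N

    # Random projections stand in for learned query/key/value weights.
    w_q, w_k, w_v = (
        rng.normal(scale=d_lstm ** -0.5, size=(d_lstm, d_lstm)) for _ in range(3)
    )

    def split(x):  # (seq_len, d_lstm) -> (n_heads, seq_len, d_head)
        return x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(h @ w_q), split(h @ w_k), split(h @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    heads = weights @ v                  # (n_heads, seq_len, d_head)
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_lstm)
    return out, weights

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))  # toy sizes: 6 utterances, d_lstm = 8
out, weights = multi_head_self_attention(h, n_heads=4, rng=rng)
```

With d_lstm = 8 and N = 4, each head works on 2 dimensions, and concatenating the heads restores the original width.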
4.2 Text feature extraction
4.2.1 Input textual embeddings
Textual modality is sourced from the transcription of spoken words.
Each utterance is represented by concatenating vectors derived from its constituent words.
Publicly accessible 300-dimensional word2vec vectors, sourced from Google News (Mikolov et al., 2013), are trained on a dataset comprising 100 billion words.
Word vector matrices serve as the input textual embeddings for the MCNN architecture.
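Building the word-vector matrix for one utterance can be sketched as follows. The lookup table `toy_vectors` is a hypothetical stand-in for the pre-trained word2vec vectors, and mapping out-of-vocabulary words to a zero vector is an assumption, not something the paper specifies.

```python
import numpy as np

def embed_utterance(tokens, vectors, dim=300):
    """Stack one word vector per token into a (len(tokens), dim) matrix."""
    rows = [vectors.get(tok, np.zeros(dim)) for tok in tokens]
    return np.stack(rows)

# Toy lookup table standing in for the pre-trained 300-dimensional vectors.
toy_vectors = {
    "out": np.full(300, 0.1),
    "of": np.full(300, 0.2),
    "control": np.full(300, 0.3),
}
matrix = embed_utterance(["out", "of", "control", "unknownword"], toy_vectors)
```

Each row of `matrix` is the vector of one word, so the matrix height varies with the utterance length while the width stays at 300.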
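4.2.2 Self-attentional MCNN

(The flowchart and the written description of text feature extraction differ slightly; personally, I think the attention layer should be placed before the convolutional layers.)
The self-attentional MCNN sub-network consists of four parts:
(1) Textual embeddings built from word vectors serve as the input to the embedding layers.
(2) Two embedding layers extract the general features and the thematic features of the current corpus.
(3) An attention layer finds the emotion-relevant parts of the features.
(4) Three parallel convolutional layers and pooling layers produce a higher-level textual representation.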
We propose a self-attentional multi-channel CNN (MCNN) to further extract textual features from the input textual embeddings. It consists of two embedding layers, an attention layer, three convolutional layers, and three pooling layers.

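The MCNN contains a static embedding layer and a dynamic embedding layer. The static embedding layer is kept fixed, whereas the dynamic embedding layer is fine-tuned; the two are treated as two parallel channels. In this design, the dynamic channel incorporates the thematic information of the current corpus and thereby complements the static channel. Both channels use the pre-trained word vectors as their initial weights.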

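Let n be the index of the n-th sentence, with e_i ∈ R^dcnn, where dcnn is the dimension of the word vectors. In the last part, the paper feeds the two features into the attention mechanism; in my view, they should be fed into the convolutional layers rather than the attention mechanism.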

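I also think there is an error here. The output of the embedding layers should be the input to the convolution; according to the flowchart, there is no attention mechanism before the convolution. The features produced by the embedding layers are fed into three different convolution kernels, with ReLU activations to avoid vanishing gradients, and the results are then passed to the attention mechanism.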

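The model uses multi-head self-attention as a core component to capture the dependencies among the words within a sentence. Three parallel pooling (down-sampling) layers then reduce the dimensionality to improve efficiency, and their outputs are combined into the final textual representation vector.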
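The convolution and pooling branch can be sketched in NumPy as below. This is a simplified illustration that leaves out the attention layer and the two embedding channels; the kernel widths (3, 4, 5), the number of filters, max-over-time pooling, and the random weights are all assumptions.

```python
import numpy as np

def conv1d_relu(x, kernel):
    """Valid 1-D convolution over the word axis followed by ReLU.

    x: (seq_len, d) word-vector matrix.
    kernel: (width, d, n_filters).
    Returns (seq_len - width + 1, n_filters).
    """
    width = kernel.shape[0]
    steps = x.shape[0] - width + 1
    out = np.stack([
        np.tensordot(x[t:t + width], kernel, axes=([0, 1], [0, 1]))
        for t in range(steps)
    ])
    return np.maximum(out, 0.0)

def multi_width_features(x, kernels):
    """Run parallel convolutions, max-pool each over time, and concatenate."""
    pooled = [conv1d_relu(x, k).max(axis=0) for k in kernels]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))  # toy sentence: 10 words, 16-dim vectors
kernels = [rng.normal(size=(w, 16, 8)) for w in (3, 4, 5)]  # assumed widths
features = multi_width_features(x, kernels)  # 3 branches x 8 filters = 24 values
```

Pooling over time makes each branch output independent of sentence length, which is what lets the three branches be concatenated into one fixed-size vector.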
4.3 Classifiers
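Three classifiers are involved: a text classifier, a speech classifier, and a bimodal speech-text classifier. Their final dimensions are 128, 128, and 64, respectively.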
4.4 Decision-level fusion layer based on DS
The DS strategy of belief functions, also known as evidence theory, is a well-established formalism for reasoning and making decisions under uncertainty.
It is based on representing independent pieces of evidence by completely monotone capacities and combining them using Dempster's rule. In the last two decades, the DS strategy has been widely applied to classifier fusion (Zhou et al., 2016).
In particular, the outputs of several classifiers are transformed into belief functions and fused by an appropriate combination rule (Liu et al., 2018). Therefore, the DS strategy is well suited to decision-level fusion of multiple classifiers.
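As shown in Fig. 1, the DS strategy is used to fuse the prediction results of the three classifiers at the decision level. The final prediction result m_{t,a,bi} is calculated as follows:


Here K is the normalization coefficient, and m_audio, m_text, and m_bi are the basic probability assignments, that is, the emotion probability outputs of the three classifiers. m_{t,a,bi} is the distribution of the fused prediction over the four emotion categories, and the category with the largest of the four probabilities is taken as the final predicted emotion.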
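The fusion step can be sketched in a few lines of Python. The sketch assumes that each classifier puts all of its mass on single emotion classes, in which case Dempster's rule reduces to a class-wise product divided by the normalization coefficient K; the function name and the example probabilities are hypothetical, and the paper's exact formula may be written differently.

```python
def dempster_combine(*masses):
    """Combine probability vectors over the same classes with Dempster's rule.

    Assumes every mass function assigns belief to single classes only, so the
    only non-conflicting combinations are those where all sources agree.
    """
    n_classes = len(masses[0])
    joint = [1.0] * n_classes
    for m in masses:
        if len(m) != n_classes:
            raise ValueError("all mass functions must cover the same classes")
        joint = [j * p for j, p in zip(joint, m)]
    k = sum(joint)  # normalization coefficient: total agreeing mass
    if k == 0:
        raise ValueError("the sources are in total conflict")
    return [j / k for j in joint]

# Hypothetical outputs of the speech, text, and bimodal classifiers
# over four emotion classes.
m_audio = [0.5, 0.2, 0.2, 0.1]
m_text = [0.4, 0.3, 0.2, 0.1]
m_bi = [0.6, 0.2, 0.1, 0.1]

fused = dempster_combine(m_audio, m_text, m_bi)
predicted = fused.index(max(fused))  # index of the most probable class
```

When the three classifiers agree on a class, the product sharpens the fused distribution towards it; a class that any single classifier rates near zero is effectively ruled out.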
5 Experiments
5.1 Dataset
Dataset: the four emotion categories of IEMOCAP:

5.2 Experimental settings
Optimizer: Adam
Learning rate: 0.001
Epochs: 100
Batch size: 256
Evaluation: UA & WA
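The two evaluation metrics can be computed as in the sketch below. It follows the convention commonly used in SER work, where WA is the overall accuracy and UA is the mean of the per-class recalls; the labels are made-up examples, and the paper's own evaluation code is not shown in these notes.

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Made-up labels: 2 angry and 6 neutral utterances, one angry one missed.
y_true = ["ang", "ang"] + ["neu"] * 6
y_pred = ["ang", "neu"] + ["neu"] * 6

wa = weighted_accuracy(y_true, y_pred)    # 7 of 8 correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recall 0.5 and recall 1.0
```

On imbalanced data the two can differ noticeably: here the single miss costs little in WA but halves the recall of the smaller class, which pulls UA down.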

5.3 Ablation study

The ablation of the text feature extraction module shows that the module makes a useful contribution.


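The paper compares the confusion matrices of these four methods. The two following ablation experiments also report four confusion matrices each, but since they add little, I have omitted them here.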

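The speech feature extraction module is compared under three variants: (1) without the self-attention mechanism; (2) without utterance-level contextual emotional features; (3) without the backward LSTM. The results show that self-attention is effective for emotion recognition, that utterance-level contextual features significantly improve recognition, and that modeling utterances in both directions improves prediction further.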

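Experiments are also carried out at the decision level with three variants: (1) without the bimodal classifier; (2) without the text information; (3) without the speech information. In the S1 setting, two classifiers are used together. The comparison confirms the benefit of feature-level fusion, that is, of fusing speech and text features. The decision-level comparison further confirms the advantage of the method, which makes its final decision by combining information from different levels.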
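5.4 Comparison with state-of-the-art methods

Compared with current state-of-the-art strategies, the proposed method achieves absolute improvements of 1.48% in WA and 3.00% in UA. These results indicate that the method delivers promising SER performance.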
6 Discussion
To investigate which patterns are crucial in the attention mechanisms of speech processing, we randomly sampled data points from a given batch and analyzed the mean values for each type of original feature and those activated by the self-attention mechanism using a Student's t-test.


As shown in Figure 7 and Table 7, the test results revealed significant differences for MFCC (p = 0.046), ΔMFCC (p = 0.049), F0 (p = 0.024), and ΔF0 (p = 0.028) between the original features and those activated by the self-attention mechanism at a significance level of α = 0.05, indicating that the self-attention mechanism primarily highlights MFCC, F0, and their first-order differential coefficients, which aligns with the findings in Schuller et al. (2009).
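The kind of comparison described here can be sketched with a two-sample Student's t statistic. The sketch uses only the standard library and assumes independent samples with equal variances; whether the paper used a paired or unpaired test is not stated in these notes, and the numbers are made up. Turning the statistic into a p-value additionally needs the t distribution, for example from a statistics package.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Returns the statistic and its degrees of freedom.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Made-up mean values of one feature type before and after self-attention.
original = [1.0, 2.0, 3.0, 4.0, 5.0]
activated = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = student_t(original, activated)
```

A statistic far from zero relative to the t distribution with `dof` degrees of freedom corresponds to a small p-value, which is then compared with α = 0.05.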

In case (a), the self-attention mechanism emphasizes the words 'out' and 'control' when processing an angry sentence. In case (b), the self-attention mechanism does not emphasize any word in a neutral sentence.

In case (c), the self-attention mechanism emphasizes the word 'worthless' in a sad sentence.
In case (d), the self-attention mechanism emphasizes the word 'romantic' in a happy sentence.
The observations from these examples all conform to our intuition.
The paper also discusses the method's shortcomings and possible directions for future improvement.
Despite the advantages described earlier, the approach suffers from high computational time complexity and slow convergence during training.
Its high memory requirement is another major drawback.

Improvements

7 Conclusion

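Afterword
This paper uses several techniques I had not come across before. The first is DS evidence theory for decision-level fusion. The second is the use of a t-test to analyze the effect of the LSTM model on the features. The third is the visualization of four samples from the IEMOCAP test set to explain how textual self-attention works. I am not yet familiar with the text modality, so the notes above may have gaps. I plan to study the text modality in more depth when I get the chance.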
