
"Cross-Modal Dynamic Transfer Learning for Multimodal Emotion Recognition"


Multi-modal Paper Reading Series


Table of Contents

  • Multi-modal Paper Reading Series
  • 1. ABSTRACT
  • 2. INDEX TERMS
  • 3. INTRODUCTION
  • 4. RELATED WORKS
    • A. MULTIMODAL EMOTION RECOGNITION
      • 1) CONVENTIONAL FUSION METHODS
      • 2) TRANSFORMER-BASED FUSION METHODS
      • 3) REPRESENTATION LEARNING-BASED FUSION METHODS
    • B. CROSS-MODALITY KNOWLEDGE TRANSFER
  • 5. METHODOLOGY
    • A. UNIMODAL FEATURE EXTRACTION
    • B. PROPOSED FUSION METHOD
      • 1) STAGE 1: MISALIGNED MODALITY FILTERING
      • 2) STAGE 2: KNOWLEDGE TRANSFER FROM ALIGNED MODALITY
      • 3) TOTAL LOSS
      • 4) INFERENCE
  • 6. EXPERIMENTAL SETTING
    • A. DATASET
    • B. EVALUATION METRICS
    • C. BASELINES
    • D. IMPLEMENTATION DETAILS
  • 7. RESULTS AND DISCUSSION
    • A. OVERALL PERFORMANCE (RQ1)
    • B. MODALITY CONFIDENCE CALIBRATION ANALYSIS (RQ2)
    • C. CONFIDENCE NETWORK STUDY (RQ3)
    • D. QUALITATIVE RESULT ANALYSIS (RQ4)
  • 8. CONCLUSION



1. ABSTRACT

Multimodal emotion recognition is an important research field, especially in the context of video platforms. Most existing models focus on developing sophisticated fusion techniques to integrate heterogeneous features. However, such fusion can hurt performance, because not all modalities contribute to a semantically aligned understanding for emotion prediction. We found that masking one of the input modalities improves predictions on 8.0% of misclassified instances. Motivated by this, we propose a representation learning method called Cross-modal DynAmic Transfer learning (CDaT), which dynamically filters out low-confidence modalities and complements them with high-confidence modalities via unimodal masking and cross-modal representation transfer learning. To this end, we train an auxiliary network that learns model confidence scores to determine which modality is of low confidence and how much information should be transferred from the other modalities. Moreover, the method can be applied to any fusion model, because it relies on a probability-based knowledge transfer loss acting on low-level unimodal representations. Experiments demonstrate that CDaT is effective with four different state-of-the-art fusion models on emotion recognition over the CMU-MOSEI and IEMOCAP datasets.

2. INDEX TERMS

Affective technology, cross-modality information exchange, model confidence level, multimodal emotion analysis

3. INTRODUCTION

In recent years, machine learning and information fusion applied to video platforms (e.g., YouTube, Twitch, and TikTok) have made remarkable progress.

(Figure 1 not reproduced)
  1. To answer these questions, we propose a representation learning method called Cross-modal DynAmic Transfer learning (CDaT) that dynamically adjusts misaligned modalities. The proposed approach leverages fusion models to learn cross-modal knowledge transfer in a model-agnostic way. Based on the results of masked-modality inference, we hypothesize that any change in logit outcome or class probability when masking a particular modality is evidence of misalignment. To capture this change, we propose a two-stage method: 1) misaligned modality detection and 2) modality knowledge transfer. First, we introduce the Misaligned Modality Filtering (MMF) stage, which trains an additional network to estimate instance-level modality confidence and proportionally adjusts irrelevant modalities using the other, highly confident modalities. To make these adjustments dynamic for each instance, models jointly learn Probabilistic Knowledge Transfer (PKT) with a divergence loss between the features extracted from each modality encoder. The advantage of the PKT loss is that it requires no additional parameters for knowledge transfer between modalities. Unlike general KT methods, it can be used without specifying particular hyperparameters (e.g., temperature) even if the feature dimensions differ [14].
  2. We conduct experiments on four baseline models to demonstrate the effectiveness of our proposed model-agnostic framework. The Naive Fusion model uses a simple concatenation of all input-level modalities, without representation learning for modality fusion. TFN [9] is an end-to-end approach that, for the first time, posed multimodal sentiment analysis as modeling intra- and inter-modality dynamics. MISA [11] is a representation learning method that encodes modality-shared and modality-distinct spaces separately to overcome the heterogeneity between modalities. TAILOR [8] is similar to the previous model but improves emotion recognition performance using a hierarchical cross-modal encoder and a label-guided decoder based on the Transformer architecture. We implemented these four state-of-the-art models on the CMU-MOSEI and IEMOCAP datasets, showed the overall performance improvement when applying CDaT on top of them, and experimentally analyzed the impact of different confidence measures.
  3. In this work, the novel contributions can be summarized as:
    (1) We propose CDaT, a novel multimodal emotion recognition method based on cross-modal confidence scores. The MMF stage solves the modality misalignment problem by training an additional network to estimate the misalignment level within each modality.
    (2) We also introduce a dynamic PKT for mitigating the effects of semantically misaligned modalities. The transfer model compares the outcome probability values between two modalities and selectively learns so that the features of the lower-confidence modality follow the feature distribution of the higher-confidence modality.
    (3) Experiments on CMU-MOSEI and IEMOCAP, two publicly available datasets for MER tasks, demonstrate consistent performance gains over state-of-the-art fusion models, proving the effectiveness of our model-agnostic approach.

4. RELATED WORKS

A. MULTIMODAL EMOTION RECOGNITION

Multimodal emotion recognition (MER) is a research field that aims to understand human emotions through various data types, such as speech, text, and facial expressions. It is motivated by how humans use multiple sources of information to perceive and express emotions [1], [2]. Various approaches exist for combining data from different modalities for MER.

1) CONVENTIONAL FUSION METHODS

Conventional fusion methods integrate features from various modalities using techniques such as weighted sum, concatenation, or averaging [3], [15], [16], [17]. These traditional methods often exhibit subpar effectiveness because they do not adequately address the discrepancies and interactive relationships among modalities.

2) TRANSFORMER-BASED FUSION METHODS

With the development of attention mechanisms [18], deep learning models have demonstrated notable effectiveness. Transformer-based fusion techniques concentrate on identifying key features within each modality and establishing correspondences across modalities. For instance, MulT [19] employs cross-attention, enabling it to attend to one modality's features based on another's information. Similarly, MAG [20] incorporates a gating mechanism that captures and integrates the characteristics of distinct modalities. However, these approaches overlook an essential aspect: they fail to account for the complementary nature of different modalities prior to integration.

3) REPRESENTATION LEARNING-BASED FUSION METHODS

To address the heterogeneity discrepancy between modalities, recent fusion methods aim to learn representations that exhibit invariance or independence from the modalities. The notable methods include TFN [9], which employs tensor-based fusion, and MIM [21], which maximizes mutual information to attain modality-invariant representations. Additionally, Self-MM [22] innovates by utilizing self-supervised learning techniques to generate modality-independent emotional labels. These techniques collectively enhance cross-modal compatibility through their integration. While these approaches strive to reflect multi-modal consistency, there remains a limitation in dynamically addressing scenarios where single-modality masking outperforms multi-modality fusion.

B. CROSS-MODALITY KNOWLEDGE TRANSFER

  1. Knowledge Transfer (KT) techniques have been proposed to compensate for the challenges of complex, large models and to increase the performance of lightweight neural networks [23], [24]. KT transfers knowledge from a complex teacher model to a simpler student model by imitating the teacher's outputs or modified versions of them. However, existing KT methods have several limitations: they cannot transfer directly between layers of different architectures or dimensionalities, so they are tailored towards classification tasks. The Probabilistic KT (PKT) technique overcomes these limitations by matching the probability distribution of the data in the feature space [14]. PKT can be applied to cross-modal KT, designating different modalities as the teacher and student models even when the feature dimensions differ.
  2. Cross-Modal Distillation (CMD) has been leveraged in diverse fields such as computer vision [25], video representation [26], action recognition [27], [28], and bi-modal emotion recognition [29]. Despite these successes, CMD has not yet been explored for multimodal emotion recognition tasks. In addition, we attempt to account for the heterogeneous distributions between modalities by using masking techniques during the transfer process.

5. METHODOLOGY

In this section, we offer a concise overview of the MER task and introduce our Cross-modal Dynamic Transfer learning (CDaT) methodology. This model-agnostic strategy integrates linguistic, visual, and acoustic modalities while removing the impact of misaligned sources. The CDaT framework encompasses two key stages: a stage for detecting misaligned modalities, and a stage for bidirectional knowledge transfer. The entire process is illustrated in Figure 2.

(Figure 2 not reproduced)

MER aims to forecast the multiple emotional classes that emerge in an utterance across three modalities (text, video, and audio). Each data instance corresponds to a scene taken from a distinct video, annotated with multiple target emotions (a multi-label classification task). The input sequences for each modality are denoted X_m ∈ ℝ^(T_m × d_m), where m ∈ {t, v, a} signifies the text, video, and audio modalities, and T_m and d_m are the sequence length and feature dimension of each modality. The target labels are represented as Y = {y_1, y_2, ..., y_L} ∈ ℝ^(1×L), where L denotes the number of labels. The training dataset is structured as D = {(X^{t,v,a}_i, Y_i)}_{i=1}^{N}, where N is the number of training samples and Y_i ⊂ Y is the set of labels assigned to instance i.
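To make this setup concrete, here is a minimal sketch of one training instance; the variable names and sequence lengths are illustrative, not taken from the paper's code (the feature dimensions match Section 6-D):

```python
import numpy as np

T_t, T_v, T_a = 50, 50, 50       # sequence lengths per modality (illustrative)
d_t, d_v, d_a = 300, 35, 74      # feature dims for text/video/audio (Sec. 6-D)
L = 6                            # number of emotion labels (CMU-MOSEI)

X_t = np.random.randn(T_t, d_t)  # text feature sequence
X_v = np.random.randn(T_v, d_v)  # video feature sequence
X_a = np.random.randn(T_a, d_a)  # audio feature sequence
Y = np.array([1, 0, 1, 0, 0, 0]) # multi-hot label vector, Y_i ⊂ {0,1}^L
```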


A. UNIMODAL FEATURE EXTRACTION


Most multimodal fusion models first extract features from each raw modality signal and map them into a latent space. Each modality in the input data is processed by a different encoder and translated into a low-dimensional feature representation h_m ∈ ℝ^(d_m). In prior work [9], [11], [21], BERT has been successfully applied to extract textual information, while sLSTM has been used to encode visual and acoustic information (specifically, h_t = BERT(X_t), h_a = sLSTM(X_a), h_v = sLSTM(X_v)). More recently, researchers have adopted Transformer-based methods for extracting each modality's information, which preserve the sequence dimension ℝ^(T_m×d_m).
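A hedged sketch of this extraction step, assuming "sLSTM" denotes a stacked LSTM and using Hugging Face's BertModel for the text branch (both are assumptions; the paper does not publish this code):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class UnimodalEncoders(nn.Module):
    """Sketch of h_t = BERT(X_t), h_a = sLSTM(X_a), h_v = sLSTM(X_v)."""
    def __init__(self, d_v=35, d_a=74, d_hidden=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # "sLSTM" is treated here as a 2-layer stacked LSTM; the exact variant is assumed.
        self.lstm_v = nn.LSTM(d_v, d_hidden, num_layers=2, batch_first=True)
        self.lstm_a = nn.LSTM(d_a, d_hidden, num_layers=2, batch_first=True)

    def forward(self, input_ids, attention_mask, X_v, X_a):
        h_t = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        _, (h_v, _) = self.lstm_v(X_v)   # final hidden state as the utterance feature
        _, (h_a, _) = self.lstm_a(X_a)
        return h_t, h_v[-1], h_a[-1]     # one vector h_m per modality
```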

B. PROPOSED FUSION METHOD

The goal of CDaT is to identify and control misaligned modalities before fusion. Even with a sophisticated fusion network architecture, detecting semantic misalignment between modalities inside the fusion network is challenging, as observed in Figure 1. Therefore, the proposed method detects misaligned modalities and dynamically adjusts the knowledge transferred from confident modalities to irrelevant ones by refining the low-level representations. To obtain the refined unimodal features, a transfer encoder G_m is trained as:

(equation not reproduced)

where h̃_m is the representation of the transferred information. CDaT consists of two stages: 1) filtering irrelevant modalities out of the low-level features, and 2) dynamically adjusting the degree of knowledge transfer between each pair of modalities based on the misalignment weights. The low-level feature representations are refined through the knowledge transfer process, which is trained with a Kullback-Leibler (KL) divergence loss over the misaligned modality combinations. For the transfer encoder G_m, the knowledge transfer loss is defined as:

(equation not reproduced)

where M denotes the set of all modalities {t, v, a}.
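Since the paper does not specify the architecture of the transfer encoder, a plausible minimal form is a small MLP that refines h_m within the same feature space (sizes are assumptions):

```python
import torch
import torch.nn as nn

class TransferEncoder(nn.Module):
    """Sketch of G_m: refines a unimodal feature via h̃_m = G_m(h_m).
    A single-hidden-layer MLP is an assumption; the paper does not fix the form."""
    def __init__(self, d_in=128, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_in),  # keep the refined feature in the same space
        )

    def forward(self, h_m):
        return self.net(h_m)
```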

1) STAGE 1: MISALIGNED MODALITY FILTERING

In the initial stage, CDaT identifies misaligned modalities from the low-level features h_m by calculating confidence scores while masking a specific modality, and thereby establishes an instance-level misalignment weight w_m. One naive strategy is to keep the misalignment weight uniform across all modalities, i.e., to assign a constant value of 1 to every confidence score (w_m = 1). A fixed misalignment weight means that cross-modal knowledge transfer occurs without considering modality confidence, effectively mirroring the traditional PKT method [14], which applies transfer learning across all modalities indiscriminately. However, such indiscriminate inter-modality PKT can hurt performance, because well-aligned modalities end up being influenced by poorly predictive ones. To address this limitation, we introduce two calibration techniques for computing modality confidence scores: 1) cross-entropy and 2) a confidence network. The first approach measures modality misalignment by comparing the probability distributions over the emotional labels when masking a specific modality versus integrating all modalities. To quantify the difference between the logit representations z and z_m (where z_m denotes fusion with modality m masked), cross-entropy is utilized as follows:

(equation not reproduced)

The cross-entropy reflects the model confidence derived from z and z_m using the classifier's prediction results: it indicates how much information is lost when masking modality m. We therefore use it as the confidence score of the masked modality m, assuming that a higher cross-entropy means modality m is more important for emotion recognition.
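A minimal sketch of this confidence score for a multi-label sigmoid head; the paper's exact expression is in the equation image not reproduced above, so the binary cross-entropy form below is an assumption:

```python
import torch

def masked_modality_confidence(z, z_m, eps=1e-8):
    """Cross-entropy between the full-fusion prediction (logits z) and the
    prediction with modality m masked (logits z_m). A higher score means more
    information is lost by masking m, i.e., m matters more for this instance."""
    p = torch.sigmoid(z)      # full-fusion class probabilities
    p_m = torch.sigmoid(z_m)  # probabilities with modality m masked out
    # Binary cross-entropy of p_m against p, summed over emotion classes.
    return -(p * torch.log(p_m + eps) + (1 - p) * torch.log(1 - p_m + eps)).sum(-1)
```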
To apply the emotion recognition model's outputs directly to misaligned modality filtering, an auxiliary network called ConfidNet is used to estimate model confidence, as proposed in [30]. It takes the output probabilities of the emotion recognition model as input and predicts a confidence score for each emotion category, reflecting how trustworthy the model's predictions are for a given input. We train ConfidNet to approximate the target value C*, which is defined as:

(equations not reproduced)

where z is the input feature vector produced by the fusion network, y is the ground-truth emotion label, ŷ is the predicted emotion label, and L is the number of ground-truth emotion labels. At test time the target value C* is unknown, so the average predicted probability of the true emotion classes on the test set is used as the target estimate. For the training loss, the model learns the parameters θ by minimizing the L2 loss between the ConfidNet output and the target value C*. The network is a multi-layer architecture with ReLU activations. As shown in Equation 5, the optimization adjusts the misalignment weights accordingly:

(Equation 5 not reproduced)
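A sketch of ConfidNet and its regression loss, assuming (per the description above) an MLP with ReLU activations trained with an L2 loss toward the true-class-probability target C*; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ConfidNet(nn.Module):
    """Auxiliary confidence network (after [30]): maps the fusion feature z to a
    per-label confidence estimate. Depth/width are assumptions (cf. Sec. 7-C)."""
    def __init__(self, d_in=256, d_hidden=128, n_labels=6, n_layers=2):
        super().__init__()
        layers, d = [], d_in
        for _ in range(n_layers - 1):
            layers += [nn.Linear(d, d_hidden), nn.ReLU()]
            d = d_hidden
        layers += [nn.Linear(d, n_labels), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

def confidnet_loss(conf_net, z, c_star):
    """L2 regression of the predicted confidence against the target C*
    (the recognition model's probabilities for the true classes)."""
    return ((conf_net(z) - c_star) ** 2).mean()
```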

2) STAGE 2: KNOWLEDGE TRANSFER FROM ALIGNED MODALITY

(Figure 3 not reproduced)

CDaT dynamically adjusts misaligned modalities at the level of low-level features. Based on each modality's confidence, knowledge is transferred from a more confident source modality to the target modality. Specifically, the model minimizes the distributional discrepancy between the two modalities, transforming the lower-confidence feature h_m into its refined version h̃_m, as shown in Figure 3. We define the knowledge transfer from a source modality m to a target modality n as KT(m→n) and formalize it using the KL divergence between probability distributions:

(equations not reproduced)

where P(h_m) denotes the probability distribution of the low-level feature h_m and Q(h_n) that of h_n, for the input modality set M = {t, v, a}. These distributions are computed as conditional distributions based on the pairwise cosine similarities between each sample and its neighbors. On this basis, we define a cross-modal probabilistic knowledge transfer loss:

(equation not reproduced)

In this work, because the KL divergence loss is computed in a non-symmetric way, cross-modal knowledge transfer is conducted bidirectionally. For example, with three modalities (text, video, and audio), L_PKT sums the loss over the six ordered pairs {t→v, t→a, v→t, v→a, a→t, a→v}. To realize the dynamic adjustment of misaligned modalities, we multiply each loss by w_m, the confidence value computed in Equation 6. A weight adjustment factor s then compares the confidence scores of the two modalities under masking:

(equation not reproduced)

Using the confidence scores w_m and the adjustment factor s, the losses are dynamically modeled as:

(equation not reproduced)

where n ≠ m.
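The PKT machinery itself is documented in [14]: batch-wise cosine-similarity affinities are row-normalized into conditional distributions and matched with a KL divergence. The sketch below follows that recipe; the confidence-gap factor s is an assumption (its exact equation image is not reproduced above) implemented here as a softplus of the confidence difference:

```python
import torch
import torch.nn.functional as F

def cosine_affinity(h, eps=1e-8):
    """Pairwise cosine similarities within the batch, rescaled to [0, 1] and
    row-normalized into conditional probabilities, as in PKT [14]."""
    h = F.normalize(h, dim=1)
    sim = (h @ h.t() + 1.0) / 2.0                        # cosine mapped to [0, 1]
    mask = 1.0 - torch.eye(h.size(0), device=h.device)   # drop self-similarity
    sim = sim * mask
    return sim / (sim.sum(dim=1, keepdim=True) + eps)

def pkt_loss(h_src, h_tgt, eps=1e-8):
    """KL(P_src || Q_tgt): pulls the target modality's neighborhood structure
    toward the higher-confidence source modality's. Works even when the two
    feature dimensions differ, since only batch affinities are compared."""
    p = cosine_affinity(h_src).detach()   # treat the source side as a fixed teacher
    q = cosine_affinity(h_tgt)
    return (p * torch.log((p + eps) / (q + eps))).sum(dim=1).mean()

def cdat_transfer_loss(features, conf):
    """Sum of the six directional PKT losses, each scaled by the (assumed)
    confidence-gap factor s. `features` maps modality -> (B, d_m) tensor,
    `conf` maps modality -> scalar confidence tensor w_m."""
    total = 0.0
    for m in features:                    # e.g., {'t': h_t, 'v': h_v, 'a': h_a}
        for n in features:
            if m == n:
                continue
            s = F.softplus(conf[m] - conf[n])  # transfer more when src is more confident
            total = total + s * pkt_loss(features[m], features[n])
    return total
```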

3) TOTAL LOSS

Finally, CDaT learns both the original emotion recognition task and the dynamic cross-modal transfer by simultaneously optimizing the two loss functions:

(equation not reproduced)

where α is a hyperparameter that determines the relative weight of cross-modal knowledge transfer. In this work, α was set manually for each fusion model based on empirical emotion recognition results.
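Reconstructed from this description (the equation image above is not reproduced), the combined objective plausibly reads:

```latex
\mathcal{L}_{\mathrm{total}} \;=\; \mathcal{L}_{\mathrm{task}} \;+\; \alpha\,\mathcal{L}_{\mathrm{PKT}}
```

where L_task is the multi-label emotion recognition loss (binary cross-entropy in the re-implementations of Section 7-A) and L_PKT is the confidence-weighted transfer loss of Stage 2.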

4) INFERENCE

In this study, after jointly training a fusion model with the above losses, the multimodal input data is passed back through the architecture described in Section 5-A to obtain outcomes on the test set. The auxiliary network introduced above is used only during the training phase, to quantify the dynamic transfer losses between each modality pair; consequently, it is omitted from inference. Instead, the input is passed through the cross-modal transfer encoder G and a fusion encoder F to obtain a unified representation:

(equation not reproduced)

As before, unimodal features are extracted with sLSTM for the visual and audio modalities and with BERT for the text modality. In summary, the fusion model obtains the prediction probability for each label from the unified representation.
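A compact sketch of this inference path under the assumptions above (all module names are illustrative; only the transfer encoders G and the fusion encoder F survive from training, and ConfidNet is dropped):

```python
import torch

@torch.no_grad()
def cdat_inference(encoders, G, F_fusion, classifier, batch):
    """Test-time pass: unimodal features flow through the transfer encoders G
    and the fusion encoder F; no confidence network is involved."""
    h_t, h_v, h_a = encoders(**batch)                      # unimodal features (Sec. 5-A)
    h = F_fusion(G['t'](h_t), G['v'](h_v), G['a'](h_a))    # unified representation
    return torch.sigmoid(classifier(h)) > 0.5              # multi-label decision at 0.5
```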

6. EXPERIMENTAL SETTING

This section covers the dataset statistics, evaluation metrics, and baseline methods employed in our experiments, which aim to demonstrate the performance and effectiveness of the proposed approach. The details of the experimental settings follow prior studies [8], [11].

A. DATASET

In this study, experiments are conducted on two publicly available MER benchmarks, CMU-MOSEI and IEMOCAP.

(dataset statistics table not reproduced)

B. EVALUATION METRICS

In accordance with prior studies [8], this study employs four quantitative metrics to assess CDaT's performance. For the multi-label classification task, we report accuracy (Acc) and F1 score (F1). Accuracy reflects how well predictions align with the actual labels across all categories, while the F1 score integrates both recall and precision; F1 is computed at the micro level with a prediction threshold of 0.5. To ensure reliable comparisons between models, precision (P) and recall (R) are also reported.
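A sketch of this evaluation protocol using scikit-learn; note that `accuracy_score` computes exact-match (subset) accuracy on multi-label inputs, which may differ from the paper's exact Acc definition:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Micro-averaged multi-label metrics with the 0.5 decision threshold used in
    the re-implementations. y_true, y_prob: (N, L) arrays."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "Acc": accuracy_score(y_true, y_pred),  # exact-match subset accuracy
        "F1":  f1_score(y_true, y_pred, average="micro"),
        "P":   precision_score(y_true, y_pred, average="micro"),
        "R":   recall_score(y_true, y_pred, average="micro"),
    }
```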

C. BASELINES

To validate the performance enhancement of our model-agnostic approach, we employ multiple fusion frameworks to conduct comprehensive experimental analyses.

  1. Naive Fusion combines the modalities via simple concatenation: each modality of the input data is encoded separately by its corresponding encoder into the latent space, the representations are integrated (e.g., by concatenation, summation, or multiplication), and an MLP layer is used for prediction (see the sketch after this list).
  2. TFN [9] is a 3D tensor-fusion architecture for multimodal sentiment analysis. A learnable network models all possible combinations of the different modalities and generates the final representation by combining them.
  3. MISA [11] is a multimodal fusion model for sentiment analysis that focuses on the representation spaces of the modalities. One is a modality-invariant space that learns the factors shared across modalities to reduce the modality gap; the other is a modality-specific space in which each modality's unique information reflects its internal factors. The two kinds of representations are finally integrated through an attention mechanism to produce the result.
  4. TAILOR [8], like MISA, uses two modality spaces (modality-invariant and modality-specific), but adopts a label-guided decoder for the emotion recognition task. In the integration stage, TAILOR uses a hierarchical cross-modal encoder based on self-attention to pass both space representations to the decoder, which performs inference jointly with the full set of label embeddings. Notably, the original paper reported TAILOR's MER performance using an unreasonably low multi-label binary classification threshold of 0.3; we reset it to 0.5.
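As referenced in item 1, a minimal sketch of the Naive Fusion baseline (sizes are illustrative):

```python
import torch
import torch.nn as nn

class NaiveFusion(nn.Module):
    """Concatenate the unimodal features and predict with an MLP."""
    def __init__(self, d_t=768, d_v=128, d_a=128, n_labels=6):
        super().__init__()
        d = d_t + d_v + d_a  # hidden size = sum of modality dims (cf. Sec. 6-D)
        self.mlp = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, n_labels))

    def forward(self, h_t, h_v, h_a):
        return self.mlp(torch.cat([h_t, h_v, h_a], dim=-1))  # logits per label
```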

D. IMPLEMENTATION DETAILS

Following the MISA implementation, the CMU-MultimodalSDK toolkit is used for the MOSEI benchmark. For Naive Fusion, TFN, and TAILOR, aligned and preprocessed data are adopted as in [8]. The hyperparameters of the auxiliary training loss follow MISA [11]. The feature dimensions of text, video, and audio are 300, 35, and 74, respectively. The hidden representation size is d_f = 256 for TAILOR and TFN and 128 for MISA, while Naive Fusion uses the sum of the input modality dimensions as its hidden size. The Adam optimizer is used for all models; the fusion networks are configured with an initial learning rate of 5e-5 and a dropout rate of 0.6, except that Naive Fusion and TFN use a learning rate of 1e-4. A learning-rate decay scheduler is also applied. MISA is trained for 40 epochs, while the other baselines are trained for 50 epochs.
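These settings can be summarized in a small config sketch (only values stated above; details not given there, such as the exact decay schedule, are left unspecified):

```python
# Hedged summary of the reported training settings (Sec. 6-D).
CONFIG = {
    "dims":        {"text": 300, "video": 35, "audio": 74},
    "hidden_size": {"TAILOR": 256, "TFN": 256, "MISA": 128},  # NaiveFusion: sum of dims
    "optimizer":   "Adam",
    "lr":          {"MISA": 5e-5, "TAILOR": 5e-5, "NaiveFusion": 1e-4, "TFN": 1e-4},
    "dropout":     0.6,
    "epochs":      {"MISA": 40, "others": 50},
}
```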

7. RESULTS AND DISCUSSION

We conducted experiments to investigate four reasonable questions:
• (RQ1) When applied to multimodal emotion recognition tasks in a model-agnostic manner, can the CDaT improve the performance of existing fusion models?
• (RQ2) How does the approach measure the confidence level of each modality?
• (RQ3) To what extent do hyperparameters influence model performance?
• (RQ4) Can we demonstrate the effectiveness of CDaT through qualitative analysis?
The following descriptions offer a detailed walkthrough of each experiment's results and analysis.

A. OVERALL PERFORMANCE (RQ1)

Our approach is applicable to any multimodal fusion model in the context of a multi-label classification task (RQ1). The quantitative results of each model are presented in Tables 2 and 3. Each model was re-implemented as a multi-label classifier and trained with a standard cross-entropy loss, even where its original architecture targeted regression. As outlined in Section 6-C, the threshold on the class probability was set to 0.5 to ensure consistent application of the multi-label binary classification loss across all re-implemented models.

(Tables 2 and 3 not reproduced)

Across both benchmark datasets (CMU-MOSEI and IEMOCAP), CDaT consistently improves accuracy and F1 score over the existing fusion models. The Naive Fusion model remains comparatively weak owing to its simplistic design, which underscores the value of adding techniques such as cross-modal transfer learning on top of traditional models. CDaT's use of cross-modal information yields clear gains over conventional approaches such as TFN and TAILOR. Although IEMOCAP is more challenging due to its larger label set, CDaT still delivers a notable improvement over these baselines by incorporating modality-specific confidence scores into the learning process.

B. MODALITY CONFIDENCE CALIBRATION ANALYSIS (RQ2)

We proposed three metrics (constant, cross-entropy, and confidence network) for modality confidence calibration in Section 5-B. To address RQ2, we conducted transfer learning with each confidence calibration on the TAILOR fusion model. The assessment results are presented in Tables 4 and 5, labeled CDaT-Const for the constant-based, CDaT-CrossEntropy for the cross-entropy-based, and CDaT-ConfidNet for the confidence-network-based method. First, both CDaT-CrossEntropy and CDaT-ConfidNet outperformed the baseline across all benchmarks. This suggests that dynamic knowledge transfer between modalities is not only beneficial in itself but also complements any representation fusion architecture employed for emotion recognition. Table 4 shows that performance gains are significantly greater with dynamic adjustment (CDaT-CrossEntropy and CDaT-ConfidNet) than with a static, constant filtering metric (CDaT-Const). On the other hand, Table 5 reveals that the ConfidNet method is not significantly more effective than the alternative confidence calibration techniques on the IEMOCAP benchmark: the confidence score regression model performs worse on IEMOCAP than on CMU-MOSEI (as shown in Figure 4). It is therefore crucial to employ a reliable confidence metric when applying cross-modal KT to multimodal emotion recognition.

(Tables 4 and 5, and Figure 4, not reproduced)

C. CONFIDENCE NETWORK STUDY (RQ3)

Given the results in Figure 4b, we examine the performance of confidence networks across varying numbers of layers and hidden sizes (RQ3); essentially, we assess how the ConfidNet architecture contributes to CDaT's performance. In Table 6, bolded results denote the best performance across all scenarios, while underlined results highlight the best case per layer configuration. The experiments reveal that the number of layers has a significant impact on the confidence network's performance and on CDaT-ConfidNet's final outcomes, whereas the hidden size does not have a substantial influence. Although the effect of hidden size is minor, increasing its capacity becomes more beneficial as the layer count decreases. This study employs a stacked linear structure (an MLP) for confidence score prediction, following existing architectures [30]. Future work could explore alternative configurations such as RNNs [32] or Transformers [18] to further optimize the network.

(Table 6 not reproduced)

D. QUALITATIVE RESULT ANALYSIS (RQ4)

To demonstrate that CDaT successfully filters out semantically misaligned modalities at the instance level, we analyze qualitative outcomes for RQ4. Figure 5 shows how inference outcomes on utterance-level examples from the CMU-MOSEI benchmark change when CDaT-ConfidNet is applied to one of the fusion models, MISA. For these utterances, C* denotes the ground-truth labels. With the existing baseline alone, the probability assigned to C* remains low; after dynamic cross-modal transfer learning guided by the modality-aware confidence scores, the probability of C* increases substantially, enabling more accurate multi-label classification. Misaligned modalities can cause erroneous emotion predictions, such as labeling a statement happy when it should be neutral or sad. In such cases, training with CDaT improves the model by boosting confidence in the correct labels and diminishing it for incorrect ones. This trend persists across multiple examples, demonstrating the robustness and effectiveness of detecting misaligned modalities and handling them through confidence-based cross-modal transfer learning.

(Figure 5 not reproduced)

8. CONCLUSION

In conclusion, observing that multimodal fusion models for emotion recognition can correct a meaningful fraction of their errors when a modality is masked, we proposed a novel cross-modal dynamic transfer learning method (CDaT) that addresses this through a dynamically re-weighted cross-modal knowledge transfer loss. Being model-agnostic, CDaT was applied to four baseline fusion models on two emotion recognition benchmarks. Experimental results show that CDaT achieves higher accuracy than the underlying fusion algorithms, and an instance-level qualitative study illustrates how it improves the predicted emotion labels and confidence scores. We hope future work extends this approach to broader MER research.
