[Paper Reading][backbone][DiffKD] Knowledge Diffusion for Distillation

DiffKD

Abstract

The discrepancy between teacher and student representations emerges as an increasingly significant topic in knowledge distillation (KD). To bridge this gap and enhance performance, current methods typically employ complex training schemes, loss functions, and feature alignment techniques, which are both task-specific and feature-specific. We argue that the core idea behind these approaches is fundamentally rooted in their shared objective: to eliminate noisy information while preserving valuable insights within feature representations. Consequently, we present a novel KD method referred to as DiffKD, which explicitly denoises student features through the application of diffusion models while simultaneously aligning teacher-student representations. Our methodology leverages the fact that student features inherently contain more noise due to the limitations of their smaller capacity relative to teacher models. To address this challenge, we introduce a lightweight autoencoder-based architecture designed to reduce computational overhead while maintaining denoising performance. Furthermore, we incorporate an adaptive noise matching module tailored specifically for enhancing denoising capabilities. Extensive experiments validate our approach's effectiveness across diverse feature types, consistently outperforming existing methods in tasks such as image classification, object detection, and semantic segmentation.

The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).

To narrow this gap and improve performance, current methods usually employ complex training schemes, loss functions, and feature alignment techniques, which are task-specific and feature-specific.

Diffusion model in DiffKD.
The diffusion model is trained with the teacher feature in the diffusion process q (red dashed arrows); meanwhile, the student feature is fed into the reverse denoising process p_θ (blue arrows) to obtain the denoised feature used for distillation.
We find that, due to its limited capacity, the student feature contains more noise and its semantic information is less salient than the teacher's.
Therefore, we treat the student feature as a noisy version of the teacher feature and propose to denoise it with a diffusion model trained on the teacher features.
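For reference, the two processes in the caption can be written in the standard DDPM form. This is the generic formulation; the notation z_0 for the clean teacher feature and the schedule ᾱ_t are assumptions, and the paper's exact parametrization may differ:

```latex
% Forward diffusion process q (red dashed arrows): gradually add Gaussian noise to the
% clean (teacher) feature z_0, with cumulative noise schedule \bar{\alpha}_t.
q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t)\,\mathbf{I}\right)

% Reverse denoising process p_\theta (blue arrows): iteratively remove the predicted noise;
% in DiffKD the chain is initialized with the (noisy) student feature instead of pure noise.
p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\!\left(z_{t-1};\ \mu_\theta(z_t, t),\ \sigma_t^2\,\mathbf{I}\right)
```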

In short, the diffusion model is trained at the same time as the distillation is performed.
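A minimal PyTorch-style sketch of this joint scheme, assuming a standard DDPM noise schedule `alphas_cumprod`; the function names (`diffkd_step`, `ddim_denoise`), the fixed starting timestep, and the MSE losses are illustrative choices rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F


def ddim_denoise(z, noise_predictor, alphas_cumprod, steps=5):
    """Run a few deterministic DDIM-style reverse steps on a (noisy) feature.
    Gradients flow back through z so the student can be trained via the denoised output."""
    T = len(alphas_cumprod)
    ts = torch.linspace(T - 1, 0, steps + 1).long().tolist()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        t_batch = torch.full((z.size(0),), t_cur, device=z.device, dtype=torch.long)
        eps = noise_predictor(z, t_batch)                        # predicted noise
        z0_hat = (z - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()   # predicted clean feature
        z = a_next.sqrt() * z0_hat + (1 - a_next).sqrt() * eps   # step towards less noise
    return z


def diffkd_step(feat_teacher, feat_student, noise_predictor, alphas_cumprod):
    """One training step: (1) train the diffusion model on noised teacher features,
    (2) denoise the student feature and distill it towards the teacher feature.
    alphas_cumprod: 1-D tensor of cumulative alphas on the same device as the features."""
    B = feat_teacher.size(0)

    # (1) Diffusion loss: the noise predictor learns the distribution of teacher features.
    t = torch.randint(0, len(alphas_cumprod), (B,), device=feat_teacher.device)
    noise = torch.randn_like(feat_teacher)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)                   # features assumed (B, C, H, W)
    noisy_teacher = a_bar.sqrt() * feat_teacher + (1 - a_bar).sqrt() * noise
    loss_diff = F.mse_loss(noise_predictor(noisy_teacher, t), noise)

    # (2) Distillation loss: treat the student feature as a noisy sample, denoise it,
    #     and match the denoised result to the teacher feature.
    denoised_student = ddim_denoise(feat_student, noise_predictor, alphas_cumprod)
    loss_kd = F.mse_loss(denoised_student, feat_teacher.detach())

    return loss_diff + loss_kd
```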

Introduction

The success of deep neural networks usually comes at the cost of heavy computation and memory, which is constrained by device capability in practical applications, especially on resource-limited devices. Knowledge distillation (KD) [13] is a widely adopted technique for improving efficiency: a smaller model (the student) is trained by transferring knowledge from a larger model (the teacher).

The central aspect of knowledge distillation revolves around effectively transferring knowledge from teacher models to student models through matching output features such as representations and logits.

The key to knowledge distillation is how to realize the transfer of knowledge from the teacher to the student, typically by matching representations and logits.
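For context, the classic logit-matching objective of Hinton et al. [13] combines the task loss with a temperature-scaled KL term (this is the standard formulation, not something specific to DiffKD):

```latex
% z^T, z^S: teacher / student logits, \sigma: softmax, \tau: temperature, y: ground-truth label
\mathcal{L}_{\mathrm{KD}} = (1 - \lambda)\,\mathcal{L}_{\mathrm{CE}}\!\left(y,\ \sigma(z^{S})\right)
 + \lambda\,\tau^{2}\,\mathrm{KL}\!\left(\sigma(z^{T}/\tau)\ \big\|\ \sigma(z^{S}/\tau)\right)
```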

Recent studies [16, 28] point out that when there is a large capacity gap between the two models, the student features may diverge significantly from the teacher features. Such divergence can harm the subsequent distillation and lower the overall learning efficiency; forcing alignment of these mismatched features during optimization not only introduces interference but may further hurt the student's learning.

Hence, the core of the recent dominant KD methods in this area lies in distilling only the refined, valuable information.

For example,
TAKD [28] introduces multiple middle-sized teacher assistant models to bridge the gap;
SFTN [29] learns a student-friendly teacher by regularizing the teacher training with the student; DIST [16] relaxes the exact matching of teacher and student features in the Kullback-Leibler (KL) divergence loss by proposing a correlation-based loss;
MasKD [17] distills the valuable information in the features and ignores the noisy regions by learning to identify receptive regions that contribute to the task precision.
However, these methods need to resort to either complicated training schemes or task-specific priors, making them challenging to apply to various tasks and feature types.

TAKD [28] introduces multiple middle-sized teacher assistant models to bridge this gap;

SFTN [29] learns a student-friendly teacher by regularizing the teacher's training with the student; DIST [16] relaxes the exact matching of the Kullback-Leibler (KL) divergence loss between teacher and student features with a correlation-based loss; MasKD [17] distills the valuable information in the features and avoids the noisy regions by learning to identify receptive regions that contribute to task precision.

However, these methods rely on either complicated training schemes or task-specific priors, which makes them hard to apply to diverse tasks and feature types.

Our study takes a unique approach and argues in favor of the notion that the crux of knowledge distillation lies in minimizing interference within the distillation features. Our stance is that students are inherently noisy versions of their teachers, primarily because of limitations in capacity or training procedures.

However, distilling knowledge from this noisy information can have adverse effects on the student's learning and may even lead to unintended degradation.
Therefore, we propose to eliminate the noisy information within the student and distill only the valuable information accordingly.
Inspired by their success on generative tasks, we leverage diffusion models [14, 39], a class of probabilistic generative models that can gradually remove noise from an image or a feature, as the denoising module.
Our DiffKD framework is illustrated in Figure 1.
Through empirical results, we demonstrate that this straightforward denoising process is capable of generating a denoised student feature that closely resembles its corresponding teacher feature, thereby ensuring that our knowledge distillation can be conducted in a more consistent manner.

In other words, this paper starts from a different perspective and argues that the devil of knowledge distillation lies in the noise within the distilled features.

Intuitively, the student can be regarded as a noisier version of the teacher: due to its limited capacity or training procedure, it fails to learn truly valuable and clean features.

Distilling knowledge from such noise is harmful to the student and may even lead to undesired degradation.

Therefore, we propose to remove the noisy information inside the student and distill only the valuable information.

Specifically, building on the success of diffusion models in generative tasks, we use a diffusion model as the denoising module.

An overview of our DiffKD is shown in Figure 1.

Empirically, this simple denoising process produces a denoised version of the student feature that closely resembles the corresponding teacher feature, which allows the distillation to be carried out in a more consistent manner.

Nevertheless, directly leveraging diffusion models in knowledge distillation has two major issues.
(1) Expensive computation cost.
The conventional diffusion models use a UNet-based architecture to predict the noise, and take a large amount of computation to generate high-quality images.
In DiffKD, a lighter diffusion model would suffice since we only need to denoise the student feature.
We therefore propose a lightweight diffusion model consisting of two bottleneck blocks from ResNet [10].
Besides, inspired by Latent Diffusion [34], we also adopt a linear autoencoder to compress the teacher feature, which further reduces the computation cost.
(2) Inexact noise level of the student feature.
The reverse denoising process in diffusion requires starting from a certain initial timestep, but in DiffKD the student feature is used as the initial noisy feature and we cannot directly obtain its corresponding noise level (timestep); this inexact noise level weakens the denoising performance.
To solve this problem, we propose an adaptive noise matching module, which adaptively measures the noise level of each student feature and adds a corresponding Gaussian noise so that the feature matches the correct noise level at initialization.
With these two improvements, our resulting method DiffKD is efficient and effective, and can be easily implemented on various tasks.
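A rough sketch of what these two lightweight components might look like; the channel sizes, the reuse of torchvision's `Bottleneck`, and the omission of timestep conditioning are all illustrative assumptions, not the paper's exact design:

```python
import torch.nn as nn
from torchvision.models.resnet import Bottleneck


class LinearAutoEncoder(nn.Module):
    """'Linear' (1x1-conv) autoencoder that compresses the teacher feature channels,
    so the diffusion/denoising runs in a smaller latent space (as in Latent Diffusion)."""
    def __init__(self, in_channels=2048, latent_channels=256):
        super().__init__()
        self.encoder = nn.Conv2d(in_channels, latent_channels, kernel_size=1)
        self.decoder = nn.Conv2d(latent_channels, in_channels, kernel_size=1)

    def forward(self, x):
        z = self.encoder(x)
        # Return the latent for diffusion and the reconstruction for an auxiliary AE loss.
        return z, self.decoder(z)


class LightDiffusionModel(nn.Module):
    """Noise predictor built from two ResNet bottleneck blocks instead of a full UNet.
    Timestep conditioning is dropped here for brevity (a simplification)."""
    def __init__(self, channels=256):
        super().__init__()
        self.blocks = nn.Sequential(
            Bottleneck(channels, channels // 4),   # inplanes == planes * 4, so the skip works
            Bottleneck(channels, channels // 4),
        )

    def forward(self, z_noisy, t=None):
        return self.blocks(z_noisy)                # predicted noise, same shape as the input
```

Under this sketch, the teacher feature would first be encoded into the small latent space, the diffusion and distillation losses would be computed there, and the autoencoder would be trained with a reconstruction loss on the teacher feature.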

However, directly applying diffusion models to knowledge distillation has two major issues.

(1) Expensive computation cost.

Conventional diffusion models build their noise predictor on a UNet architecture and need a large amount of computation to generate high-quality images.

In DiffKD, however, a lighter diffusion model suffices, since we only need to denoise the student feature.

Therefore, a lightweight diffusion model consisting of two ResNet bottleneck blocks [10] is proposed.

In addition, inspired by Latent Diffusion [34], a linear autoencoder is adopted to compress the teacher feature, which further reduces the computation cost.

(2) Inexact noise level of the student feature.

The reverse denoising process needs to start from a certain initial timestep, but in DiffKD the student feature serves as the initial noisy feature and its noise level (timestep) cannot be obtained directly, which weakens the denoising performance.

To address this, an adaptive noise matching module is proposed: it adaptively estimates the noise level of each student feature and adds a corresponding Gaussian noise so that the feature matches the correct noise level at initialization.

With these two improvements, DiffKD is efficient and effective, and can be easily applied to various tasks.
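One way to picture the adaptive noise matching described above; this is only a sketch, where the per-sample ratio head, the sigmoid parametrization, and the blending formula are assumptions rather than the paper's exact module:

```python
import torch
import torch.nn as nn


class AdaptiveNoiseMatching(nn.Module):
    """Estimate a per-sample noise ratio for the student feature and blend in Gaussian noise,
    so that its noise level roughly matches the initial timestep of the reverse process."""
    def __init__(self, channels=256):
        super().__init__()
        self.ratio_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),
            nn.Sigmoid(),            # gamma in (0, 1): estimated "noisiness" of this feature
        )

    def forward(self, feat_student):
        gamma = self.ratio_head(feat_student).view(-1, 1, 1, 1)
        noise = torch.randn_like(feat_student)
        # Mix the student feature with Gaussian noise so the reverse denoising can start
        # from a consistent noise level, then hand the result to the diffusion model.
        return (1.0 - gamma).sqrt() * feat_student + gamma.sqrt() * noise
```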

Notably, an important advantage of our DiffKD is that it is feature-agnostic: the knowledge diffusion can be applied to various types of features, including intermediate features, classification outputs, and regression outputs.
Experiments consistently show that DiffKD significantly outperforms current state-of-the-art methods on standard model configurations for image classification, object detection, and semantic segmentation.
For instance, DiffKD achieves 73.62% top-1 accuracy with a MobileNetV1 student and a ResNet-50 teacher on ImageNet, surpassing DKD [50] by 1.57%; on semantic segmentation with a PSPNet-R18 student on Cityscapes, DiffKD outperforms MasKD [17] by about 1%.
Furthermore, to validate the effectiveness of our approach in bridging the gap between teacher and student features, we also evaluate DiffKD in stronger-teacher settings with more advanced teacher models, where it significantly outperforms existing methods.
For example, with a Swin-T student and a Swin-L teacher, DiffKD attains 82.5% accuracy on the ImageNet validation set, improving over the KD baseline by about 1%.

Notably, a distinct advantage of our DiffKD is its feature-agnostic nature: the knowledge diffusion is not limited to intermediate features, and it also supports classification outputs and regression outputs.

Extensive experiments show that DiffKD consistently surpasses current state-of-the-art methods in standard settings for image classification, object detection, and semantic segmentation.

For example, on ImageNet, DiffKD reaches 73.62% accuracy with a MobileNetV1 student and a ResNet-50 teacher, about 1.57% higher than DKD [50]; with a PSPNet-R18 student on the Cityscapes test set, DiffKD outperforms MasKD [17] on semantic segmentation by about 1%.

In addition, to demonstrate its effectiveness in eliminating the discrepancy between teacher and student features, DiffKD is also applied to stronger-teacher settings with more advanced teacher models, where it clearly outperforms existing methods.

With a Swin-T student and a Swin-L teacher, DiffKD achieves an impressive 82.5% accuracy on ImageNet, about 1% higher than the KD baseline.
