Paper Notes: Evolving Losses for Unsupervised Video Representation Learning
Distillation
Background on knowledge distillation (summarized from a Zhihu post): distill knowledge from a teacher model Net-T into a student model Net-S.

Purpose: obtain a compact model that is easy to deploy.
L=\alpha L_{s o f t}+\beta L_{h a r d}
L_{s o f t}=-\sum_{j}^{N} p_{j}^{T} \log \left(q_{j}^{T}\right), \text { where } p_{i}^{T}=\frac{\exp \left(v_{i} / T\right)}{\sum_{k}^{N} \exp \left(v_{k} / T\right)}, \quad q_{i}^{T}=\frac{\exp \left(z_{i} / T\right)}{\sum_{k}^{N} \exp \left(z_{k} / T\right)}
Here $v_i$ are the teacher's logits and $z_i$ are the student's logits.
L_{h a r d}=-\sum_{j}^{N} c_{j} \log \left(q_{j}^{1}\right), \text { where } q_{i}^{1}=\frac{\exp \left(z_{i}\right)}{\sum_{j}^{N} \exp \left(z_{j}\right)}
The first term learns from the teacher model; the second term learns from the ground truth.
The temperature controls how much attention Net-S pays to the negative labels during training: at low temperature, the negative labels, especially those significantly below the average, receive little attention; at high temperature, the values associated with the negative labels are relatively amplified, and Net-S attends to them comparatively more.
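As a sanity check, here is a minimal PyTorch sketch of this combined loss. The function name `distillation_loss` and the hyperparameter values `T` and `alpha` are illustrative choices, not from the paper; the KL divergence below differs from the cross-entropy $L_{soft}$ only by the (constant) teacher entropy, so the gradients agree.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Combined distillation loss L = alpha * L_soft + (1 - alpha) * L_hard."""
    # L_soft: match the teacher's softened distribution at temperature T.
    # KL divergence equals the cross-entropy -sum p log q up to the
    # (constant) teacher entropy, so the gradients are identical.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude stays comparable across T
    # L_hard: ordinary cross-entropy against the ground-truth labels (T = 1).
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```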
Main idea: Multiple modalities to multiple tasks

Loss Function
\mathcal{L}=\sum_{m} \sum_{t} \lambda_{m, t} \mathcal{L}_{m, t}+\sum_{d} \lambda_{d} \mathcal{L}_{d}
where
\lambda_{m,t} and \lambda_{d} are the loss weights
\mathcal{L}_{m,t} is the loss of task t applied to modality m
\mathcal{L}_{d} is the L_2 distance between a layer M_i of the main network and the corresponding layer L_i of another modality network:
\mathcal{L}_{d}\left(L_{i}, M_{i}\right)=\left\|L_{i}-M_{i}\right\|_{2}
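A minimal sketch of how this combined objective might be assembled, assuming the per-task losses and the layer activations $(L_i, M_i)$ have already been computed; all names here are hypothetical.

```python
import torch

def total_loss(task_losses, lambda_mt, distill_pairs, lambda_d):
    """task_losses: dict mapping (modality, task) -> scalar loss tensor.
    distill_pairs: list of (L_i, M_i) activation tensors to align."""
    # Weighted sum over all (modality, task) losses.
    loss = sum(lambda_mt[key] * task_losses[key] for key in task_losses)
    # Add the L2 distillation distances between corresponding layers.
    for lam, (L_i, M_i) in zip(lambda_d, distill_pairs):
        loss = loss + lam * torch.norm(L_i - M_i, p=2)
    return loss
```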
Evolution Algorithm
A genetic algorithm is used to determine the weights \lambda.
Each λ_{m,t} and λ_d is constrained to lie in [0, 1].
Unsupervised loss function
Zipf Distribution matching (ELo)
Given cluster centroids \left\{c_{1}, c_{2}, \ldots, c_{k}\right\}, \text { where } c_{i} \in \mathcal{R}^{D}, and naively assuming all clusters share the same variance with 2 \sigma^{2}=1, we can compute the probability of a feature vector x \in \mathcal{R}^{D} belonging to a cluster c_i as
p\left(x \mid c_{i}\right)=\frac{1}{\sqrt{2 \sigma^{2} \pi}} \exp \left(-\frac{\left(x-c_{i}\right)^{2}}{2 \sigma^{2}}\right)
By Bayes' rule:
\begin{aligned} p\left(c_{i} \mid x\right) &=\frac{p\left(c_{i}\right) p\left(x \mid c_{i}\right)}{\sum_{j=1}^{k} p\left(c_{j}\right) p\left(x \mid c_{j}\right)}=\frac{\exp \left(-\frac{\left(x-c_{i}\right)^{2}}{2 \sigma^{2}}\right)}{\sum_{j=1}^{k} \exp \left(-\frac{\left(x-c_{j}\right)^{2}}{2 \sigma^{2}}\right)} \\ &=\frac{\exp \left(-\left(x-c_{i}\right)^{2}\right)}{\sum_{j=1}^{k} \exp \left(-\left(x-c_{j}\right)^{2}\right)} \end{aligned}
which is the standard softmax function (assuming a uniform prior p(c_i) over clusters, so the priors cancel).
Given the above probability of each video belonging to each cluster, we define a Zipf prior over the classes as q\left(c_{i}\right)=\frac{1 / i^{s}}{H_{k, s}}, where H_{k,s} is the k-th generalized harmonic number and s is a real constant.
p\left(c_{i}\right)=\frac{1}{N} \sum_{x \in V} p\left(c_{i} \mid x\right) is the empirical cluster distribution, averaged over all N videos in the set V.
The KL divergence between the empirical distribution p and the Zipf prior q is
K L(p \| q)=\sum_{i=1}^{k} p\left(c_{i}\right) \log \left(\frac{p\left(c_{i}\right)}{q\left(c_{i}\right)}\right)
This KL divergence serves as the fitness function: it imposes a prior constraint that the distribution of the (learned) video representations over clusters should follow the Zipf distribution.
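A rough NumPy sketch of this fitness computation under the assumptions above (shared variance, 2σ² = 1). The function name `elo_fitness` and the defaults `k` and `s` are illustrative, and sorting p so that the largest cluster is compared against Zipf rank 1 is my assumption, not a detail confirmed by the notes.

```python
import numpy as np
from sklearn.cluster import KMeans

def elo_fitness(features, k=10, s=1.0):
    """KL(p || q) between the empirical cluster distribution p and a Zipf
    prior q. `features` is an (N, D) array of video embeddings."""
    centroids = KMeans(n_clusters=k, n_init=10).fit(features).cluster_centers_
    # Soft assignment: softmax over negative squared distances (2*sigma^2 = 1).
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, k)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    p_given_x = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p = p_given_x.mean(axis=0)                           # marginal p(c_i)
    # Zipf prior q(c_i) = (1/i^s) / H_{k,s}; q.sum() is the harmonic number.
    q = 1.0 / np.arange(1, k + 1) ** s
    q /= q.sum()
    # Sort p descending so the largest cluster matches rank 1 (assumption).
    p = np.sort(p)[::-1]
    return float(np.sum(p * np.log(p / q)))
```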
Loss Evolution
The search combines tournament selection with CMA-ES to evolve the loss weights.
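A toy sketch of tournament selection over weight vectors, keeping each λ in [0, 1]. The actual method additionally uses CMA-ES and evaluates fitness by training a network per candidate, both omitted here; `evolve_lambdas` and all hyperparameters are illustrative.

```python
import numpy as np

def evolve_lambdas(fitness_fn, n_weights, pop_size=50, generations=200,
                   tournament_k=5, sigma=0.1):
    """Evolve loss-weight vectors in [0, 1]^n; lower fitness is better."""
    pop = np.random.rand(pop_size, n_weights)
    fit = np.array([fitness_fn(w) for w in pop])
    for _ in range(generations):
        # Tournament selection: fittest of k random candidates is the parent.
        idx = np.random.choice(pop_size, tournament_k, replace=False)
        parent = pop[idx[np.argmin(fit[idx])]]
        # Gaussian mutation, clipped to keep every lambda in [0, 1].
        child = np.clip(parent + sigma * np.random.randn(n_weights), 0.0, 1.0)
        child_fit = fitness_fn(child)
        # Replace the worst individual if the child improves on it.
        worst = np.argmax(fit)
        if child_fit < fit[worst]:
            pop[worst], fit[worst] = child, child_fit
    return pop[np.argmin(fit)]
```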
