Paper Notes -- ActionVLAD: Learning spatio-temporal aggregation for action classification


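Introduction

This is an action classification paper from last year's CVPR 2017. It is implemented in TensorFlow and comes with pretrained models. The code is available at:
http://rohitgirdhar.github.io/ActionVLAD
The paper extracts features independently across space and time and then aggregates them by pooling, using a VLAD-style pooling method that is trained end to end. It addresses two main questions:
1. How to aggregate features across video frames to represent the whole video.
2. How to combine the information from the different streams of a multi-stream network (e.g. two-stream).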

Motivation

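Although an entire video carries a single action label, it may contain several distinct sub-action features. For example, basketball shoot involves hoop, dribbling, jump, group, throw, ball, running and more. We therefore want a way to integrate these multiple sub-action features into one representation of the entire video.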

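Two-stream video architecture

To address the problem raised in the motivation above, the aggregation layer introduces a set of parameters called action words. They can be understood as the cluster centers of the sub-action features.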

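Trainable spatio-temporal aggregation layer

Let $x_{i,t} \in R^D$ be a D-dimensional local feature descriptor extracted at location $i \in \{1...N\}$ of frame $t \in \{1...T\}$ of a video. The descriptor space $R^D$ is partitioned into K action words, represented by anchor points $\{c_k\}$. Each descriptor $x_{i,t}$ is assigned to a cluster center, the vector $x_{i,t}-c_k$ is the residual between the descriptor and that anchor, and the residuals are accumulated over the entire video: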
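$$V[j,k] = \sum_{t=1}^{T}\sum_{i=1}^{N} \frac{e^{-\alpha\|x_{i,t}-c_k\|^2}}{\sum_{k'} e^{-\alpha\|x_{i,t}-c_{k'}\|^2}} \left(x_{i,t}[j]-c_k[j]\right)$$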
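Here $\alpha$ is a hyperparameter (it controls how sharply each descriptor is assigned to its nearest center; a large value approaches hard assignment), and j indexes the j-th of the D dimensions. The output V is a matrix holding one D-dimensional descriptor for each of the K cluster centers. After normalization it is flattened into a single descriptor $v \in R^{KD}$ that represents the whole video.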
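The aggregation described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' TensorFlow code: the function name `action_vlad` is made up for this note, and the normalization step (per-cluster L2 normalization followed by a global L2 normalization) is an assumption, since the notes only say the output is "normalized".

```python
import numpy as np

def action_vlad(x, centers, alpha=1000.0):
    """Aggregate local descriptors of one video into a single vector.

    x:       (T, N, D) descriptors, T frames with N locations each
    centers: (K, D) action words (anchor points c_k)
    returns: (K * D,) video-level descriptor
    """
    T, N, D = x.shape
    desc = x.reshape(T * N, D)                        # pool over frames and locations
    diff = desc[:, None, :] - centers[None, :, :]     # residuals x - c_k, shape (M, K, D)
    sq_dist = np.sum(diff ** 2, axis=2)               # squared distances, shape (M, K)

    # Soft assignment: softmax over clusters of -alpha * squared distance.
    logits = -alpha * sq_dist
    logits -= logits.max(axis=1, keepdims=True)       # for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted sum of residuals per cluster; stored as V[k, j] (transposed vs. V[j, k]).
    V = np.einsum("mk,mkd->kd", weights, diff)

    # Assumed normalization: per-cluster L2, then flatten, then global L2.
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    v = V.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)
```

For example, `action_vlad(np.random.rand(25, 196, 512), np.random.rand(64, 512))` would return a vector of length 64 * 512; those sizes are illustrative only.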

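The overall architecture is built on the two-stream framework, with just one addition: a trainable ActionVLAD layer.
[Figure: comparison of different pooling strategies]
This compares how the different pooling strategies model a video. Max pooling and average pooling can each attend to only part of the sub-action features, whereas ActionVLAD aggregates the descriptors of the different sub-action features so that they jointly describe the video.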
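A toy example may make the pooling comparison more concrete. The sketch below uses hard assignment (the large-$\alpha$ limit) and made-up 2-D descriptors; `avg_pool`, `hard_vlad` and the data are hypothetical and only illustrate that two different videos can share the same average while their per-cluster residuals differ.

```python
import numpy as np

def avg_pool(desc):
    """Average pooling over all descriptors of a video: (M, D) -> (D,)."""
    return desc.mean(axis=0)

def hard_vlad(desc, centers):
    """VLAD with hard assignment: sum residuals to the nearest center."""
    d2 = ((desc[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)                 # nearest center per descriptor
    V = np.zeros_like(centers)
    for k in range(len(centers)):
        members = desc[assign == k]
        if len(members):
            V[k] = (members - centers[k]).sum(axis=0)
    return V.reshape(-1)

# Two hypothetical action words and two toy "videos".
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
video_a = np.array([[-1.0, 0.5], [1.0, -0.5]] * 2)   # descriptors near both centers
video_b = np.array([[0.1, 0.0], [-0.1, 0.0]] * 2)    # descriptors between the centers

print(avg_pool(video_a), avg_pool(video_b))                      # both average to the origin
print(hard_vlad(video_a, centers), hard_vlad(video_b, centers))  # clearly different
```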

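Where to place the aggregation layer

1. FC layer: apply the operation described above directly to the D-dimensional descriptors.
2. Conv layer: the $h \times w \times c$ feature map first has to be unrolled into $h \times w$ local descriptors of dimension c each.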
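The two placements differ only in how the per-frame features are turned into the descriptor set $x_{i,t}$. A minimal sketch, assuming channels-last feature maps; the helper names and the sizes in the usage lines are illustrative, not taken from the notes:

```python
import numpy as np

def conv_maps_to_descriptors(feature_maps):
    """(T, h, w, c) conv feature maps -> (T, h*w, c): N = h*w descriptors of size D = c per frame."""
    T, h, w, c = feature_maps.shape
    return feature_maps.reshape(T, h * w, c)

def fc_to_descriptors(fc_features):
    """(T, D) FC features -> (T, 1, D): a single descriptor per frame (N = 1)."""
    return fc_features[:, None, :]

# Illustrative sizes only.
conv_desc = conv_maps_to_descriptors(np.zeros((25, 14, 14, 512)))
fc_desc = fc_to_descriptors(np.zeros((25, 4096)))
print(conv_desc.shape, fc_desc.shape)
```

Either result has the (T, N, D) layout that a VLAD-style aggregation over frames and locations expects.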

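Two-stream fusion methods

[Figure: three strategies for fusing the two streams]
The first is a direct concat strategy. The second and third are both average strategies and differ only slightly in where the fusion takes place.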
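As a rough illustration of the difference between concatenating and averaging, the sketch below fuses two streams either at the feature level or at the score level. It is a generic example with hypothetical function names and an assumed equal default weight, and it does not claim to reproduce the exact three variants in the paper's figure.

```python
import numpy as np

def concat_fusion(rgb_desc, flow_desc):
    """Feature-level fusion: concatenate per-location features of the two streams.

    rgb_desc: (M, D_rgb), flow_desc: (M, D_flow) -> (M, D_rgb + D_flow)
    Assumes the two streams are spatially aligned location by location.
    """
    return np.concatenate([rgb_desc, flow_desc], axis=1)

def average_fusion(rgb_scores, flow_scores, w_rgb=0.5):
    """Score-level fusion: weighted average of the two streams' class scores."""
    return w_rgb * rgb_scores + (1.0 - w_rgb) * flow_scores

fused_features = concat_fusion(np.ones((4, 3)), np.zeros((4, 2)))
fused_scores = average_fusion(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(fused_features.shape, fused_scores)
```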

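Implementation details

K=64, $\alpha$=1000, dropout=0.5, T=25, gradient clipping threshold 5, and $\epsilon=10^{-4}$ for Adam. Training is done in two stages. In the first stage, ActionVLAD is initialized with k-means and kept fixed, with a learning rate of 0.01. In the second stage, ActionVLAD is trained as well, with a learning rate of $10^{-4}$.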
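The k-means initialization and the two-stage schedule can be sketched as below. This is a simplified NumPy illustration under stated assumptions: `kmeans_init`, `clip_by_norm` and `STAGES` are hypothetical names, the k-means is a plain Lloyd iteration, and clipping by L2 norm is assumed because the notes only give the threshold 5.

```python
import numpy as np

def kmeans_init(desc, K, iters=10, seed=0):
    """Plain Lloyd k-means over (M, D) descriptors, used to initialize the K action words."""
    rng = np.random.default_rng(seed)
    centers = desc[rng.choice(len(desc), size=K, replace=False)].copy()
    for _ in range(iters):
        d2 = ((desc[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for k in range(K):
            members = desc[assign == k]
            if len(members):               # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers

def clip_by_norm(grad, max_norm=5.0):
    """Rescale a gradient so that its L2 norm does not exceed max_norm (assumed clipping rule)."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

# Two-stage schedule as described in the notes; the dict keys are hypothetical.
STAGES = [
    {"train_action_vlad": False, "learning_rate": 1e-2},  # stage 1: centers fixed after k-means
    {"train_action_vlad": True, "learning_rate": 1e-4},   # stage 2: centers trained as well
]
```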

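Results

[Figure: visualization of the action words learned by ActionVLAD]
This visualizes the action words learned by the ActionVLAD model. The action word in (b) clearly corresponds to hair.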

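Personal thoughts

1. How are the locations chosen? Are they picked at random and resized for training, or is something like selective selection used?
2. How are the action-word cluster centers visualized? How can a D-dimensional vector be turned into a visualization like this?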
