
Two-Stream SR-CNNs for Action Recognition in Videos


Paper: http://www.bmva.org/bmvc/2016/papers/paper108/index.html
Code: https://github.com/yifita/action.sr_cnn
Authors' homepage: http://wanglimin.github.io/

results

  • UCF101: 92.6%
  • JHMDB (split 1): 53.77%

framework

The input is still two-stream. After a Faster R-CNN pass, different regions are obtained and divided into three categories: scene, person, and object; each category is then fed into the network as a separate input for training.

[Figure: overall framework of the two-stream SR-CNN]

Standard convolutional and pooling layers are initially applied to the inputs. Instead of replacing the last pooling layer, we introduce a RoI pooling [2] layer whose features are separated into parallel fully connected layers (called channels), using bounding boxes from a Faster R-CNN object detector [18], as described in subsection 3.2.

Each channel produces its own score. Since a frame can contain multiple objects, the authors use MIL (Multiple Instance Learning) to aggregate the most informative instances. Finally, all channel scores are combined in a fusion layer to produce the prediction.
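To make the channel structure and the MIL step concrete, here is a minimal PyTorch sketch (the names, shapes, and the `roi_pool` call below are illustrative, not the authors' released implementation): shared conv features are RoI-pooled per region, each semantic channel has its own fully connected head, and a channel's score is the max over its instances.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class SRChannelHead(nn.Module):
    """One semantic channel: FC layers on RoI-pooled features -> class scores."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(), nn.Dropout(0.8),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.8),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.fc(x)

def sr_cnn_scores(conv_feats, rois_by_channel, heads,
                  spatial_scale=1 / 16, output_size=(7, 7)):
    """Per-channel class scores with MIL (max over instances).

    conv_feats:      (1, C, H, W) map from the shared VGG16 conv layers.
    rois_by_channel: dict name -> (N, 5) tensor [batch_idx, x1, y1, x2, y2]
                     holding the scene / person / object regions.
    heads:           dict name -> SRChannelHead.
    """
    scores = {}
    for name, rois in rois_by_channel.items():
        feats = roi_pool(conv_feats, rois, output_size, spatial_scale)
        s = heads[name](feats.flatten(1))   # (N, num_classes)
        # MIL: keep the most informative instance per class in this channel
        scores[name] = s.max(dim=0).values  # (num_classes,)
    return scores
```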

Fusion

For the fusion strategy, the authors propose four options:

  • Max fusion selects, for each class, the highest score across all channels, i.e. the strongest channel for that class.
  • Sum fusion adds up the score values from all channels, treating each channel equally.
  • Category-wise weighted fusion (Weighted-1) combines channel scores through a weighted sum, learning one weight per (class, channel) pair so that each channel can contribute differently to different classes.
  • Correlation-wise weighted fusion (Weighted-2) additionally mixes scores across classes when computing each output score, implicitly encoding correlations between categories.

For a system with L classes and C channels, the number of fusion weights is $L \times C$ for Weighted-1 and $L \times L \times C$ for Weighted-2. All fusion weights are trained jointly with the main network parameters through back-propagation.
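As a minimal sketch, the four rules applied to a score matrix `S` of shape `(C, L)` (C channels, L classes); the learnable parameter shapes match the weight counts above, and all names are illustrative:

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Fuse per-channel class scores S of shape (C, L) into one (L,) vector."""
    def __init__(self, num_channels, num_classes, mode="sum"):
        super().__init__()
        self.mode = mode
        if mode == "weighted1":
            # category-wise: one weight per (class, channel) -> L*C weights
            self.w = nn.Parameter(torch.ones(num_classes, num_channels))
        elif mode == "weighted2":
            # correlation-wise: every (class, channel) score feeds every
            # output class -> L*L*C weights
            self.w = nn.Parameter(torch.zeros(num_classes, num_classes,
                                              num_channels))

    def forward(self, S):                       # S: (C, L)
        if self.mode == "max":                  # strongest channel per class
            return S.max(dim=0).values
        if self.mode == "sum":                  # channels treated equally
            return S.sum(dim=0)
        if self.mode == "weighted1":            # per-class channel weights
            return (self.w * S.t()).sum(dim=1)
        # weighted2: output class l mixes scores of all classes k in all
        # channels c, encoding cross-class correlations
        return torch.einsum("lkc,ck->l", self.w, S)
```

Because the weights are ordinary `nn.Parameter`s, they are updated by back-propagation together with the rest of the network, as described above.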

Semantic channels

Detector

Object detection. Detecting objects in video frames is challenging, mainly because of poor image quality and motion blur. The authors therefore introduce criteria to discard a detection when any of the following holds (a sketch of these rules follows the list):

  • its confidence is below a preset threshold (0.1)
  • its box side length is smaller than the set value of 20 pixels
  • it has no overlapping region with the person
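A hedged sketch of these rules; I read the third rule as requiring nonzero overlap with a detected person box, and the helper names are illustrative:

```python
def keep_object(det, person_boxes, conf_thresh=0.1, min_side=20):
    """Return True if a Faster R-CNN object detection survives filtering.

    det: dict with 'score' and 'box' = (x1, y1, x2, y2), in pixels.
    person_boxes: person boxes detected in the same frame.
    """
    x1, y1, x2, y2 = det["box"]
    if det["score"] < conf_thresh:           # rule 1: low confidence
        return False
    if min(x2 - x1, y2 - y1) < min_side:     # rule 2: side shorter than 20 px
        return False
    # rule 3: must overlap at least one person region
    return any(overlap_area(det["box"], p) > 0 for p in person_boxes)

def overlap_area(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)
```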

Person detection. The person detections similarly need post-processing (see the sketch after this list):

  • filtering out wrong or redundant detections
  • repairing frames where the detection is broken or missing
  • refining the bounding box positions
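One plausible reading of the "repair missing frames" step is to linearly interpolate the person box between the nearest frames that do have a detection; the sketch below illustrates that idea and is not the authors' exact procedure:

```python
def interpolate_person_track(boxes):
    """Fill gaps in a per-frame person track by linear interpolation.

    boxes: list where boxes[t] is (x1, y1, x2, y2) or None when frame t
    has no detection. Gaps between two detections are filled; leading and
    trailing gaps are left as None.
    """
    boxes = list(boxes)
    detected = [t for t, b in enumerate(boxes) if b is not None]
    for lo, hi in zip(detected, detected[1:]):
        for t in range(lo + 1, hi):
            alpha = (t - lo) / (hi - lo)
            boxes[t] = tuple((1 - alpha) * a + alpha * b
                             for a, b in zip(boxes[lo], boxes[hi]))
    return boxes
```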

Implementation details

The base network is VGG16, following the standard two-stream configuration. The input resolution is 256×340. The spatial stream takes RGB frames to capture static appearance, while the temporal stream stacks optical flow fields over ten consecutive frames. Corner cropping is used for data augmentation.
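To illustrate corner cropping (crop origins restricted to the four corners and the center of the 256×340 input, rather than arbitrary random positions), a small hedged sketch:

```python
import random

def corner_crop_origin(img_w=340, img_h=256, crop_w=224, crop_h=224):
    """Pick one of the 4 corners or the center as the crop origin.

    Returns (x, y) of the top-left corner of a crop_w x crop_h crop.
    """
    candidates = [
        (0, 0),                                          # top-left
        (img_w - crop_w, 0),                             # top-right
        (0, img_h - crop_h),                             # bottom-left
        (img_w - crop_w, img_h - crop_h),                # bottom-right
        ((img_w - crop_w) // 2, (img_h - crop_h) // 2),  # center
    ]
    return random.choice(candidates)
```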

training

The dropout ratio is set to 0.8 and applied to both fully connected layers. The spatial stream is trained for 1e4 iterations with an initial learning rate of 1e-3, reduced automatically every 4,276 iterations; the temporal stream uses an initial learning rate of 5e-3, halved after about 7,769 steps. The batch size is 256 for both networks.
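The same settings collected into a single config sketch; the field names are illustrative, and the step counts are copied from the note as given:

```python
# Training hyperparameters for the two streams (illustrative field names).
TRAIN_CONFIG = {
    "dropout": 0.8,              # applied to both fully connected layers
    "batch_size": 256,           # used for both networks
    "spatial": {
        "iterations": 10_000,
        "base_lr": 1e-3,
        "lr_decay_step": 4_276,  # lr reduced every this many iterations
    },
    "temporal": {
        "base_lr": 5e-3,
        "lr_halve_step": 7_769,  # lr halved after about this many steps
    },
}
```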

testing

In testing, following [20], 25 frames are sampled evenly from each video. From every frame, 5 crops (four corners plus the center) and their horizontal flips are taken, giving 10 inputs per frame. The final classification for a video is the average of the scores over all sampled frames and crops.
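A sketch of this protocol (PyTorch-flavoured; `crops_and_flips` is an assumed helper that turns one frame into the 10-input batch of 5 crops plus their horizontal flips):

```python
import torch

def video_score(frames, model, crops_and_flips, num_samples=25):
    """Average class scores over 25 evenly spaced frames, 10 crops each."""
    idx = torch.linspace(0, len(frames) - 1, num_samples).long()
    scores = []
    with torch.no_grad():
        for i in idx:
            batch = crops_and_flips(frames[i])   # (10, C, H, W)
            scores.append(model(batch).mean(dim=0))
    return torch.stack(scores).mean(dim=0)       # (num_classes,)
```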

evaluation

See the original paper.
