
Two-Stream SR-CNNs for Action Recognition in Videos


Paper: http://www.bmva.org/bmvc/2016/papers/paper108/index.html
Code: https://github.com/yifita/action.sr_cnn
Authors' homepage: http://wanglimin.github.io/

results

  • UCF101: 92.6%
  • JHMDB (split 1): 53.77%

framework

The input is still two-stream. After a Faster R-CNN pass, different regions are obtained and divided into three categories: scene, person, and object; each category is then fed into the network as a separate input for training.

[Figure: overall framework of the two-stream SR-CNN]

Standard convolutional and pooling layers are initially applied to the inputs. Instead of replacing the last pooling layer, we introduce a RoI pooling [2] layer whose features are separated into parallel fully connected layers (called channels), using bounding boxes from a Faster R-CNN object detector [18], as described in subsection 3.2.

Each channel produces its own score. Since a frame can contain multiple objects, the authors use MIL (Multiple Instance Learning) to aggregate the most informative instances. Finally, all channel scores are combined in a fusion layer to produce the prediction.
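To make the channel structure and the MIL step concrete, here is a minimal PyTorch sketch (the names, shapes, and the `roi_pool` call below are illustrative, not the authors' released implementation): shared conv features are RoI-pooled per region, each semantic channel has its own fully connected head, and a channel's score is the max over its instances.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class SRChannelHead(nn.Module):
    """One semantic channel: FC layers on RoI-pooled features -> class scores."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(), nn.Dropout(0.8),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.8),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.fc(x)

def sr_cnn_scores(conv_feats, rois_by_channel, heads,
                  spatial_scale=1 / 16, output_size=(7, 7)):
    """Per-channel class scores with MIL (max over instances).

    conv_feats:      (1, C, H, W) map from the shared VGG16 conv layers.
    rois_by_channel: dict name -> (N, 5) tensor [batch_idx, x1, y1, x2, y2]
                     holding the scene / person / object regions.
    heads:           dict name -> SRChannelHead.
    """
    scores = {}
    for name, rois in rois_by_channel.items():
        feats = roi_pool(conv_feats, rois, output_size, spatial_scale)
        s = heads[name](feats.flatten(1))   # (N, num_classes)
        # MIL: keep the most informative instance per class in this channel
        scores[name] = s.max(dim=0).values  # (num_classes,)
    return scores
```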

Fusion

For the fusion strategy, the authors propose four options:

  • Max fusion selects, for each class, the highest score across all channels, i.e. the strongest channel for that class.
  • Sum fusion adds up the score values from all channels, treating each channel equally.
  • Category-wise weighted fusion (Weighted-1) combines channel scores through a weighted sum, learning one weight per (class, channel) pair so that each channel can contribute differently to different classes.
  • Correlation-wise weighted fusion (Weighted-2) additionally mixes scores across classes when computing each output score, implicitly encoding correlations between categories.

For a system with L classes and C channels, the number of fusion weights is $L \times C$ for Weighted-1 and $L \times L \times C$ for Weighted-2. All fusion weights are trained jointly with the main network parameters through back-propagation.
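As a minimal sketch, the four rules applied to a score matrix `S` of shape `(C, L)` (C channels, L classes); the learnable parameter shapes match the weight counts above, and all names are illustrative:

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Fuse per-channel class scores S of shape (C, L) into one (L,) vector."""
    def __init__(self, num_channels, num_classes, mode="sum"):
        super().__init__()
        self.mode = mode
        if mode == "weighted1":
            # category-wise: one weight per (class, channel) -> L*C weights
            self.w = nn.Parameter(torch.ones(num_classes, num_channels))
        elif mode == "weighted2":
            # correlation-wise: every (class, channel) score feeds every
            # output class -> L*L*C weights
            self.w = nn.Parameter(torch.zeros(num_classes, num_classes,
                                              num_channels))

    def forward(self, S):                       # S: (C, L)
        if self.mode == "max":                  # strongest channel per class
            return S.max(dim=0).values
        if self.mode == "sum":                  # channels treated equally
            return S.sum(dim=0)
        if self.mode == "weighted1":            # per-class channel weights
            return (self.w * S.t()).sum(dim=1)
        # weighted2: output class l mixes scores of all classes k in all
        # channels c, encoding cross-class correlations
        return torch.einsum("lkc,ck->l", self.w, S)
```

Because the weights are ordinary `nn.Parameter`s, they are updated by back-propagation together with the rest of the network, as described above.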

Semantic channels

Detector

Object detection. Detecting objects in video frames is challenging, mainly because of poor image quality and motion blur. The authors therefore introduce criteria to discard a detection when any of the following holds (a sketch of these rules follows the list):

  • its confidence is below a preset threshold (0.1)
  • its box side length is smaller than the set value of 20 pixels
  • it has no overlapping region with the person
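A hedged sketch of these rules; I read the third rule as requiring nonzero overlap with a detected person box, and the helper names are illustrative:

```python
def keep_object(det, person_boxes, conf_thresh=0.1, min_side=20):
    """Return True if a Faster R-CNN object detection survives filtering.

    det: dict with 'score' and 'box' = (x1, y1, x2, y2), in pixels.
    person_boxes: person boxes detected in the same frame.
    """
    x1, y1, x2, y2 = det["box"]
    if det["score"] < conf_thresh:           # rule 1: low confidence
        return False
    if min(x2 - x1, y2 - y1) < min_side:     # rule 2: side shorter than 20 px
        return False
    # rule 3: must overlap at least one person region
    return any(overlap_area(det["box"], p) > 0 for p in person_boxes)

def overlap_area(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)
```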

Person detection. The person detections similarly need post-processing (see the sketch after this list):

  • filtering out wrong or redundant detections
  • repairing frames where the detection is broken or missing
  • refining the bounding box positions
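One plausible reading of the "repair missing frames" step is to linearly interpolate the person box between the nearest frames that do have a detection; the sketch below illustrates that idea and is not the authors' exact procedure:

```python
def interpolate_person_track(boxes):
    """Fill gaps in a per-frame person track by linear interpolation.

    boxes: list where boxes[t] is (x1, y1, x2, y2) or None when frame t
    has no detection. Gaps between two detections are filled; leading and
    trailing gaps are left as None.
    """
    boxes = list(boxes)
    detected = [t for t, b in enumerate(boxes) if b is not None]
    for lo, hi in zip(detected, detected[1:]):
        for t in range(lo + 1, hi):
            alpha = (t - lo) / (hi - lo)
            boxes[t] = tuple((1 - alpha) * a + alpha * b
                             for a, b in zip(boxes[lo], boxes[hi]))
    return boxes
```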

Implementation details

The base network is VGG16, following the standard two-stream configuration. The input resolution is 256×340. The spatial stream takes RGB frames to capture static appearance, while the temporal stream stacks optical flow fields over ten consecutive frames. Corner cropping is used for data augmentation.
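To illustrate corner cropping (crop origins restricted to the four corners and the center of the 256×340 input, rather than arbitrary random positions), a small hedged sketch:

```python
import random

def corner_crop_origin(img_w=340, img_h=256, crop_w=224, crop_h=224):
    """Pick one of the 4 corners or the center as the crop origin.

    Returns (x, y) of the top-left corner of a crop_w x crop_h crop.
    """
    candidates = [
        (0, 0),                                          # top-left
        (img_w - crop_w, 0),                             # top-right
        (0, img_h - crop_h),                             # bottom-left
        (img_w - crop_w, img_h - crop_h),                # bottom-right
        ((img_w - crop_w) // 2, (img_h - crop_h) // 2),  # center
    ]
    return random.choice(candidates)
```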

training

The dropout ratio is set to 0.8 and applied to both fully connected layers. The spatial stream is trained for 1e4 iterations with an initial learning rate of 1e-3, reduced automatically every 4,276 iterations; the temporal stream uses an initial learning rate of 5e-3, halved after about 7,769 steps. The batch size is 256 for both networks.
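The same settings collected into a single config sketch; the field names are illustrative, and the step counts are copied from the note as given:

```python
# Training hyperparameters for the two streams (illustrative field names).
TRAIN_CONFIG = {
    "dropout": 0.8,              # applied to both fully connected layers
    "batch_size": 256,           # used for both networks
    "spatial": {
        "iterations": 10_000,
        "base_lr": 1e-3,
        "lr_decay_step": 4_276,  # lr reduced every this many iterations
    },
    "temporal": {
        "base_lr": 5e-3,
        "lr_halve_step": 7_769,  # lr halved after about this many steps
    },
}
```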

testing

In testing, following [20], 25 frames are sampled evenly from each video. From every frame, 5 crops (four corners plus the center) and their horizontal flips are taken, giving 10 inputs per frame. The final classification for a video is the average of the scores over all sampled frames and crops.
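A sketch of this protocol (PyTorch-flavoured; `crops_and_flips` is an assumed helper that turns one frame into the 10-input batch of 5 crops plus their horizontal flips):

```python
import torch

def video_score(frames, model, crops_and_flips, num_samples=25):
    """Average class scores over 25 evenly spaced frames, 10 crops each."""
    idx = torch.linspace(0, len(frames) - 1, num_samples).long()
    scores = []
    with torch.no_grad():
        for i in idx:
            batch = crops_and_flips(frames[i])   # (10, C, H, W)
            scores.append(model(batch).mean(dim=0))
    return torch.stack(scores).mean(dim=0)       # (num_classes,)
```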

evaluation

See the original paper.
