[HOI Transformer] End-to-End Human Object Interaction Detection with HOI Transformer (CVPR 2021)

1. Motivation

Existing methods for HOI (human-object interaction) detection are either one-stage or two-stage.

Existing methods either split the HOI task into separate stages of object detection and interaction classification, or introduce a surrogate interaction problem.

This paper aims to apply the transformer to human-object interaction (HOI) detection through an end-to-end architecture.

2. Related Work

2.1 HOI’s Goal

The primary objective of HOI detection is to detect humans and objects while identifying interaction patterns between them. As shown in Figure 1, the comparison comprises three distinct approaches: previous one-stage and two-stage methods, as well as the end-to-end method proposed in this work.

2.2 Two-Stage HOI Detection

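The two-stage strategy splits HOI detection into two modules: object detection and interaction classification. A pre-trained object detector first localizes the humans and objects. Interaction features are then extracted from these localizations, and finally the interaction is classified. This pipeline leads to sub-optimal results: the generated human-object proposals may be of low quality for interaction classification, and processing every human-object proposal pair incurs redundant computation.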
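To make the redundancy concrete, here is a minimal sketch of the pairing step in a two-stage pipeline. The helper name `enumerate_pairs` and the box values are made up for illustration; real detectors and pairing rules differ. It only shows that the number of candidate pairs grows as (number of humans) × (number of objects), and each pair needs its own interaction classification.

```python
from itertools import product

def enumerate_pairs(human_boxes, object_boxes):
    """Pair every detected human with every detected object (hypothetical helper)."""
    return list(product(human_boxes, object_boxes))

# Illustrative detections in (x1, y1, x2, y2) format; the values are invented.
humans = [(10, 20, 110, 220), (200, 30, 290, 240)]
objects = [(50, 150, 120, 210), (210, 100, 260, 160), (300, 10, 380, 90)]

pairs = enumerate_pairs(humans, objects)
print(len(pairs))  # 2 humans x 3 objects = 6 candidate pairs
```

Most of these pairs do not interact at all, which is the wasted computation the notes refer to.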

Separate optimization of the two sub-problems might result in a suboptimal outcome.

2.3 One-Stage HOI Detection

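One-stage methods have also been proposed as an alternative interaction detection method, aiming to handle HOI more efficiently. Their core idea is to describe the interaction through a predefined surrogate. Common implementations in existing work include the following. The first is the union bounding box (Union bbox) approach (UnionDet), which captures the spatial relation between the human and the object. The second follows an idea similar to center-point localization (PPDM): it takes a point between the human and the object as the interaction point. In practice, these keypoints are classified, and features are extracted and fused by a multi-channel convolutional network. This adds some computational overhead in order to preserve accuracy, and a later stage applies a context-attention mechanism to further refine the results. The authors point out, however, that although this improves efficiency to some extent, it cannot cover all possible variations in human pose, so predictions may be biased in certain scenarios.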
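The two surrogates mentioned above can be sketched in a few lines. This is an illustration only, not the UnionDet or PPDM implementation: the function names are hypothetical, boxes are assumed to be in (x1, y1, x2, y2) format, and the interaction point is taken to be the midpoint between the human and object box centers.

```python
def box_center(box):
    """Center (cx, cy) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def union_box(human, obj):
    """Smallest box enclosing both the human box and the object box."""
    return (min(human[0], obj[0]), min(human[1], obj[1]),
            max(human[2], obj[2]), max(human[3], obj[3]))

def interaction_point(human, obj):
    """Midpoint between the two box centers (assumed definition)."""
    hx, hy = box_center(human)
    ox, oy = box_center(obj)
    return ((hx + ox) / 2.0, (hy + oy) / 2.0)

human = (10, 20, 110, 220)   # invented example boxes
obj = (50, 150, 150, 250)

print(union_box(human, obj))          # (10, 20, 150, 250)
print(interaction_point(human, obj))  # (80.0, 160.0)
```

Either surrogate still has to be mapped back to a concrete human box and object box, which is the extra matching step that an end-to-end set prediction avoids.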

2. Contribution

This paper develops the HOI Transformer, which has two main parts: an encoder-decoder Transformer architecture and a quintuple HOI matching loss.

3. Method

3.1 Network Architecture

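Figure 2 shows the HOI Transformer architecture. A CNN extracts the image features. The features are then reduced along the channel dimension, combined with positional encodings, and flattened along the spatial dimensions. The result is fed to the transformer encoder as the queries, keys, and values. The transformer decoder then, in the same way as DETR, transforms N learnable positional embeddings (the HOI queries) into the corresponding output embeddings. Finally, an MLP predicts the HOI instances as 5-tuples.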
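As a rough aid for following the data flow, the sketch below tracks tensor shapes through the pipeline. All numeric defaults (backbone stride 32, 2048 backbone channels, model width 256, 100 queries) are assumptions typical of a DETR-style setup with a ResNet-50 backbone; they are not stated in these notes, and the function name is hypothetical.

```python
import math

def hoi_transformer_shapes(height, width, stride=32, backbone_channels=2048,
                           d_model=256, num_queries=100):
    """Return the (assumed) shape of each intermediate result for one image."""
    h = math.ceil(height / stride)   # feature-map height after the CNN
    w = math.ceil(width / stride)    # feature-map width after the CNN
    return {
        "backbone_feature": (backbone_channels, h, w),  # CNN output
        "reduced_feature": (d_model, h, w),             # after channel reduction
        "encoder_tokens": (h * w, d_model),             # after spatial flattening
        "decoder_output": (num_queries, d_model),       # one embedding per HOI query
    }

shapes = hoi_transformer_shapes(800, 1216)
for name, shape in shapes.items():
    print(name, shape)
```

Each row of `decoder_output` would then go through the MLP heads to produce one candidate 5-tuple.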

The overall structure is essentially the same as DETR, so it is not described in detail here.

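Core components
Encoder module
Decoder module: in cross-attention, the encoder output serves as the Value and Key, and attention is computed against the decoder's HOI queries.
MLP for HOI prediction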
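The cross-attention step in the decoder can be illustrated with a stripped-down, single-head version. This sketch omits the learned projection matrices, multiple heads, and positional encodings that a real transformer layer uses; the function names and toy vectors are invented for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    """Scaled dot-product attention where `memory` (the encoder output)
    acts as both keys and values, and `queries` are the HOI queries."""
    d = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, memory)) for j in range(d)])
    return outputs

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder output: 3 tokens, width 2
hoi_queries = [[0.0, 0.0], [5.0, 0.0]]          # toy HOI queries
out = cross_attention(hoi_queries, memory)
print(out)
```

The all-zero query attends uniformly and returns the mean of the memory tokens, while the second query focuses on tokens whose first component is large.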

3.2 HOI Instance Matching

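The model's predicted probability of a human-object-action instance can be factored using conditional probability. Specifically, when computing how likely the instance is, the human and the corresponding object are treated as two independent events, together with the interaction between them. Under this assumption, the overall prediction reduces to three parts: the probability for the human, the probability for the object, and the probability of the interaction between the two in the given context.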

The matching cost is given by Equation 2.

$L_{match}$ is given by Equation 3 and consists of the classification (cls) costs for h, o, and r and the bounding-box (bbox) costs for h and o.
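One way to write this decomposition is sketched below. The equation itself is not reproduced in these notes, so the weighting symbols $\alpha_j$, $\beta_1$, and $\beta_2$ are assumptions introduced here for illustration; only the split into three cls terms and two bbox terms comes from the text above.

$$L_{match} = \beta_1 \sum_{j \in \{h, o, r\}} \alpha_j L_{cls}^{j} + \beta_2 \sum_{k \in \{h, o\}} L_{box}^{k}$$

Here $L_{cls}^{j}$ is the classification cost for the human, object, or interaction label, and $L_{box}^{k}$ is the box cost for the human or object box.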

The cls cost/loss uses cross-entropy; the bbox cost/loss uses L1 and GIoU loss.
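The individual terms can be sketched in plain Python. This is a minimal illustration, not the paper's code: the function names are hypothetical, boxes are assumed to be non-degenerate and in (x1, y1, x2, y2) format, and the weights 5.0 and 2.0 are borrowed from common DETR-style defaults as an assumption.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy for one prediction: negative log-probability of the true class."""
    return -math.log(probs[target_index])

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def giou(a, b):
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered by the union."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose

def box_cost(pred, target, l1_weight=5.0, giou_weight=2.0):
    """Weighted sum of the L1 distance and the GIoU loss (1 - GIoU); weights are assumed."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))

a = (0.0, 0.0, 1.0, 1.0)
b = (2.0, 0.0, 3.0, 1.0)   # disjoint from a
print(giou(a, a), giou(a, b))
print(box_cost(a, a), box_cost(a, b))
```

Unlike plain IoU, GIoU stays informative for disjoint boxes: it goes negative as the boxes move apart, so the cost still distinguishes near misses from far misses.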

The paper uses the Hungarian algorithm for bipartite matching to find an optimal assignment.
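To show what "optimal assignment" means here, the sketch below finds the minimum-cost one-to-one matching between ground-truth instances and predictions. It deliberately uses brute-force enumeration instead of the Hungarian algorithm: for tiny inputs it yields the same optimal total cost, but it is exponentially slower and only meant for illustration. The cost values and the function name are invented.

```python
from itertools import permutations

def optimal_assignment(cost):
    """cost[i][j] is the cost of matching ground truth i to prediction j.
    Assumes there are at least as many predictions as ground truths.
    Returns (assignment, total) where assignment[i] is the prediction index for ground truth i."""
    n_gt, n_pred = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total

# 2 ground-truth HOI instances, 3 predictions (invented costs).
cost = [
    [0.9, 0.2, 0.7],
    [0.1, 0.8, 0.6],
]
assignment, total = optimal_assignment(cost)
print(assignment)  # [1, 0]: ground truth 0 -> prediction 1, ground truth 1 -> prediction 0
```

Predictions left unmatched (prediction 2 in this example) would be treated as "no interaction" during training in a DETR-style setup.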

4. Experiments

4.1 Datasets

What exactly do the HICO-DET and V-COCO datasets consist of? This part is useful background knowledge.

We evaluate our methods on two benchmark datasets: HICO-DET [5] and V-COCO [11]. HICO-DET contains 47,776 images with over 150,000 human-object interaction pairs, split into 38,118 training images and 9,658 test images (roughly four times as many training images as test images). It covers 600 HOI (human-object interaction) categories built from 117 interactions and 80 object types. These HOI categories are further divided into two subsets by how often they occur in training: a smaller group of rare categories (138) and a larger group of non-rare, or common, categories (462). V-COCO is a subset of MS-COCO [21] with 5,400 training-validation images and 4,946 test images. Each individual is assigned binary labels corresponding to five action categories; however, four of these do not involve associated objects.
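The split sizes quoted above can be cross-checked with a little arithmetic. The dictionary keys below are names chosen for this sketch; the numbers are the ones given in the paragraph.

```python
hico_det = {
    "train_images": 38118,
    "test_images": 9658,
    "rare_categories": 138,
    "non_rare_categories": 462,
}
v_coco = {"trainval_images": 5400, "test_images": 4946}

hico_total_images = hico_det["train_images"] + hico_det["test_images"]
hico_total_categories = hico_det["rare_categories"] + hico_det["non_rare_categories"]
train_test_ratio = hico_det["train_images"] / hico_det["test_images"]
v_coco_total_images = v_coco["trainval_images"] + v_coco["test_images"]

print(hico_total_images)        # 47776
print(hico_total_categories)    # 600
print(round(train_test_ratio, 2))
print(v_coco_total_images)      # 10346
```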

4.2 Comparisons with SOTA

HICO-DET DATASETS

V-COCO DATASETS

4.3 Ablation Study

4.5 Discussion

Generalization

4.4 Qualitative Analysis
