
Paper Notes | Rich feature hierarchies for accurate object detection and semantic segmentation


Authors

Ross Girshick / Jeff Donahue / Trevor Darrell / Jitendra Malik

Abstract

R-CNN: Regions with CNN features. It combines two key insights:
1. apply high-capacity CNNs to bottom-up region proposals in order to localize and segment objects
2. supervised pre-training on an auxiliary task, followed by domain-specific fine-tuning, when labeled training data is scarce

1 Introduction

1.1 from HOG to CNNs

The last decade of progress on visual recognition tasks was based on SIFT and HOG, which we can loosely associate with complex cells in V1, but performance eventually plateaued; recognition likely requires multi-stage processes for computing features.
Fukushima's neocognitron lacked a supervised training algorithm.
LeCun et al. showed that SGD with backpropagation is effective for training CNNs.
CNNs saw heavy use in the 1990s but fell out of fashion with the rise of the SVM; Krizhevsky et al. rekindled interest in CNNs at ILSVRC 2012 (rectifying non-linearities and "dropout" regularization).
This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC than systems based on simpler HOG-like features.

1.2 two problems

This paper focuses on two problems: localizing objects with a deep network, and training a high-capacity model with only a small quantity of annotated detection data.
1. localization methods
- as a regression problem
- sliding window (loses precision; used by OverFeat)
- this paper: "recognition using regions"
2. fine-tuning: supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on the small target dataset (PASCAL)

1.3 efficiency

1.4 dominant error mode

A simple bounding-box regression method significantly reduces mislocalizations, the dominant error mode.

2 Object detection with R-CNN

This system consists of three modules:
1. generate category-independent region proposals
2. extract a fixed-length feature vector from each region with a CNN
3. score each region with a set of class-specific linear SVMs

2.1 module design

2.1.1 region proposals

Some examples: (1) objectness, (2) selective search, (3) category-independent object proposals, (4) constrained parametric min-cuts (CPMC), (5) multi-scale combinatorial grouping, (6) Cireşan et al., ...
We use selective search to enable a controlled comparison with prior detection work.

    J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.

2.1.2 Feature extraction

Regardless of the candidate region's size or aspect ratio, warp all pixels in a tight bounding box around it to the required size (227x227). Prior to warping, dilate the box by p = 16 pixels of surrounding context.
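The dilate-then-warp step can be sketched with plain NumPy. This is a simplification: nearest-neighbor resampling stands in for proper image interpolation, and `warp_proposal`, its argument names, and the zero-fill for out-of-image context are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def warp_proposal(image, box, out_size=227, pad=16):
    """Warp a region proposal to a fixed size with context padding.

    `image` is an HxWxC uint8 array; `box` is (x1, y1, x2, y2).
    The box is first dilated by `pad` pixels on each side, then the
    crop is resized to out_size x out_size with nearest-neighbor
    sampling. Pixels falling outside the image are filled with zeros.
    """
    x1, y1, x2, y2 = box
    x1, y1, x2, y2 = x1 - pad, y1 - pad, x2 + pad, y2 + pad
    h, w = image.shape[:2]
    # Sample a regular grid over the (possibly out-of-bounds) dilated box.
    ys = np.linspace(y1, y2 - 1, out_size).round().astype(int)
    xs = np.linspace(x1, x2 - 1, out_size).round().astype(int)
    out = np.zeros((out_size, out_size, image.shape[2]), dtype=image.dtype)
    valid_y = (ys >= 0) & (ys < h)
    valid_x = (xs >= 0) & (xs < w)
    yy = ys[valid_y][:, None]
    xx = xs[valid_x][None, :]
    out[np.ix_(valid_y, valid_x)] = image[yy, xx]
    return out
```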

2.2 test-time detection

This paper runs selective search on the test image to extract around 2000 region proposals (fast mode). Given all scored regions (SVM scores) in an image, a greedy non-maximum suppression is applied, for each class independently, that rejects a region if it has an intersection-over-union (**IoU**) overlap, larger than a learned threshold, with a higher-scoring selected region.
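As a concrete sketch of per-class greedy NMS (the function names are my own, not from the released R-CNN code):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.3):
    """Greedy NMS for one class: keep the highest-scoring box, drop any
    remaining box whose IoU with it exceeds `thresh`, then repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep
```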

2.2.1 run-time analysis

  1. the CNN shares parameters across all categories (which also makes scaling the number of categories easy)
  2. the feature vectors computed by the CNN are low-dimensional

13 s/image on a GPU, 53 s/image on a CPU.
The feature matrix is 2000x4096 and the SVM weight matrix is 4096xN, where N is the number of classes.
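The per-image, class-specific scoring cost therefore amounts to a single matrix product; a sketch with random stand-in features and weights (the 20-class count assumes PASCAL VOC):

```python
import numpy as np

rng = np.random.default_rng(0)
num_proposals, feat_dim, num_classes = 2000, 4096, 20  # PASCAL VOC has 20 classes

features = rng.standard_normal((num_proposals, feat_dim)).astype(np.float32)
svm_weights = rng.standard_normal((feat_dim, num_classes)).astype(np.float32)
svm_bias = rng.standard_normal(num_classes).astype(np.float32)

# One matrix product scores every proposal for every class at once.
scores = features @ svm_weights + svm_bias
```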

2.3 training

2.3.1 supervised pre-training

Pre-training uses the Caffe CNN library (the resulting network nearly matches the performance of Krizhevsky et al.).

2.3.2 domain-specific fine-tuning

Replace the 1000-way classification layer with a randomly initialized (N+1)-way classification layer, where N is the number of object classes (plus 1 for background).
We treat all region proposals with >= 0.5 IoU overlap with a ground-truth box as positives.
SGD starts at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress without clobbering the initialization.
In each SGD iteration, we uniformly sample 32 positive windows and 96 background windows to construct a mini-batch of size 128.
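The 32-positive / 96-background composition can be sketched as follows (`sample_minibatch` and sampling with replacement are my assumptions; the paper specifies only the batch composition):

```python
import random

def sample_minibatch(positives, negatives, batch_size=128, num_pos=32):
    """Build one fine-tuning mini-batch: 32 windows with IoU >= 0.5
    against a ground-truth box ('positives') and 96 background windows,
    sampled uniformly. Positives are sampled with replacement because
    they are rare relative to background windows."""
    pos = random.choices(positives, k=num_pos)
    neg = random.choices(negatives, k=batch_size - num_pos)
    batch = pos + neg
    random.shuffle(batch)
    return batch
```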

2.3.3 Object category classifiers

Labeling a background region is easy, but how should a region that partially overlaps a car be labeled? This issue is resolved with an IoU overlap threshold of 0.3 (chosen by grid search over {0, 0.1, ..., 0.5}; note the overlap threshold is defined differently during fine-tuning): regions below the threshold are defined as negatives.
Since the training data is too large to fit in memory, we adopt the standard hard negative mining method.
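One round of hard negative selection can be sketched as picking the negatives that violate the SVM margin, hardest first (`mine_hard_negatives` and its `margin`/`max_new` parameters are illustrative, not the exact released implementation):

```python
import numpy as np

def mine_hard_negatives(scores, margin=-1.0, max_new=500):
    """Return indices of negative examples that violate the SVM margin
    (a negative is 'hard' when its score exceeds -1), sorted hardest
    first and capped at max_new."""
    hard = np.flatnonzero(scores > margin)
    return hard[np.argsort(scores[hard])[::-1]][:max_new]
```

Each mining round scores every negative window with the current SVM, adds the returned indices to a cache, and retrains; only the cache, not the full negative set, must fit in memory.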

2.4 Results on PASCAL VOC 2010-2012

VOC 2010: 53.7% mAP; VOC 2011/12: 53.3% mAP.
UVA system (same region proposal algorithm + four-level spatial pyramid SIFT + nonlinear kernel SVM): 35.1% mAP.

2.5 Results on ILSVRC2013

R-CNN: 31.4% mAP
OverFeat: 24.3% mAP

3 Visualization, ablation, and modes of error

3.1 Visualizing learned features

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right.
The figure (omitted here) shows the top-scoring regions for six pool5 units; each pool5 unit has a receptive field of 195x195 pixels.

3.2 Ablation studies

3.2.1 Performance layer-by-layer, without fine-tuning.

  1. Features from fc7 generalize worse than features from fc6; this means 29% of the parameters can be removed without degrading mAP.
  2. pool5 features are computed using only 6% of the CNN's parameters, yet they produce quite good results.
  3. (1) + (2) → much of the CNN's representational power comes from its convolutional layers.
  4. (3) → this finding suggests potential utility in computing a dense feature map (HOG-like) using only the convolutional layers of the CNN. Such a representation would enable experimentation with sliding-window detectors on top of pool5 features.

3.2.2 Performance layer-by-layer, with fine-tuning.

  1. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5.
  2. (1) → suggests that the pool5 features learned from ImageNet are general, and that most of the improvement comes from learning domain-specific non-linear classifiers on top of them.

3.2.3 comparison to recent feature learning methods

See Table 2, rows 8-10.

3.3 Network architectures

We found that the choice of architecture has a large effect on R-CNN detection performance.
1. O-Net (16-layer OxfordNet/VGG: 13 layers of 3x3 convolutions and 5 pooling layers) outperforms T-Net (TorontoNet).
2. A considerable drawback: O-Net's forward pass takes 7 times longer than T-Net's.

3.4 Detection error analysis

CNN features are much more discriminative than HOG. Loose localization likely results from our use of bottom-up region proposals and from the positional invariance learned when pre-training the CNN for whole-image classification.

    D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In ECCV, 2012.

3.5 Bounding-box regression

This simple approach fixes a large number of mislocalized detections, boosting mAP by 3-4 points.

    P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.

4 The ILSVRC2013 detection dataset

4.1 dataset overview

The dataset is split into three sets: train (395,918 images), val (20,121), and test (40,152). Unlike the val and test sets, the train images are not exhaustively annotated.
Our general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples. To use val for both training and validation, it is split into roughly equally sized sets 'val1' and 'val2'.
Because some classes are rare, it is important to produce an approximately class-balanced partition: many candidate splits are generated, and the one with the smallest maximum relative class imbalance is selected.
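The paper produces the split with randomized local search; a simpler stand-in that captures the idea (generate candidate halves, keep the one minimizing the maximum relative class imbalance) might look like:

```python
import random

def relative_imbalance(counts1, counts2):
    """Max over classes of |n1 - n2| / (n1 + n2)."""
    worst = 0.0
    for c in set(counts1) | set(counts2):
        n1, n2 = counts1.get(c, 0), counts2.get(c, 0)
        if n1 + n2:
            worst = max(worst, abs(n1 - n2) / (n1 + n2))
    return worst

def balanced_split(image_classes, trials=100, seed=0):
    """image_classes maps image id -> list of class labels present.
    Try `trials` random half/half splits and keep the one with the
    smallest maximum relative class imbalance."""
    rng = random.Random(seed)
    images = list(image_classes)
    best, best_score = None, float("inf")
    for _ in range(trials):
        rng.shuffle(images)
        half = len(images) // 2
        s1, s2 = images[:half], images[half:]

        def counts(split):
            c = {}
            for img in split:
                for cls in image_classes[img]:
                    c[cls] = c.get(cls, 0) + 1
            return c

        score = relative_imbalance(counts(s1), counts(s2))
        if score < best_score:
            best, best_score = (s1, s2), score
    return best
```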

4.2 Region proposals

On val, selective search (fast mode) resulted in an average of 2403 region proposals per image, with 91.6% recall of all ground-truth bounding boxes (at 0.5 IoU); on PASCAL the corresponding recall is about 98%.

4.3 Training data

training data = val1 + N ground-truth boxes from train → "val1+trainN" (N ∈ {0, 500, 1000})
Training data is required for three procedures in R-CNN:
1. CNN fine-tuning (val1+trainN)
2. detector SVM training (val1+trainN)
3. bounding-box regressor training (val1 only)

4.4 Validation and evaluation

Data-usage choices and the effect of fine-tuning and bounding-box regression were validated on val2 (with the same hyperparameters as on PASCAL).

4.5 Ablation study

  1. no fine-tuning + val1 → 20.9% mAP
  2. no fine-tuning + val1+trainN → 24.1% (N = 500 vs. N = 1000 makes no difference)
  3. fine-tuning + val1 → 26.5% (overfitting due to the small number of positive examples)
  4. fine-tuning + val1+train1k → 29.7%
  5. + bounding-box regression → 31.0%

4.6 Relationship to OverFeat

OverFeat can be seen roughly as a special case of R-CNN:
- selective search **vs.** a multi-scale pyramid of regular square regions
- per-class bounding-box regressors **vs.** a single bounding-box regressor
A notable advantage: OverFeat is about 9x faster than R-CNN.

5 Semantic segmentation

Like O2P, this approach uses CPMC regions, warped to 227x227.
'full' strategy => ignores the region's shape and computes CNN features directly on the warped bounding box
'fg' strategy => computes features only on a region's foreground mask (background pixels are replaced with the image mean, so they are zero after mean subtraction)
Training full+fg takes about 1 hour, vs. 10+ hours for O2P.
P.S. for semantic segmentation, see also "Learning Deconvolution Network for Semantic Segmentation".
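The 'fg' masking trick can be sketched in NumPy (`mask_foreground` is an illustrative name; the point is that mean-replaced background becomes exactly zero after the CNN's mean subtraction):

```python
import numpy as np

def mask_foreground(region, fg_mask, mean_image):
    """'fg' strategy: keep foreground pixels, replace background pixels
    with the dataset mean image so that, after the CNN subtracts the
    mean, the background contributes exactly zero."""
    region = region.astype(np.float32)
    mask = fg_mask[..., None].astype(bool)  # broadcast over channels
    return np.where(mask, region, mean_image)
```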

6 Appendix

6.1 Object proposal transformations

In the figure (omitted here), the top row corresponds to p = 0 pixels of context padding, while the bottom row has p = 16 pixels (3-5 mAP points higher than p = 0).

6.2 Positive vs. negative examples, and softmax

6.2.1 Why IoU 0.5 for fine-tuning but 0.3 for the SVMs?

Hypothesis: the difference in how positives are defined is not fundamentally important; it arises because (1) fine-tuning data is limited, so the looser 0.5 definition yields many more ("jittered") positive examples, and (2) fine-tuning does not emphasize precise localization.

6.2.2 why not softmax?

Swapping the SVMs for the fine-tuned network's softmax layer drops mAP, likely because the softmax classifier was trained on randomly sampled negative examples rather than on the subset of "hard negatives" used for SVM training, and because the fine-tuning positives do not emphasize precise localization.
Still, the small size of the gap suggests it is possible to obtain close to the same level of performance without training SVMs after fine-tuning.

6.3 Bounding-box regression

(P_x, P_y, P_w, P_h) specifies the pixel coordinates of the center of proposal P's bounding box together with P's width and height in pixels; (G_x, G_y, G_w, G_h) is defined analogously for the ground-truth box. The regression targets are:

t_x = (G_x - P_x) / P_w
t_y = (G_y - P_y) / P_h
t_w = log(G_w / P_w)
t_h = log(G_h / P_h)

The predicted box is recovered by inverting this transform; the weights are learned by regularized (ridge) least squares on pool5 features.
Two issues:
1. regularization is important (λ = 1000, set on a validation set)
2. each training proposal P is assigned to the ground truth G it is nearest to (maximum IoU)
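The target transform and its inverse, in center-size coordinates (the ridge-regression fit of pool5 features to these targets is omitted; function names are mine):

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets t = (t_x, t_y, t_w, t_h) from proposal P to
    ground truth G, both given as (x, y, w, h) with (x, y) the center."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return np.array([(Gx - Px) / Pw, (Gy - Py) / Ph,
                     np.log(Gw / Pw), np.log(Gh / Ph)])

def apply_bbox_deltas(P, d):
    """Invert the transform: predict a box from proposal P and deltas d."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return np.array([Pw * dx + Px, Ph * dy + Py,
                     Pw * np.exp(dw), Ph * np.exp(dh)])
```

Applying `apply_bbox_deltas` to a proposal and the targets computed against its ground truth recovers that ground-truth box exactly, which is a handy sanity check for any implementation.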

6.4 Analysis of cross-dataset redundancy

