

One-Stage Detection SOTA - M2Det Summary


M2Det (second read-through)

Beyond scale variation, appearance-complexity variation should also be considered for the object detection task, because object instances of similar size can look quite different.

A deeper level learns features for objects with more appearance-complexity variation (e.g., pedestrians), while a shallower level learns features for simpler objects (e.g., traffic lights).

1. Multi-Level FPN (MLFPN)

  1. Based on MLFPN, we propose a single-shot object detector, M2Det, whose name stands for Multi-Level Multi-Scale Detector.

Methodology:

a. Construct the base feature: two feature maps of different scales extracted by the backbone are fused by FFMv1 into a single base feature (a sketch follows below).
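
The paper only names this step at this point, so here is a minimal PyTorch sketch of the fusion under stated assumptions: the class name, channel sizes, and nearest-neighbor upsampling are illustrative, not the official configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFMv1(nn.Module):
    """Minimal sketch: fuse a shallow and a deep backbone feature map
    into the single base feature (channel sizes are assumptions)."""
    def __init__(self, c_shallow=512, c_deep=1024, c_out=768):
        super().__init__()
        self.conv_shallow = nn.Conv2d(c_shallow, c_out // 2, kernel_size=3, padding=1)
        self.conv_deep = nn.Conv2d(c_deep, c_out // 2, kernel_size=1)

    def forward(self, f_shallow, f_deep):
        # Upsample the deeper (lower-resolution) map to the shallow map's
        # spatial size, then concatenate along the channel axis.
        d = F.interpolate(self.conv_deep(f_deep), size=f_shallow.shape[-2:],
                          mode="nearest")
        return torch.cat([self.conv_shallow(f_shallow), d], dim=1)
```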

b. The Multi-level Multi-scale feature:

All of the outputs in the decoder of each TUM form the multi-scale features of the current level.


As a whole, the outputs of the stacked TUMs form the multi-level multi-scale features: the front TUM mainly provides shallow-level features, the middle TUM provides medium-level features, and the back TUM provides deep-level features.
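
A minimal sketch of one TUM, assuming a stride-2 conv encoder and a sum-based decoder; the channel width, number of scales, and exact layers are illustrative, not the paper's configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class TUM(nn.Module):
    """Minimal sketch of a Thinned U-shape Module: a stride-2 encoder
    followed by a decoder that upsamples and sums, returning every
    decoder level as one multi-scale feature set."""
    def __init__(self, channels=256, num_scales=6):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_scales - 1)
        )
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 1) for _ in range(num_scales)
        )

    def forward(self, x):
        # Encoder: halve the resolution at each step, keeping every map.
        enc = [x]
        for conv in self.down:
            enc.append(F.relu(conv(enc[-1])))
        # Decoder: walk back up, summing each upsampled map with the
        # matching encoder map; emit every decoder level as an output.
        outs = [self.smooth[-1](enc[-1])]
        cur = enc[-1]
        for i in range(len(enc) - 2, -1, -1):
            cur = enc[i] + F.interpolate(cur, size=enc[i].shape[-2:], mode="nearest")
            outs.append(self.smooth[i](cur))
        return outs[::-1]  # largest (finest) scale first
```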

c. Scale-wise Feature Aggregation Module:

Simple concatenation operations are not adaptive enough, so in the second stage we introduce a channel-wise attention module to encourage features to focus on the channels from which they benefit most.
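
A sketch of that second stage, assuming the standard SE-style squeeze-and-excitation block; the reduction ratio is an assumed hyperparameter.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of SFAM's second stage: reweight the channels of the
    features concatenated across levels at one scale."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Squeeze: global average pool to a per-channel descriptor;
        # excite: gates in (0, 1) rescale each input channel.
        w = self.fc(x.mean(dim=(2, 3)))
        return x * w[:, :, None, None]
```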


Detection:

At the detection stage, we add two convolution layers to each of the six pyramidal features, for location regression and classification respectively. The detection scale ranges of the default boxes on the six feature maps follow the settings of the original SSD.
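
A sketch of such a detection head; a single head shared across the six levels is a simplification here, and num_classes and the 3x3 kernels are assumptions.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """One regression conv and one classification conv, applied to
    each pyramidal feature."""
    def __init__(self, channels, num_anchors=6, num_classes=81):
        super().__init__()
        self.loc = nn.Conv2d(channels, num_anchors * 4, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(channels, num_anchors * num_classes, kernel_size=3, padding=1)

    def forward(self, pyramid):
        # Predict (box offsets, class scores) for every pyramid level.
        return [(self.loc(p), self.cls(p)) for p in pyramid]
```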


At each pixel of the pyramidal features, we set six anchors with three aspect ratios in total. Afterward, we use a probability score of 0.05 as the threshold to filter out most anchors with low scores.
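
The filtering step itself reduces to a boolean mask; a sketch, with tensor shapes assumed:

```python
import torch

def filter_low_score_anchors(cls_probs, boxes, threshold=0.05):
    # cls_probs: (N, num_classes) class probabilities, boxes: (N, 4).
    # Keep only anchors whose best class probability clears the threshold.
    keep = cls_probs.max(dim=1).values > threshold
    return cls_probs[keep], boxes[keep]
```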


Handling appearance-complexity variation across object instances:

To verify that the proposed MLFPN can learn effective features for detecting objects with different scales and large appearance variation, we visualize the activation values of the classification conv layers along the scale and level dimensions; one such example is shown in Fig. 6.
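
A sketch of how such a scale-by-level activation grid could be drawn (the function and input layout are assumptions, not the authors' code):

```python
import matplotlib.pyplot as plt

def plot_activation_grid(cls_features):
    # cls_features[l][s]: (C, H, W) activation tensor of the classification
    # conv at level l and scale s; each cell shows the channel-wise mean.
    n_lvl, n_scl = len(cls_features), len(cls_features[0])
    fig, axes = plt.subplots(n_lvl, n_scl, figsize=(2 * n_scl, 2 * n_lvl),
                             squeeze=False)
    for i in range(n_lvl):
        for j in range(n_scl):
            axes[i][j].imshow(cls_features[i][j].mean(dim=0).detach().cpu())
            axes[i][j].axis("off")
    plt.show()
```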


The visualization suggests that: 1) our method learns very effective features to handle scale variation and appearance-complexity variation across object instances; 2) it is necessary to use multi-level features to detect objects with similar sizes.


Conclusion

First, multi-level features (i.e., multiple layers) extracted by the backbone are fused by a Feature Fusion Module (FFMv1) into the base feature.

Second, the base feature is fed into a block of alternating joint Thinned U-shape Modules (TUMs) and Feature Fusion Modules (FFMv2s), and multi-level multi-scale features (i.e., the decoder layers of each TUM) are extracted.

Finally, the extracted multi-level multi-scale features with the same scale (size) are aggregated to construct a feature pyramid for object detection by a Scale-wise Feature Aggregation Module (SFAM).
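
Putting the three steps together, a sketch of the overall forward pass that reuses the hypothetical modules above; the FFMv2 interface (fusing the base feature with the previous TUM's largest output) is also an assumption, and the caller is assumed to configure channel widths consistently.

```python
import torch

def m2det_forward(f_shallow, f_deep, ffm1, ffm2s, tums, sfam_attn):
    base = ffm1(f_shallow, f_deep)          # 1) FFMv1 -> base feature
    level_outs, prev = [], None
    for i, tum in enumerate(tums):
        # 2) Every TUM after the first is fed the base feature fused
        # (by an FFMv2) with the previous TUM's largest-scale output.
        x = base if i == 0 else ffm2s[i - 1](base, prev)
        outs = tum(x)                       # multi-scale features of level i
        level_outs.append(outs)
        prev = outs[0]                      # largest-scale decoder output
    # 3) SFAM: concatenate the same scale across all levels, then apply
    # channel-wise attention to build the final feature pyramid.
    return [
        sfam_attn[s](torch.cat([lo[s] for lo in level_outs], dim=1))
        for s in range(len(level_outs[0]))
    ]
```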

This yields a new state-of-the-art result among one-stage detectors on MS COCO (i.e., AP of 41.0 at a speed of 11.8 FPS with the single-scale inference strategy, and AP of 44.2 with the multi-scale inference strategy).


ref) https://qijiezhao.github.io/imgs/m2det.pdf
