[Object Detection] CVPR 2021 | Sparse R-CNN: End-to-End Object Detection with Learnable Proposals
Table of Contents
- 1. Background
- 2. Motivation
- 3. Method
3.1 Backbone
3.2 Learnable proposal box
3.3 Learnable proposal feature
3.4 Dynamic instance interactive head
3.5 Set prediction loss
- 4. Results
- 5. Code
1. Background
Current object detectors rely heavily on dense candidate boxes: on an H × W feature map, k anchors are preset at every grid cell. This works well, but it brings several problems:
- these methods produce redundant, near-duplicate predictions, so NMS is required as post-processing
- many-to-one label assignment is sensitive to the hand-crafted assignment rules
- final performance depends strongly on the anchor boxes' sizes, aspect ratios, counts, and generation rules
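To get a feel for the scale gap between dense anchors and learned proposals, here is a rough back-of-the-envelope count. The numbers below (input size, strides, k = 3 anchors per cell) are illustrative assumptions, not the paper's exact configuration:

```python
# Rough count of dense anchor candidates vs. Sparse R-CNN's learned proposals.
# Assumed setup: 800x1280 input, 5 pyramid levels with strides 8..128,
# k = 3 anchors per grid cell.

def dense_anchor_count(img_h, img_w, strides, k):
    """Sum of grid cells over all pyramid levels, times k anchors per cell."""
    return sum((img_h // s) * (img_w // s) * k for s in strides)

dense = dense_anchor_count(800, 1280, strides=[8, 16, 32, 64, 128], k=3)
sparse = 100  # learned proposals in Sparse R-CNN
print(dense, sparse, dense // sparse)  # tens of thousands vs. 100
```

Even under these modest assumptions the dense paradigm enumerates hundreds of candidates for every one learned proposal, which is exactly the redundancy that forces NMS.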
2. Motivation
The authors argue that the sparse property comes from two aspects: sparse boxes and sparse features.
- sparse boxes: only a small number of candidate boxes
- sparse features: the feature of each box does not need to interact with every object in the whole image
They therefore propose Sparse R-CNN, shown in Figure 1, in which the object candidate boxes are a small, fixed set of learnable proposals.
In this paper, the authors use N learned proposal boxes for the subsequent classification and regression, shrinking the hundreds of thousands of hand-designed anchors down to roughly 100 learned proposals. This avoids both hand-crafted anchor design and many-to-one label assignment, and no NMS is needed at the end.
3. Method
The central idea of Sparse R-CNN: replace the large set of candidates generated by an RPN with a small set of proposal boxes (100), as shown in Figure 3. The pipeline consists of:
- backbone
- dynamic instance interactive head
- cls head
- reg head
3.1 Backbone
FPN-based ResNet (using P2→P5)
Channels: 256
Other settings: essentially the same as Faster R-CNN
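The FPN levels P2 through P5 have strides 4, 8, 16, 32, which is why the ROIAlign layers in the model dump later in this post use spatial_scale values of 0.25 down to 0.03125 (i.e. 1/stride). A one-liner to confirm the correspondence:

```python
# Stride per FPN level (P2..P5) and the matching ROIAlign spatial_scale (1/stride).
levels = {f"p{i}": 2 ** i for i in range(2, 6)}         # p2: 4, ..., p5: 32
scales = {name: 1.0 / s for name, s in levels.items()}  # p2: 0.25, ..., p5: 0.03125
print(scales)
```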
3.2 Learnable proposal box
Shape: N × 4
Initialization: ranging from 0 to 1 (normalized box coordinates)
Optimization: updated by back-propagation during training
Conceptually, the proposal boxes that Sparse R-CNN learns are statistics of the training set: the locations where objects are most likely to appear. RPN proposals, by contrast, are strongly tied to the current image and provide only a coarse location.
Sparse R-CNN can therefore be seen as an extension along the spectrum of object detectors: fully dense → dense-to-sparse → fully sparse.
3.3 Learnable proposal feature
由于上面 4-d 的proposal box是對目標的簡單直接的描述,提供了粗糙的位置,但丟失了如目標的姿態和形狀等信息。
此處作者提出了另外一個概念——proposal feature(N x d),它是一個高維(例如256)潛在向量,預計用于編碼豐富的實例特征。proposal feature 的數量和 box一樣。
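In the released code these two learnable tables are nn.Embedding(100, 4) and nn.Embedding(100, 256) (visible in the model dump below). A minimal NumPy stand-in, just to make the shapes and initialization concrete:

```python
import numpy as np

# NumPy sketch of Sparse R-CNN's two learnable parameter tables; plain arrays
# stand in for nn.Embedding here, and the random init is illustrative only.
N, d = 100, 256
rng = np.random.default_rng(0)

# Proposal boxes: N x 4 normalized (cx, cy, w, h) in [0, 1]; the paper
# initializes them in this range and lets back-propagation refine them.
proposal_boxes = rng.uniform(0.0, 1.0, size=(N, 4))

# Proposal features: N x d latent vectors meant to encode instance
# characteristics (pose, shape, ...) that a 4-d box cannot express.
proposal_features = rng.standard_normal((N, d))

print(proposal_boxes.shape, proposal_features.shape)
```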
3.4 Dynamic instance interactive head
Given N proposal boxes:
- Sparse R-CNN first uses RoIAlign to extract features for each box
- each box's features are then passed through a prediction head to produce the final predictions
Dynamic interactive head:
- the N proposal boxes are paired with N proposal features
- each RoI feature interacts with its corresponding proposal feature to filter out uninformative bins and output the final object feature
- in its simplest form, the module is 1×1 conv + 1×1 conv + ReLU
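The bullet points above can be sketched numerically for a single proposal. The shapes follow the model dump below (HIDDEN_DIM=256, DIM_DYNAMIC=64, 7×7 = 49 bins); random matrices stand in for the learned weights, so this is a shape-level sketch of the dynamic interaction, not the repo's DynamicConv:

```python
import numpy as np

# One proposal's dynamic interaction: the proposal feature generates two
# "1x1 conv" weight matrices that are applied to the 49 RoI feature bins.
rng = np.random.default_rng(0)
d, d_dyn, bins = 256, 64, 49

roi_feat = rng.standard_normal((bins, d))       # 7*7 RoI bins, 256-d each
prop_feat = rng.standard_normal(d)              # the paired proposal feature

# dynamic_layer: Linear(256 -> 2*64*256 = 32768) generates both parameter sets
w_dyn = rng.standard_normal((d, 2 * d_dyn * d)) * 0.01
params = prop_feat @ w_dyn
param1 = params[: d * d_dyn].reshape(d, d_dyn)  # first 1x1 "conv":  256 -> 64
param2 = params[d * d_dyn:].reshape(d_dyn, d)   # second 1x1 "conv": 64 -> 256

x = np.maximum(roi_feat @ param1, 0)            # (49, 64), ReLU
x = np.maximum(x @ param2, 0)                   # (49, 256), ReLU

# out_layer: Linear(49*256 = 12544 -> 256) yields the final object feature
w_out = rng.standard_normal((bins * d, d)) * 0.01
obj_feat = x.reshape(-1) @ w_out
print(obj_feat.shape)
```

Because param1/param2 are generated *from* the proposal feature, each proposal effectively applies its own instance-specific filter to its own RoI bins, which is what makes the interaction "dynamic".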
3.5 Set prediction loss
Sparse R-CNN applies a set prediction loss over the fixed-size set of classification and regression predictions:
- classification: focal loss
- L1 loss: on the box coordinates
- GIoU loss: on the boxes
- weights: 2 : 5 : 2 (cls : L1 : GIoU)
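As a concrete reference for the GIoU term above, here is a minimal standalone implementation for axis-aligned boxes in (x1, y1, x2, y2) form; it is a sketch for illustration, not the repo's implementation:

```python
# GIoU = IoU - (area of smallest enclosing box not covered by the union) / (that area).
def giou(b1, b2):
    # intersection
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = a1 + a2 - inter
    # smallest enclosing box
    cx1, cy1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    cx2, cy2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c - union) / c

# Per-prediction total, following the 2:5:2 weighting:
#   loss = 2.0 * focal_cls + 5.0 * l1 + 2.0 * (1 - giou)
print(giou((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes -> 1.0
```

Unlike plain IoU, GIoU stays informative (negative) for disjoint boxes, so it still provides a gradient when a proposal does not yet overlap its target.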
4. Results
The authors present two versions of Sparse R-CNN:
- version 1: 100 learned proposal boxes, without random-crop data augmentation
- version 2 (*): 300 learned proposal boxes, with random-crop data augmentation
Compared with DETR, Sparse R-CNN achieves:
- 10× the speed of DETR (Figure 2)
- AP: 45.0 vs. 43.8
- inference speed: 22 FPS vs. 19 FPS
- training schedule: 36 epochs vs. 50 epochs
Behavior of the learned proposal boxes:
- the learned proposal boxes typically cover the whole image, which guarantees recall
- each stage progressively refines the box locations and removes duplicates
- Figure 5 also shows robustness to both sparse and crowded scenes: in sparse scenes the early stages already discard most duplicate boxes, while crowded scenes go through more stages before every object is precisely detected
5. Code
Repository: https://github.com/PeizeSun/SparseR-CNN

```shell
# Training
python projects/SparseRCNN/train_net.py --num-gpus 8 \
    --config-file projects/SparseRCNN/configs/sparsercnn.res50.100pro.3x.yaml

# Evaluation
python projects/SparseRCNN/train_net.py --num-gpus 8 \
    --config-file projects/SparseRCNN/configs/sparsercnn.res50.100pro.3x.yaml \
    --eval-only MODEL.WEIGHTS path/to/model.pth

# Visualization
python demo/demo.py \
    --config-file projects/SparseRCNN/configs/sparsercnn.res50.100pro.3x.yaml \
    --input path/to/images --output path/to/save_images --confidence-threshold 0.4 \
    --opts MODEL.WEIGHTS path/to/model.pth
```

Head structure:
```text
(init_proposal_features): Embedding(100, 256)
(init_proposal_boxes): Embedding(100, 4)
(head): DynamicHead(
  (box_pooler): ROIPooler(
    (level_poolers): ModuleList(
      (0): ROIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=2, aligned=True)
      (1): ROIAlign(output_size=(7, 7), spatial_scale=0.125, sampling_ratio=2, aligned=True)
      (2): ROIAlign(output_size=(7, 7), spatial_scale=0.0625, sampling_ratio=2, aligned=True)
      (3): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=2, aligned=True)
    )
  )
  (head_series): ModuleList(
    (0): RCNNHead(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (inst_interact): DynamicConv(
        (dynamic_layer): Linear(in_features=256, out_features=32768, bias=True)
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (activation): ReLU(inplace=True)
        (out_layer): Linear(in_features=12544, out_features=256, bias=True)
        (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
      (linear1): Linear(in_features=256, out_features=2048, bias=True)
      (dropout): Dropout(p=0.0, inplace=False)
      (linear2): Linear(in_features=2048, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.0, inplace=False)
      (dropout2): Dropout(p=0.0, inplace=False)
      (dropout3): Dropout(p=0.0, inplace=False)
      (cls_module): ModuleList(
        (0): Linear(in_features=256, out_features=256, bias=False)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): ReLU(inplace=True)
      )
      (reg_module): ModuleList(
        (0): Linear(in_features=256, out_features=256, bias=False)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): ReLU(inplace=True)
        (3): Linear(in_features=256, out_features=256, bias=False)
        (4): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (5): ReLU(inplace=True)
        (6): Linear(in_features=256, out_features=256, bias=False)
        (7): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (8): ReLU(inplace=True)
      )
      (class_logits): Linear(in_features=256, out_features=2, bias=True)
      (bboxes_delta): Linear(in_features=256, out_features=4, bias=True)
    )
    (1)-(5): RCNNHead(... same structure as (0); in this dump the last head's
             class_logits is Linear(in_features=256, out_features=80, bias=True) ...)
  )
)
(criterion): SetCriterion(
  (matcher): HungarianMatcher()
)
```

cfg.MODEL.SparseRCNN:

```text
CfgNode({'NUM_CLASSES': 80, 'NUM_PROPOSALS': 100, 'NHEADS': 8, 'DROPOUT': 0.0,
         'DIM_FEEDFORWARD': 2048, 'ACTIVATION': 'relu', 'HIDDEN_DIM': 256,
         'NUM_CLS': 1, 'NUM_REG': 3, 'NUM_HEADS': 6, 'NUM_DYNAMIC': 2,
         'DIM_DYNAMIC': 64, 'CLASS_WEIGHT': 2.0, 'GIOU_WEIGHT': 2.0,
         'L1_WEIGHT': 5.0, 'DEEP_SUPERVISION': True, 'NO_OBJECT_WEIGHT': 0.1,
         'USE_FOCAL': True, 'ALPHA': 0.25, 'GAMMA': 2.0, 'PRIOR_PROB': 0.01})
```

Feature shapes at projects/SparseRCNN/sparsercnn/head.py, line 173:

```text
(Pdb) p features[0].shape
torch.Size([1, 256, 152, 232])
(Pdb) p features[1].shape
torch.Size([1, 256, 76, 116])
(Pdb) p features[2].shape
torch.Size([1, 256, 38, 58])
(Pdb) p features[3].shape
torch.Size([1, 256, 19, 29])
(Pdb) p bboxes.shape
torch.Size([1, 100, 4])
```

pooler:

```text
ROIPooler(
  (level_poolers): ModuleList(
    (0): ROIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=2, aligned=True)
    (1): ROIAlign(output_size=(7, 7), spatial_scale=0.125, sampling_ratio=2, aligned=True)
    (2): ROIAlign(output_size=(7, 7), spatial_scale=0.0625, sampling_ratio=2, aligned=True)
    (3): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=2, aligned=True)
  )
)
```

pro_features.shape: torch.Size([1, 100, 256])
roi_features: (100, 256, 7, 7) → reshaped to (49, 100, 256)

weight_dict:

```text
{'loss_ce': 2.0, 'loss_bbox': 5.0, 'loss_giou': 2.0,
 'loss_ce_0': 2.0, 'loss_bbox_0': 5.0, 'loss_giou_0': 2.0,
 'loss_ce_1': 2.0, 'loss_bbox_1': 5.0, 'loss_giou_1': 2.0,
 'loss_ce_2': 2.0, 'loss_bbox_2': 5.0, 'loss_giou_2': 2.0,
 'loss_ce_3': 2.0, 'loss_bbox_3': 5.0, 'loss_giou_3': 2.0,
 'loss_ce_4': 2.0, 'loss_bbox_4': 5.0, 'loss_giou_4': 2.0}
```

Ground-truth preparation:
```python
# detector.py
if self.training:
    gt_instances = [x["instances"].to(self.device) for x in batched_inputs]
    targets = self.prepare_targets(gt_instances)
    if self.deep_supervision:
        output['aux_outputs'] = [{'pred_logits': a, 'pred_boxes': b}
                                 for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]
    loss_dict = self.criterion(output, targets)
    weight_dict = self.criterion.weight_dict
    for k in loss_dict.keys():
        if k in weight_dict:
            loss_dict[k] *= weight_dict[k]
    return loss_dict
```
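The weight_dict shown above follows directly from the 2/5/2 base weights plus deep supervision: with NUM_HEADS=6 stages, each of the 5 intermediate heads gets its own "_i"-suffixed copy of every loss weight. A small sketch that rebuilds it:

```python
# Rebuild weight_dict: base weights plus one suffixed copy per auxiliary head.
base = {'loss_ce': 2.0, 'loss_bbox': 5.0, 'loss_giou': 2.0}
num_aux = 5  # NUM_HEADS = 6 stages -> 5 auxiliary (intermediate) outputs

weight_dict = dict(base)
for i in range(num_aux):
    weight_dict.update({f'{k}_{i}': v for k, v in base.items()})

print(len(weight_dict))  # 3 base + 15 auxiliary entries = 18
```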