『TensorFlow』SSD Source Code Study, Part 2: The VGG-Based SSD Forward Architecture


Forked project repository: SSD

Adapted from a 集智 (Jizhi) column article.

1. SSD Basics

To detect objects on top of a classifier, the essential idea is to scan the whole image with the classifier and locate where its features fire. The key question is how to scan. For example, we could divide the image into a grid and run the classifier cell by cell, but this approach has several problems:

Problem 1: if a target sits right on the boundary between two cells, the classifier response is not strong enough in either cell, causing a missed detection (a false negative).

Problem 2: if a target is much larger or smaller than a cell, the response within the cell is again not strong enough, also causing missed detections.

For the first problem we can use overlapping windows: if a window is 32x32 pixels, move it by only 8 pixels per step, so it takes four steps to slide completely off the previous window. For the second, we can combine windows of different sizes: scan with 32x32 windows, then with 64x64 windows, then again with 16x16 windows.
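
To make the cost of this brute-force strategy concrete, here is a minimal sketch (pure Python, with a hypothetical classify function standing in for the classifier; not project code) of the overlapping, multi-scale window scan just described. Note that the classifier runs once per window position per scale:

import numpy as np

def sliding_window_scan(image, classify, window_sizes=(16, 32, 64), stride_ratio=4):
    """Scan `image` (H, W, C) with overlapping windows at several scales."""
    detections = []
    height, width = image.shape[:2]
    for win in window_sizes:
        stride = win // stride_ratio          # e.g. a 32x32 window moves 8 px per step
        for top in range(0, height - win + 1, stride):
            for left in range(0, width - win + 1, stride):
                patch = image[top:top + win, left:left + win]
                score = classify(patch)       # one classifier call per window position
                detections.append((top, left, win, score))
    return detections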

But this creates a new problem: to preserve accuracy, we end up scanning the same image far too many times, which badly hurts speed and makes this strategy impossible to run in real time.

To label image features quickly and in real time, many improvements to the overall detect-and-localize algorithm have been proposed.

The most basic idea is to exploit the internal structure of a convolutional network and avoid repeated computation. When a CNN processes an image, its convolutional feature maps already encode grid information at several scales, which effectively performs the multi-scale scan proposed above. As the figure below illustrates, the earlier convolutional layers focus on fine details while the later layers focus more on the whole object:

For problem 1, if an object lies between two cells, neither side alone may respond strongly, but if their basic features can be combined sensibly we do not need to scan again. The later layers increasingly capture the whole object, so for problem 2, even if the target is too large or too small, its features are still present somewhere in the hierarchy. In other words, because a deep network contains several convolutional stages, scanning an image with a CNN effectively rescans it many times at different scales, and both problems can be solved by making sensible use of the network's intermediate feature maps.
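
As a rough back-of-the-envelope illustration (my own sketch, not project code), each cell of a feature map produced with total downsampling stride step can be mapped back onto the image, so the shallow 38x38 map behaves like a fine grid and the deep 10x10 map like a coarse one; the anchor_steps values in the default parameters further below play exactly this role:

def cell_centers(feat_h, feat_w, step, offset=0.5):
    """Image-space centers of the cells of a feature map with downsampling stride `step`."""
    return [((y + offset) * step, (x + offset) * step)
            for y in range(feat_h) for x in range(feat_w)]

# e.g. for a 300x300 input image:
fine = cell_centers(38, 38, step=8)     # fine grid from a shallow layer
coarse = cell_centers(10, 10, step=32)  # coarse grid from a deep layer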

Before SSD, methods such as MultiBox and Fast R-CNN used a two-step strategy: the first step uses a deep network to localize candidate objects, i.e. to produce boxes; classifying what is inside each box is left to a second step. The first-generation YOLO, by contrast, performs detection and localization in a single pass, but its architecture relies on fully connected layers, which have many drawbacks and are gradually being abandoned in modern network designs.

2. Network Structure in the TF_SSD Project

Back to the project. Taking VGG300 (/nets/ssd_vgg_300.py) as an example, the overall idea is: take the first five blocks of the VGG network, append a few extra blocks, then tap the convolutional outputs of several of these layers and run the grid search for object features on them. Inside the function this breaks down into three parts: the original VGG structure, the additional blocks, and the SSD prediction head:

def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_300_vgg'):
    """SSD net definition."""
    # if data_format == 'NCHW':
    #     inputs = tf.transpose(inputs, perm=(0, 3, 1, 2))

    # End_points collect relevant activations for external use.
    # For comparison, the plain slim VGG-16 stack looks like this:
    """
    net = layers_lib.repeat(inputs, 2, layers.conv2d, 64, [3, 3], scope='conv1')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool1')
    net = layers_lib.repeat(net, 2, layers.conv2d, 128, [3, 3], scope='conv2')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool2')
    net = layers_lib.repeat(net, 3, layers.conv2d, 256, [3, 3], scope='conv3')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool3')
    net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv4')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool4')
    net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv5')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool5')
    """
    end_points = {}
    with tf.variable_scope(scope, 'ssd_300_vgg', [inputs], reuse=reuse):
        ########################################################
        # The first five blocks copy the VGG-16 architecture;  #
        # end_points records the intermediate feature maps.    #
        ########################################################
        # —————————————————— Original VGG-16 blocks. ———————————————————————
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        net = slim.max_pool2d(net, [3, 3], stride=1, scope='pool5')  # pool5 changed: 3x3 kernel, stride reduced from 2 to 1

        #####################################################
        # The next six blocks are additional conv layers.   #
        #####################################################
        # ———————————— Additional SSD blocks. ——————————————————————
        # Block 6: let's dilate the hell out of it!
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)
        # Block 7: 1x1 conv.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)

        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net

        ###########################################################
        # For every selected feature layer, build the SSD head;   #
        # collect the per-layer results into lists and return     #
        # them to the loss / optimization code.                   #
        ###########################################################
        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        # feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_multibox_layer(end_points[layer],
                                          num_classes,
                                          anchor_sizes[i],
                                          anchor_ratios[i],
                                          normalizations[i])
                """
                The number of boxes per location equals len(anchor_sizes[i]) + len(anchor_ratios[i]):
                anchor_sizes=[(21., 45.),
                              (45., 99.),
                              (99., 153.),
                              (153., 207.),
                              (207., 261.),
                              (261., 315.)]
                anchor_ratios=[[2, .5],
                               [2, .5, 3, 1./3],
                               [2, .5, 3, 1./3],
                               [2, .5, 3, 1./3],
                               [2, .5],
                               [2, .5]]
                normalizations=[20, -1, -1, -1, -1, -1]
                """
            predictions.append(prediction_fn(p))  # prediction_fn=slim.softmax
            logits.append(p)
            localisations.append(l)
        return predictions, localisations, logits, end_points


ssd_net.default_image_size = 300
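
Before moving on, here is a minimal sketch of my own (not code from the repository) of how the forward graph might be built and inspected, assuming the module-level imports of ssd_vgg_300.py (slim = tf.contrib.slim) and the SSDNet class and ssd_arg_scope shown further below:

import tensorflow as tf

image = tf.placeholder(tf.float32, shape=(None, 300, 300, 3), name='image')
with slim.arg_scope(ssd_arg_scope()):
    predictions, localisations, logits, end_points = ssd_net(image, is_training=False)

# One (softmax scores, box offsets) pair per entry of feat_layers.
for layer, p, l in zip(SSDNet.default_params.feat_layers, predictions, localisations):
    print(layer, p.get_shape().as_list(), l.get_shape().as_list())
# e.g. block4 -> [None, 38, 38, 4, 21] and [None, 38, 38, 4, 4]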

At the end of the file, the ssd_arg_scope function is provided to constrain the hyperparameter defaults of the layers in the network; its usage is shown in the script's header docstring:

Usage:
    with slim.arg_scope(ssd_vgg.ssd_vgg()):
        outputs, end_points = ssd_vgg.ssd_vgg(inputs)
def ssd_arg_scope(weight_decay=0.0005, data_format='NHWC'):
    """Defines the VGG arg scope.

    Args:
      weight_decay: The l2 regularization coefficient.

    Returns:
      An arg_scope.
    """
    with slim.arg_scope([slim.conv2d, slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        weights_regularizer=slim.l2_regularizer(weight_decay),
                        weights_initializer=tf.contrib.layers.xavier_initializer(),
                        biases_initializer=tf.zeros_initializer()):
        with slim.arg_scope([slim.conv2d, slim.max_pool2d],
                            padding='SAME',
                            data_format=data_format):
            with slim.arg_scope([custom_layers.pad2d,
                                 custom_layers.l2_normalization,
                                 custom_layers.channel_to_last],
                                data_format=data_format) as sc:
                return sc

a. Hyperparameter settings

In the original code the hyperparameters are given as attributes of a class. We are not concerned with the rest of that class here; only the part that defines the hyperparameters is extracted below, to help understand the network above.

SSDParams = namedtuple('SSDParameters', ['img_shape',
                                         'num_classes',
                                         'no_annotation_label',
                                         'feat_layers',
                                         'feat_shapes',
                                         'anchor_size_bounds',
                                         'anchor_sizes',
                                         'anchor_ratios',
                                         'anchor_steps',
                                         'anchor_offset',
                                         'normalizations',
                                         'prior_scaling'])


class SSDNet(object):
    default_params = SSDParams(
        img_shape=(300, 300),
        num_classes=21,
        no_annotation_label=21,
        feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
        feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
        anchor_size_bounds=[0.15, 0.90],
        # anchor_size_bounds=[0.20, 0.90],
        anchor_sizes=[(21., 45.),
                      (45., 99.),
                      (99., 153.),
                      (153., 207.),
                      (207., 261.),
                      (261., 315.)],
        anchor_ratios=[[2, .5],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5],
                       [2, .5]],
        anchor_steps=[8, 16, 32, 64, 100, 300],
        anchor_offset=0.5,
        normalizations=[1, -1, -1, -1, -1, -1],  # whether the SSD head first L2-normalizes the layer along the channel axis
        prior_scaling=[0.1, 0.1, 0.2, 0.2])
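
With these defaults, the number of anchors per feature-map cell is len(anchor_sizes[i]) + len(anchor_ratios[i]). A quick throwaway calculation of my own (not project code) recovers the 8732 default boxes quoted in the SSD300 paper:

feat_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
sizes_len = [2, 2, 2, 2, 2, 2]    # len(anchor_sizes[i]): (min, max) per layer
ratios_len = [2, 4, 4, 4, 2, 2]   # len(anchor_ratios[i])

total = sum(h * w * (s + r)
            for (h, w), s, r in zip(feat_shapes, sizes_len, ratios_len))
print(total)  # 8732 = 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4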

b. The SSD prediction head

        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        # feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_multibox_layer(end_points[layer],   # <----- the SSD head
                                          num_classes,
                                          anchor_sizes[i],
                                          anchor_ratios[i],
                                          normalizations[i])
            predictions.append(prediction_fn(p))  # prediction_fn=slim.softmax
            logits.append(p)
            localisations.append(l)
        return predictions, localisations, logits, end_points

At the end of the network architecture, new convolutions are attached to the selected feature layers (the code above); the function that does this is the following:

def tensor_shape(x, rank=3):
    """Returns the dimensions of a tensor.

    Args:
      x: A N-D Tensor.
    Returns:
      A list of dimensions. Dimensions that are statically known are python
      integers, otherwise they are integer scalar tensors.
    """
    if x.get_shape().is_fully_defined():
        return x.get_shape().as_list()
    else:
        # get_shape() return value; with_rank acts like an assert on the rank
        static_shape = x.get_shape().with_rank(rank).as_list()
        # tf.shape returns a tensor; num means "The length of the dimension `axis`", axis defaults to 0
        dynamic_shape = tf.unstack(tf.shape(x), num=rank)
        # list: statically known dims as ints, unknown dims as tensors
        return [s if s is not None else d
                for s, d in zip(static_shape, dynamic_shape)]


def ssd_multibox_layer(inputs,
                       num_classes,
                       sizes,
                       ratios=[1],
                       normalization=-1,
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions."""
    net = inputs
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    num_anchors = len(sizes) + len(ratios)

    # Location.
    num_loc_pred = num_anchors * 4                       # four coordinates per box
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')             # each output channel is one coordinate of one box
    # force NHWC layout
    loc_pred = custom_layers.channel_to_last(loc_pred)
    # N,H,W,(num_anchors*4) -> N,H,W,num_anchors,4
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])

    # Class prediction.
    num_cls_pred = num_anchors * num_classes             # every box scores every class
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')             # each output channel is one class score of one box
    # force NHWC layout
    cls_pred = custom_layers.channel_to_last(cls_pred)
    # N,H,W,(num_anchors*num_classes) -> N,H,W,num_anchors,num_classes
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
    return cls_pred, loc_pred

Depending on the normalization parameter, the feature layer may first be L2-normalized (along the channel dimension C); the details are covered in the next subsection.

Then two convolutions are applied in parallel to the selected feature layer: one with num_anchors×4 output channels and one with num_anchors×num_classes output channels.

The outputs of the two convolutions are then each expanded by one dimension, reshaping them to [N, H, W, num_anchors, 4] and [N, H, W, num_anchors, num_classes] respectively.

At this point the meaning of the network function's return values is clear: for each selected layer, the class probabilities of the boxes output by the SSD head, the coordinate corrections of those boxes, the raw class logits of those boxes, and the end_points dict of all intermediate layers.
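
As a concrete shape example (a NumPy sketch of my own, using block4's numbers: a 38x38 map, 4 anchors per cell, 21 classes), the reshape only splits the channel axis:

import numpy as np

N, H, W = 1, 38, 38
num_anchors, num_classes = 4, 21

# Channel counts of the two 3x3 convolutions on block4:
loc_channels = num_anchors * 4            # 16
cls_channels = num_anchors * num_classes  # 84

loc_pred = np.zeros((N, H, W, loc_channels))
cls_pred = np.zeros((N, H, W, cls_channels))

# The reshape in ssd_multibox_layer only splits the channel axis:
loc_pred = loc_pred.reshape(N, H, W, num_anchors, 4)            # (1, 38, 38, 4, 4)
cls_pred = cls_pred.reshape(N, H, W, num_anchors, num_classes)  # (1, 38, 38, 4, 21)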

c. custom_layers.l2_normalization: L2 normalization of a feature layer

The feature map is first L2-normalized along the channel dimension (see nn.l2_normalize); then a scale factor is created per channel and used to rescale the channels (the factor is learnable); finally the rescaled features are returned.
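
The operation itself is simple; here is a NumPy sketch of the same idea (my own illustration, assuming NHWC layout): L2-normalize over the channel axis, then multiply by a learnable per-channel scale gamma:

import numpy as np

def l2_normalize_with_scale(x, gamma, eps=1e-12):
    """x: (N, H, W, C) feature map; gamma: (C,) learnable scale, here initialized to ones."""
    sq_sum = np.sum(np.square(x), axis=-1, keepdims=True)   # sum of squares over the channel axis
    x_norm = x / np.sqrt(np.maximum(sq_sum, eps))           # L2-normalize each (n, h, w) channel vector
    return x_norm * gamma                                   # learnable per-channel rescaling

# toy usage on a block4-sized feature map
feat = np.random.randn(1, 38, 38, 512).astype(np.float32)
out = l2_normalize_with_scale(feat, gamma=np.ones(512, dtype=np.float32))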

@add_arg_scope
def l2_normalization(inputs,
                     scaling=False,
                     scale_initializer=init_ops.ones_initializer(),
                     reuse=None,
                     variables_collections=None,
                     outputs_collections=None,
                     data_format='NHWC',
                     trainable=True,
                     scope=None):
    """Implement L2 normalization on every feature (i.e. spatial normalization).

    Should be extended in some near future to other dimensions, providing a more
    flexible normalization framework.

    Args:
      inputs: a 4-D tensor with dimensions [batch_size, height, width, channels].
      scaling: whether or not to add a post scaling operation along the dimensions
        which have been normalized.
      scale_initializer: An initializer for the weights.
      reuse: whether or not the layer and its variables should be reused. To be
        able to reuse the layer scope must be given.
      variables_collections: optional list of collections for all the variables or
        a dictionary containing a different list of collection per variable.
      outputs_collections: collection to add the outputs.
      data_format: NHWC or NCHW data format.
      trainable: If `True` also add variables to the graph collection
        `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
      scope: Optional scope for `variable_scope`.
    Returns:
      A `Tensor` representing the output of the operation.
    """
    with variable_scope.variable_scope(
            scope, 'L2Normalization', [inputs], reuse=reuse) as sc:
        inputs_shape = inputs.get_shape()
        inputs_rank = inputs_shape.ndims
        dtype = inputs.dtype.base_dtype
        # L2-normalize along the channel dimension C
        if data_format == 'NHWC':
            # norm_dim = tf.range(1, inputs_rank-1)
            norm_dim = tf.range(inputs_rank-1, inputs_rank)
            params_shape = inputs_shape[-1:]
        elif data_format == 'NCHW':
            # norm_dim = tf.range(2, inputs_rank)
            norm_dim = tf.range(1, 2)
            params_shape = (inputs_shape[1])

        # Normalize along spatial dimensions.
        outputs = nn.l2_normalize(inputs, norm_dim, epsilon=1e-12)
        # Additional scaling.
        if scaling:
            # fetch the variable collections
            scale_collections = utils.get_variable_collections(
                variables_collections, 'scale')
            # create the scale variable, shape = number of channels C
            scale = variables.model_variable('gamma',
                                             shape=params_shape,
                                             dtype=dtype,
                                             initializer=scale_initializer,
                                             collections=scale_collections,
                                             trainable=trainable)
            if data_format == 'NHWC':
                outputs = tf.multiply(outputs, scale)
            elif data_format == 'NCHW':
                scale = tf.expand_dims(scale, axis=-1)
                scale = tf.expand_dims(scale, axis=-1)
                outputs = tf.multiply(outputs, scale)
                # outputs = tf.transpose(outputs, perm=(0, 2, 3, 1))

        # give the outputs an alias, collect them into the collection, and return the node
        return utils.collect_named_outputs(outputs_collections,
                                           sc.original_name_scope, outputs)

This completes the walkthrough of the network structure. In the next part we look at one of the key techniques of the detection model, the generation of the default (anchor) boxes, and tie it back to this part to understand how the whole SSD network is assembled.

Appendix: Related implementations

custom_layers.channel_to_last: conversion to NHWC

@add_arg_scope  # the layer can be configured via slim.arg_scope
def channel_to_last(inputs,
                    data_format='NHWC',
                    scope=None):
    """Move the channel axis to the last dimension. Allows to
    provide a single output format whatever the input data format.

    Args:
      inputs: Input Tensor;
      data_format: NHWC or NCHW.
    Return:
      Input in NHWC format.
    """
    with tf.name_scope(scope, 'channel_to_last', [inputs]):
        if data_format == 'NHWC':
            net = inputs
        elif data_format == 'NCHW':
            net = tf.transpose(inputs, perm=(0, 2, 3, 1))
        return net

custom_layers.pad2d: 2D tensor padding

@add_arg_scope  # the layer can be configured via slim.arg_scope
def pad2d(inputs,
          pad=(0, 0),
          mode='CONSTANT',
          data_format='NHWC',
          trainable=True,
          scope=None):
    """2D Padding layer, adding a symmetric padding to H and W dimensions.

    Aims to mimic padding in Caffe and MXNet, helping the port of models to
    TensorFlow. Tries to follow the naming convention of `tf.contrib.layers`.

    Args:
      inputs: 4D input Tensor;
      pad: 2-Tuple with padding values for H and W dimensions (padding width);
      mode: Padding mode. C.f. `tf.pad`
      data_format: NHWC or NCHW data format.
    """
    with tf.name_scope(scope, 'pad2d', [inputs]):
        # Padding shape.
        if data_format == 'NHWC':
            paddings = [[0, 0], [pad[0], pad[0]], [pad[1], pad[1]], [0, 0]]
        elif data_format == 'NCHW':
            paddings = [[0, 0], [0, 0], [pad[0], pad[0]], [pad[1], pad[1]]]
        net = tf.pad(inputs, paddings, mode=mode)
        return net
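
The reason blocks 8 and 9 pad explicitly with pad2d before a stride-2 VALID convolution is to reproduce Caffe-style padding arithmetic, as the docstring notes. A quick check of my own, using the standard formula out = floor((in + 2*pad - kernel) / stride) + 1, confirms the feature-map sizes listed in feat_shapes:

def conv_out_size(in_size, kernel, stride, pad):
    return (in_size + 2 * pad - kernel) // stride + 1

# block8: 19x19 input, pad2d(pad=(1, 1)), then 3x3 conv, stride 2, VALID padding
print(conv_out_size(19, kernel=3, stride=2, pad=1))  # 10, matches feat_shapes for block8
# block9: 10x10 -> 5x5
print(conv_out_size(10, kernel=3, stride=2, pad=1))  # 5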

slim's vgg_16

def vgg_16(inputs,
           num_classes=1000,
           is_training=True,
           dropout_keep_prob=0.5,
           spatial_squeeze=True,
           scope='vgg_16'):
    """Oxford Net VGG 16-Layers version D Example.

    Note: All the fully_connected layers have been transformed to conv2d layers.
          To use in classification mode, resize input to 224x224.

    Args:
      inputs: a tensor of size [batch_size, height, width, channels].
      num_classes: number of predicted classes.
      is_training: whether or not the model is being trained.
      dropout_keep_prob: the probability that activations are kept in the dropout
        layers during training.
      spatial_squeeze: whether or not should squeeze the spatial dimensions of the
        outputs. Useful to remove unnecessary dimensions for classification.
      scope: Optional scope for the variables.
    Returns:
      the last op containing the log predictions and end_points dict.
    """
    with variable_scope.variable_scope(scope, 'vgg_16', [inputs]) as sc:
        end_points_collection = sc.original_name_scope + '_end_points'
        # Collect outputs for conv2d, fully_connected and max_pool2d.
        with arg_scope(
                [layers.conv2d, layers_lib.fully_connected, layers_lib.max_pool2d],
                outputs_collections=end_points_collection):
            net = layers_lib.repeat(inputs, 2, layers.conv2d, 64, [3, 3], scope='conv1')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool1')
            net = layers_lib.repeat(net, 2, layers.conv2d, 128, [3, 3], scope='conv2')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool2')
            net = layers_lib.repeat(net, 3, layers.conv2d, 256, [3, 3], scope='conv3')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool3')
            net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv4')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool4')
            net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv5')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool5')
            # Use conv2d instead of fully_connected layers.
            net = layers.conv2d(net, 4096, [7, 7], padding='VALID', scope='fc6')
            net = layers_lib.dropout(net, dropout_keep_prob, is_training=is_training, scope='dropout6')
            net = layers.conv2d(net, 4096, [1, 1], scope='fc7')
            net = layers_lib.dropout(net, dropout_keep_prob, is_training=is_training, scope='dropout7')
            net = layers.conv2d(net,
                                num_classes, [1, 1],
                                activation_fn=None,
                                normalizer_fn=None,
                                scope='fc8')
            # Convert end_points_collection into a end_point dict.
            end_points = utils.convert_collection_to_dict(end_points_collection)
            if spatial_squeeze:
                net = array_ops.squeeze(net, [1, 2], name='fc8/squeezed')
                end_points[sc.name + '/fc8'] = net
            return net, end_points


vgg_16.default_image_size = 224

Notes on less common APIs

nn.l2_normalize: L2 normalization

slim.repeat: quickly building repeated (stacked) layers

Tensor.get_shape().with_rank(rank).as_list(): shape retrieval with an assertion-like rank check

tensorflow.contrib.layers.python.layers.utils.collect_named_outputs: add a tensor to collections under an alias
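
A small sketch of my own (TensorFlow 1.x contrib API) illustrating slim.repeat and with_rank:

import tensorflow as tf
slim = tf.contrib.slim

x = tf.placeholder(tf.float32, (None, 300, 300, 3))

# slim.repeat: the single call below is equivalent to stacking slim.conv2d twice,
# under the variable scopes 'conv1/conv1_1' and 'conv1/conv1_2'.
net = slim.repeat(x, 2, slim.conv2d, 64, [3, 3], scope='conv1')

# with_rank: acts like an assertion on the static rank before reading the shape.
shape = net.get_shape().with_rank(4).as_list()   # [None, 300, 300, 64]
# net.get_shape().with_rank(3)                   # would raise ValueError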
