SSD: Single Shot Detector for Scene Text Detection

Published: 2024/9/21

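Preface

In an earlier post, "Paper reading: SSD: Single Shot MultiBox Detector", I covered this recent object detection algorithm.

Since SSD is designed to detect objects, can it also be used to detect text in natural scene images? The answer is yes.

Inspired by the ssd-plate_detection work of solace_hyh from Zhejiang University, this post records how I applied SSD to text detection.

All of the code is on GitHub: https://github.com/chenxinpeng/SSD_scene-text-detection . The code quality is not great, so suggestions from more experienced readers are welcome. ^_^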

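Preparing and converting the dataset

The ICDAR 2011 training set contains 229 images. I split it into two parts of 159 and 70 images. The former is used for training, the latter for testing during training.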
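The post does not say how the 159/70 split was chosen. A minimal sketch, assuming a random split with a fixed seed; the helper `split_names`, the seed, and the stand-in image names are all hypothetical and not taken from the original code:

```python
import random

def split_names(names, n_train, seed=0):
    """Shuffle a sorted copy of names deterministically and cut it into two parts."""
    shuffled = sorted(names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in names for the 229 training images (hypothetical; real names come from the dataset).
all_names = [str(i) for i in range(100, 329)]
trainval_names, test_names = split_names(all_names, n_train=159)
print(len(trainval_names), len(test_names))
```

Fixing the seed keeps the split reproducible, so the lmdb files can be rebuilt later with the same train/test membership.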

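The next step is to convert these images to lmdb format for Caffe training, and to convert the text region labels to the Pascal VOC XML format.

Converting the ground truth to Pascal VOC XML files

First, convert the gt_**.txt label files provided by ICDAR 2011 to the Pascal VOC XML format.

Start with the original gt_**.txt format. Take the following original image as an example:

Here is its ground truth file: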

158,128,412,182,"Footpath"
442,128,501,170,"To"
393,198,488,240,"and"
63,200,363,242,"Colchester"
71,271,383,313,"Greenstead"

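Each line of the ground truth file has the format: xmin, ymin, xmax, ymax, label. Note the coordinate system used here: the origin is at the top-left corner of the image, x increases to the right, and y increases downward.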
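As an illustration of the line format described above, here is a small sketch that parses one ground truth line into numbers and a label. The helper name `parse_gt_line` is hypothetical; it uses the standard `csv` module so that the quoted label is unquoted correctly.

```python
import csv

def parse_gt_line(line):
    """Parse one 'xmin,ymin,xmax,ymax,"label"' line into four ints and a string."""
    fields = next(csv.reader([line], skipinitialspace=True))
    xmin, ymin, xmax, ymax = (int(v) for v in fields[:4])
    label = ",".join(fields[4:])  # re-join in case an unquoted label contained commas
    return xmin, ymin, xmax, ymax, label

box = parse_gt_line('158,128,412,182,"Footpath"')
print(box)
```

For detection only the four coordinates matter; the transcription is ignored because every region is assigned the same class.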

The code that converts the ground truth txt files to the Pascal VOC XML format is as follows:

#! /usr/bin/python
import os
import glob
from PIL import Image

# Directory holding the ICDAR images
src_img_dir = "/media/chenxp/Datadisk/ocr_dataset/ICDAR2011/train-textloc"
# Directory holding the ICDAR ground truth txt files
src_txt_dir = "/media/chenxp/Datadisk/ocr_dataset/ICDAR2011/train-textloc"

img_Lists = glob.glob(src_img_dir + '/*.jpg')

img_basenames = []  # e.g. 100.jpg
for item in img_Lists:
    img_basenames.append(os.path.basename(item))

img_names = []  # e.g. 100
for item in img_basenames:
    temp1, temp2 = os.path.splitext(item)
    img_names.append(temp1)

for img in img_names:
    im = Image.open(src_img_dir + '/' + img + '.jpg')
    width, height = im.size

    # open the corresponding txt file
    with open(src_txt_dir + '/gt_' + img + '.txt') as gt_file:
        gt = gt_file.read().splitlines()

    # write the xml file; opening with 'w' creates it, so os.mknod is not needed
    with open(src_txt_dir + '/' + img + '.xml', 'w') as xml_file:
        xml_file.write('<annotation>\n')
        xml_file.write('    <folder>VOC2007</folder>\n')
        xml_file.write('    <filename>' + str(img) + '.jpg' + '</filename>\n')
        xml_file.write('    <size>\n')
        xml_file.write('        <width>' + str(width) + '</width>\n')
        xml_file.write('        <height>' + str(height) + '</height>\n')
        xml_file.write('        <depth>3</depth>\n')
        xml_file.write('    </size>\n')

        # write each text region to the xml file
        for img_each_label in gt:
            if not img_each_label.strip():
                continue  # skip blank lines
            spt = img_each_label.split(',')
            xml_file.write('    <object>\n')
            xml_file.write('        <name>text</name>\n')
            xml_file.write('        <pose>Unspecified</pose>\n')
            xml_file.write('        <truncated>0</truncated>\n')
            xml_file.write('        <difficult>0</difficult>\n')
            xml_file.write('        <bndbox>\n')
            xml_file.write('            <xmin>' + str(spt[0]) + '</xmin>\n')
            xml_file.write('            <ymin>' + str(spt[1]) + '</ymin>\n')
            xml_file.write('            <xmax>' + str(spt[2]) + '</xmax>\n')
            xml_file.write('            <ymax>' + str(spt[3]) + '</ymax>\n')
            xml_file.write('        </bndbox>\n')
            xml_file.write('    </object>\n')

        xml_file.write('</annotation>')

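Running the code above produces XML files like the following. Using the same 100.jpg image as the example, the conversion result is:

The XML files generated by the code above are stored in the same directory as the image files.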
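Writing the XML by string concatenation works here because only a fixed class name and numbers are written. As an alternative sketch (not the method used in this post), the same annotation structure can be built with the standard `xml.etree.ElementTree` module, which takes care of escaping. The function name `build_annotation` and the 640x480 image size are assumptions for illustration; the two boxes come from the sample ground truth above.

```python
import xml.etree.ElementTree as ET

def build_annotation(filename, width, height, boxes, class_name="text"):
    """Return a Pascal VOC style <annotation> element for one image."""
    root = ET.Element("annotation")
    ET.SubElement(root, "folder").text = "VOC2007"
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for box in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = class_name
        ET.SubElement(obj, "pose").text = "Unspecified"
        ET.SubElement(obj, "truncated").text = "0"
        ET.SubElement(obj, "difficult").text = "0"
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
            ET.SubElement(bndbox, tag).text = str(value)
    return root

# The image size is an assumed value; the boxes are the first two sample regions.
annotation = build_annotation("100.jpg", 640, 480,
                              [(158, 128, 412, 182), (442, 128, 501, 170)])
xml_text = ET.tostring(annotation, encoding="unicode")
print(xml_text)
```

The output is a single unindented line, which is equivalent XML; whitespace between elements carries no meaning in this format.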

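Generating the list files of training images and XML labels

In this step, as SSD training requires, the path of each image and the path of its XML file are written to txt files that are read during training. One file is named trainval.txt and the other test.txt. The format is: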

scenetext/JPEGImages/106.jpg scenetext/Annotations/106.xml
scenetext/JPEGImages/203.jpg scenetext/Annotations/203.xml
scenetext/JPEGImages/258.jpg scenetext/Annotations/258.xml
scenetext/JPEGImages/122.jpg scenetext/Annotations/122.xml
scenetext/JPEGImages/103.jpg scenetext/Annotations/103.xml
scenetext/JPEGImages/213.jpg scenetext/Annotations/213.xml
scenetext/JPEGImages/149.jpg scenetext/Annotations/149.xml
......

The code that generates them is as follows:

#! /usr/bin/python
import os
import glob

trainval_dir = "/home/chenxp/data/VOCdevkit/scenetext/trainval"
test_dir = "/home/chenxp/data/VOCdevkit/scenetext/test"

trainval_img_lists = glob.glob(trainval_dir + '/*.jpg')
trainval_img_names = []
for item in trainval_img_lists:
    temp1, temp2 = os.path.splitext(os.path.basename(item))
    trainval_img_names.append(temp1)

test_img_lists = glob.glob(test_dir + '/*.jpg')
test_img_names = []
for item in test_img_lists:
    temp1, temp2 = os.path.splitext(os.path.basename(item))
    test_img_names.append(temp1)

# paths written to the list files, relative to the VOCdevkit directory
dist_img_dir = "scenetext/JPEGImages"
dist_anno_dir = "scenetext/Annotations"

trainval_fd = open("/home/chenxp/caffe/data/scenetext/trainval.txt", 'w')
test_fd = open("/home/chenxp/caffe/data/scenetext/test.txt", 'w')

# one line per image: "<image path> <annotation path>"
for item in trainval_img_names:
    trainval_fd.write(dist_img_dir + '/' + str(item) + '.jpg' + ' ' + dist_anno_dir + '/' + str(item) + '.xml\n')

for item in test_img_names:
    test_fd.write(dist_img_dir + '/' + str(item) + '.jpg' + ' ' + dist_anno_dir + '/' + str(item) + '.xml\n')

# close the files so that everything is flushed to disk
trainval_fd.close()
test_fd.close()
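Before building the lmdb it can help to confirm that every path in a list file actually exists. A small sketch under the assumption that the paths in the list are relative to the VOCdevkit directory; the helper name `missing_entries` is hypothetical:

```python
import os

def missing_entries(list_path, data_root):
    """Return (line_number, text) for malformed lines and for listed files that do not exist."""
    missing = []
    with open(list_path) as fd:
        for lineno, line in enumerate(fd, start=1):
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            if len(parts) != 2:
                missing.append((lineno, line.strip()))  # expected "<image> <annotation>"
                continue
            for rel_path in parts:
                if not os.path.isfile(os.path.join(data_root, rel_path)):
                    missing.append((lineno, rel_path))
    return missing
```

With the paths used in this post, a call would look like missing_entries("/home/chenxp/caffe/data/scenetext/trainval.txt", "/home/chenxp/data/VOCdevkit"); an empty list means every image and annotation was found.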

Generating the test name size text file

SSD also needs a file named test_name_size.txt, which records the image name, height, and width of the training and test images. Its contents look like this:

106 480 640
203 480 640
258 480 640
318 480 640
122 480 640
103 480 640
320 640 480
......

The code that generates this text file is as follows:

#! /usr/bin/python
import os
import glob
from PIL import Image

img_dir = "/home/chenxp/data/VOCdevkit/scenetext/JPEGImages"

img_lists = glob.glob(img_dir + '/*.jpg')

test_name_size = open('/home/chenxp/caffe/data/scenetext/test_name_size.txt', 'w')

for item in img_lists:
    img = Image.open(item)
    width, height = img.size  # PIL returns (width, height)
    temp1, temp2 = os.path.splitext(os.path.basename(item))
    # each line: image name, height, width
    test_name_size.write(temp1 + ' ' + str(height) + ' ' + str(width) + '\n')

test_name_size.close()

Preparing the label map file labelmap

This prototxt file records the mapping between label and name. Its contents are:

item {
  name: "none_of_the_above"
  label: 0
  display_name: "background"
}
item {
  name: "object"
  label: 1
  display_name: "text"
}

I renamed my prototxt file to: labelmap_voc.prototxt
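One thing worth checking before the conversion: the XML files above use text as the object name, while the name field for label 1 in this labelmap is "object". Assuming the conversion tool looks up each XML object name in the labelmap's name field (an assumption about the tool, not something stated in this post), a quick consistency check could look like the sketch below. Both helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def labelmap_names(prototxt_text):
    """Collect the value of every `name:` field (not `display_name:`) in a labelmap prototxt."""
    return set(re.findall(r'(?<![A-Za-z_])name:\s*"([^"]*)"', prototxt_text))

def unknown_object_names(xml_text, known_names):
    """Return the <object><name> values in a VOC XML string that the labelmap lacks."""
    root = ET.fromstring(xml_text)
    found = {obj.findtext("name") for obj in root.iter("object")}
    return sorted(found - known_names)

labelmap = '''
item { name: "none_of_the_above" label: 0 display_name: "background" }
item { name: "object" label: 1 display_name: "text" }
'''
sample_xml = "<annotation><object><name>text</name></object></annotation>"
print(unknown_object_names(sample_xml, labelmap_names(labelmap)))
```

If the check reports a name and the tool does enforce the match, making the XML name and the labelmap name agree should resolve it.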

Generating the lmdb database

Once the text files above are ready, place them in the following location:

/home/chenxp/caffe/data/scenetext

Next, modify the create_data.sh script provided in the SSD source code and run it (I renamed the file to create_data_scenetext.sh):

cur_dir=$(cd $( dirname ${BASH_SOURCE[0]} ) && pwd )
root_dir=$cur_dir/../..

cd $root_dir

redo=1
data_root_dir="$HOME/data/VOCdevkit"
dataset_name="scenetext"
mapfile="$root_dir/data/$dataset_name/labelmap_voc_scenetext.prototxt"
anno_type="detection"
db="lmdb"
min_dim=0
max_dim=0
width=0
height=0

extra_cmd="--encode-type=jpg --encoded"
if [ $redo ]
then
  extra_cmd="$extra_cmd --redo"
fi
for subset in test trainval
do
  python $root_dir/scripts/create_annoset.py --anno-type=$anno_type --label-map-file=$mapfile \
    --min-dim=$min_dim --max-dim=$max_dim --resize-width=$width --resize-height=$height \
    --check-label $extra_cmd $data_root_dir $root_dir/data/$dataset_name/$subset.txt \
    $data_root_dir/$dataset_name/$db/$dataset_name"_"$subset"_"$db examples/$dataset_name
done

The bash script above automatically converts the ICDAR 2011 training images and their labels into lmdb files. The output location can be read from the script; mine is:

/home/chenxp/caffe/examples/scenetext_trainval_lmdb
/home/chenxp/caffe/examples/scenetext_test_lmdb

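Training the model

Using SSD for your own detection task requires fine-tuning a pretrained network.

Specifically, load the VGG_ILSVRC_16_layers_fc_reduced.caffemodel provided by the SSD authors and continue training on our own data from this pretrained model.

After downloading it, put it in the following location: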

/home/chenxp/caffe/models/VGGNet

Then modify the training Python script provided by the authors, ssd_pascal.py. This script automatically creates the following files needed for training:

deploy.prototxt
solver.prototxt
train.prototxt
test.prototxt

We need to modify the following places to match our own setup:

# Modify the job name if you want.
job_name = "SSD_{}".format(resize)
# The name of the model. Modify it if you want.
model_name = "VGG_VOC0712_{}".format(job_name)

# Directory which stores the model .prototxt file.
save_dir = "models/VGGNet/VOC0712/{}".format(job_name)
# Directory which stores the snapshot of models.
snapshot_dir = "models/VGGNet/VOC0712/{}".format(job_name)
# Directory which stores the job script and log file.
job_dir = "jobs/VGGNet/VOC0712/{}".format(job_name)
# Directory which stores the detection results.
output_result_dir = "{}/data/VOCdevkit/results/VOC2007/{}/Main".format(os.environ['HOME'], job_name)

# model definition files.
train_net_file = "{}/train.prototxt".format(save_dir)
test_net_file = "{}/test.prototxt".format(save_dir)
deploy_net_file = "{}/deploy.prototxt".format(save_dir)
solver_file = "{}/solver.prototxt".format(save_dir)
# snapshot prefix.
snapshot_prefix = "{}/{}".format(snapshot_dir, model_name)
# job script path.
job_file = "{}/{}.sh".format(job_dir, model_name)

# Stores the test image names and sizes. Created by data/VOC0712/create_list.sh
name_size_file = "data/VOC0712/test_name_size.txt"
# The pretrained model. We use the Fully convolutional reduced (atrous) VGGNet.
pretrain_model = "models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel"
# Stores LabelMapItem.
label_map_file = "data/VOC0712/labelmap_voc.prototxt"

num_classes = 21

num_test_image = 4952
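The defaults above describe Pascal VOC: 21 is the 20 VOC classes plus background, and 4952 is the size of the VOC2007 test set. For the scene text data these two numbers change. A sketch of how they could be derived from the files prepared earlier instead of being hard-coded; the helper names are hypothetical, and treating num_classes as the number of labelmap items (background included) is an assumption based on that default:

```python
import re

def count_labelmap_items(prototxt_text):
    """Count `item {` blocks in a labelmap prototxt (background included)."""
    return len(re.findall(r'\bitem\s*\{', prototxt_text))

def count_list_entries(list_text):
    """Count non-empty lines in a trainval.txt / test.txt style list."""
    return sum(1 for line in list_text.splitlines() if line.strip())

# Inline stand-ins for the real files; in practice read them from data/scenetext.
labelmap_text = '''
item { name: "none_of_the_above" label: 0 display_name: "background" }
item { name: "object" label: 1 display_name: "text" }
'''
test_list_text = "scenetext/JPEGImages/106.jpg scenetext/Annotations/106.xml\n" * 70

num_classes = count_labelmap_items(labelmap_text)
num_test_image = count_list_entries(test_list_text)
print(num_classes, num_test_image)
```

The path-valued settings, such as name_size_file and label_map_file, would likewise need to point at the files placed under data/scenetext.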

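My training parameters

A few other things also need to change, such as the training parameters. If you start with the default solver.prototxt parameters from the authors' ssd_pascal.py, the following happens:

After running for a while, the loss becomes nan: training diverges and does not converge.

After some debugging, I settled on the following solver.prototxt setting, with which training converges: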

base_lr: 0.0001

Set the remaining parameters as you see fit. The learning rate must be small; with the original 0.001, training diverges.
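Why a learning rate that is only ten times larger can turn convergence into divergence is easy to see on a toy problem. The sketch below runs plain gradient descent on a one-dimensional quadratic; it is not a model of the SSD loss, and the curvature value is chosen purely so that the two learning rates from this post fall on opposite sides of the stability threshold.

```python
def run_gradient_descent(lr, curvature=3000.0, w0=1.0, steps=50):
    """Minimize f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = curvature * w  # derivative of f at w
        w = w - lr * grad     # each step multiplies w by (1 - lr * curvature)
    return w

small = run_gradient_descent(lr=0.0001)  # factor 0.7 per step: |w| shrinks toward 0
large = run_gradient_descent(lr=0.001)   # factor -2 per step: |w| doubles every step
print(abs(small), abs(large))
```

In this toy setting the iterate shrinks only while the learning rate stays below 2 / curvature; above that, every step overshoots by more than it corrects, which is the same qualitative behavior as a loss that blows up to nan.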

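Training finished:

As you can see, the final test accuracy is 0.776573, so SSD seems to work reasonably well here.

I uploaded my trained model to the cloud. Link: http://share.weiyun.com/1c544de66be06ea04774fd11e820a780 (password: ERid5Y)

It is needed in the next stage, testing.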

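Predicting with the trained model

The SSD authors have also written the prediction code for us; we only need to change the parameters.

Open ~/caffe/examples/ssd_detect.ipynb with jupyter notebook. This is the notebook the authors wrote for running detection with a trained caffemodel.

Specify the caffemodel and deploy.prototxt; for the details, see the code I uploaded.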
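The notebook itself needs Caffe, so it is not reproduced here. As a rough sketch of the post-processing it performs, assume (as in the SSD example notebook) that each detection row is [image_id, label, confidence, xmin, ymin, xmax, ymax] with box coordinates normalized to [0, 1]. The function name `select_detections`, the 0.6 threshold, and the sample rows are illustrative only.

```python
def select_detections(rows, image_width, image_height, conf_threshold=0.6):
    """Keep confident detections and convert normalized boxes to pixel coordinates."""
    results = []
    for _image_id, label, conf, xmin, ymin, xmax, ymax in rows:
        if conf < conf_threshold:
            continue  # drop low-confidence boxes
        box = (
            int(round(xmin * image_width)),
            int(round(ymin * image_height)),
            int(round(xmax * image_width)),
            int(round(ymax * image_height)),
        )
        results.append((int(label), conf, box))
    return results

# Made-up rows for illustration only; they are not results from the trained model.
rows = [
    [0, 1, 0.92, 0.25, 0.25, 0.5, 0.375],
    [0, 1, 0.30, 0.10, 0.10, 0.2, 0.2],
]
print(select_detections(rows, image_width=640, image_height=480))
```

Raising the threshold trades recall for precision, which matters for scene text because small or blurred words tend to receive lower confidence.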

Testing a few images gives the following results:

References

ECCV 2016 paper: "SSD: Single Shot MultiBox Detector"

SSD source code

SSD-plate_detection from solace_hyh

Training your own dataset with the SSD framework
