Notes on Training Tesseract-OCR for Text Recognition

Published: 2025/7/25

Author: 毛毛卷彎彎

from: http://www.jianshu.com/p/5c8c6b170f6f

Tesseract official documentation page

https://github.com/tesseract-ocr/tesseract

jTessBoxEditor official documentation page

http://vietocr.sourceforge.net/training.html

jTessBoxEditor is a box editor and trainer for Tesseract OCR, providing editing of box data of both Tesseract 2.0x and 3.0x formats and full automation of Tesseract training. It can read images of common image formats, including multi-page TIFF. The program requires Java Runtime Environment 7 or later.

Tools and environment

Tesseract-OCR engine
jTessBoxEditor, used to train the character set
Tesseract-OCR is installed on CentOS 7; jTessBoxEditor is installed on Windows

Installing Tesseract

I chose CentOS 7 for Tesseract because my earlier attempts (the Windows build, plus both a source build and a yum install on CentOS 6) each reported some missing component at run time.
On CentOS 7 I installed it with yum.
The yum install needs the EPEL repository first.
[root@docker01 yum.repos.d]# yum install epel-release
An epel.repo file now appears under /etc/yum.repos.d.
Now install Tesseract with yum:
[root@docker01 yum.repos.d]# yum install tesseract
yum resolves the dependencies automatically, which avoids missing-file errors later on.

Dependencies Resolved
================================================================================
Package                  Arch    Version                Repository  Size
================================================================================
Installing:
tesseract                x86_64  3.04.00-3.el7          epel         11 M
Installing for dependencies:
cairo                    x86_64  1.14.2-1.el7           base        711 k
fontconfig               x86_64  2.10.95-7.el7          base        228 k
fontpackages-filesystem  noarch  1.44-8.el7             base        9.9 k
giflib                   x86_64  4.1.6-9.el7            base         40 k
graphite2                x86_64  1.3.6-1.el7_2          updates     112 k
harfbuzz                 x86_64  0.9.36-1.el7           base        156 k
jbigkit-libs             x86_64  2.0-11.el7             base         46 k
leptonica                x86_64  1.72-2.el7             epel        928 k
libICE                   x86_64  1.0.9-2.el7            base         65 k
libSM                    x86_64  1.2.2-2.el7            base         39 k
libX11                   x86_64  1.6.3-2.el7            base        605 k
libX11-common            noarch  1.6.3-2.el7            base        162 k
libXau                   x86_64  1.0.8-2.1.el7          base         29 k
libXdamage               x86_64  1.1.4-4.1.el7          base         20 k
libXext                  x86_64  1.3.3-3.el7            base         39 k
libXfixes                x86_64  5.0.1-2.1.el7          base         18 k
libXft                   x86_64  2.3.2-2.el7            base         58 k
libXrender               x86_64  0.9.8-2.1.el7          base         25 k
libXxf86vm               x86_64  1.1.3-2.1.el7          base         17 k
libicu                   x86_64  50.1.2-15.el7          base        6.9 M
libjpeg-turbo            x86_64  1.2.90-5.el7           base        134 k
libpng                   x86_64  2:1.5.13-7.el7_2       updates     213 k
libthai                  x86_64  0.1.14-9.el7           base        187 k
libtiff                  x86_64  4.0.3-25.el7_2         updates     169 k
libwebp                  x86_64  0.3.0-3.el7            base        170 k
libxcb                   x86_64  1.11-4.el7             base        189 k
libxshmfence             x86_64  1.2-1.el7              base        7.2 k
mesa-libEGL              x86_64  10.6.5-3.20150824.el7  base         74 k
mesa-libGL               x86_64  10.6.5-3.20150824.el7  base        184 k
mesa-libgbm              x86_64  10.6.5-3.20150824.el7  base         40 k
mesa-libglapi            x86_64  10.6.5-3.20150824.el7  base         39 k
pango                    x86_64  1.36.8-2.el7           base        287 k
pixman                   x86_64  0.32.6-3.el7           base        254 k

Check that the installation succeeded

[root@docker01 tesseract]# tesseract
Usage:
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir /path           specify the location of tessdata path
  --user-words /path/to/file     specify the location of user words file
  --user-patterns /path/to/file  specify the location of user patterns file
  -l lang[+lang]                 specify language(s) used for OCR
  -c configvar=value             set value for control parameter.
                                 Multiple -c arguments are allowed.
  -psm pagesegmode               specify page segmentation mode.
These options must occur before any configfile.

pagesegmode values are:
  0 = Orientation and script detection (OSD) only.
  1 = Automatic page segmentation with OSD.
  2 = Automatic page segmentation, but no OSD, or OCR
  3 = Fully automatic page segmentation, but no OSD. (Default)
  4 = Assume a single column of text of variable sizes.
  5 = Assume a single uniform block of vertically aligned text.
  6 = Assume a single uniform block of text.
  7 = Treat the image as a single text line.
  8 = Treat the image as a single word.
  9 = Treat the image as a single word in a circle.
  10 = Treat the image as a single character.

Single options:
  -v --version: version info
  --list-langs: list available languages for tesseract engine. Can be used with --tessdata-dir.
  --print-parameters: print tesseract parameters to the stdout.
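The usage text lists a -psm option, which the rest of this walkthrough never sets. As a minimal sketch, assuming the test images each hold one short line of text (the images are not reproduced here, so that is a guess), a wrapper could request mode 7. The helper name ocr_single_line is hypothetical, and whether mode 7 actually improves results on these images was not tested in the original post.

```shell
# Hypothetical helper: run tesseract 3.04 in "single text line" mode.
# The -psm spelling comes from the 3.04 usage text; newer releases use --psm.
ocr_single_line() (
    img=$1    # input image, e.g. 0.gif
    out=$2    # output base name; tesseract appends .txt
    lang=$3   # language pack name, e.g. eng
    tesseract "$img" "$out" -l "$lang" -psm 7
)
```

Usage would look like: ocr_single_line 0.gif out.0 eng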

List the available languages

[root@docker01 tesseract]# tesseract --list-langs
List of available languages (2):
eng

Only English shows up.
The language packs live in this directory:

[root@docker01 tessdata]# pwd
/usr/share/tesseract/tessdata
[root@docker01 tessdata]# ll
total 37624
drwxr-xr-x 2 root root     4096 Oct 25 22:51 configs
-rw-r--r-- 1 root root   171918 Jun 25  2015 eng.cube.bigrams
-rw-r--r-- 1 root root       38 Jun 25  2015 eng.cube.fold
-rw-r--r-- 1 root root      181 Jun 25  2015 eng.cube.lm
-rw-r--r-- 1 root root   857304 Jun 25  2015 eng.cube.nn
-rw-r--r-- 1 root root      254 Jun 25  2015 eng.cube.params
-rw-r--r-- 1 root root 13020078 Jun 25  2015 eng.cube.size
-rw-r--r-- 1 root root  2444187 Jun 25  2015 eng.cube.word-freq
-rw-r--r-- 1 root root      996 Jun 25  2015 eng.tesseract_cube.nn
-rw-r--r-- 1 root root 21876550 Jun 25  2015 eng.traineddata
-rw-r--r-- 1 root root   124215 Oct 25 23:08 normal.traineddata
-rw-r--r-- 1 root root      568 Jan 26  2016 pdf.ttf
drwxr-xr-x 2 root root       92 Oct 25 22:51 tessconfigs

To add a language pack later, download it and place it in this directory.
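Since a language pack is simply a .traineddata file in that directory, one way to double-check what is installed is to list those files directly. The sketch below is only an illustration; the function name list_traineddata is made up here, and the default path is the one from the pwd output above.

```shell
# Hypothetical helper: print the language names found in a tessdata directory.
list_traineddata() (
    dir=${1:-/usr/share/tesseract/tessdata}
    for f in "$dir"/*.traineddata; do
        [ -e "$f" ] || continue          # the glob matched nothing
        basename "$f" .traineddata       # eng.traineddata -> eng
    done
)
```

Called with no argument it scans /usr/share/tesseract/tessdata; pass another directory to check a staging folder before copying files in.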

The pkgs.org page for the tesseract package, which also lists the files it installs:
https://pkgs.org/centos-7/epel-x86_64/tesseract-3.04.00-3.el7.x86_64.rpm.html

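Installing jTessBoxEditor

jTessBoxEditor requires JRE 7 (Java Runtime Environment) or later.
After installing the JRE, download jTessBoxEditor, extract it, and run train.bat to start it.

[Screenshot: the jTessBoxEditor window after launch]

Both required tools are now installed.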

Initial recognition

Prepare a few images

Upload these images to the machine where tesseract is installed

[root@docker01 test01]# ll
total 24
-rw-r--r-- 1 root root 1829 Oct 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 Oct 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 Oct 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 Oct 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 Oct 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 Oct 24 16:06 5.gif

Recognize 0.gif first

[root@docker01 test01]# tesseract 0.gif out.0 -l eng
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory

An out.0.txt file now appears in the directory

[root@docker01 test01]# ll
total 28
-rw-r--r-- 1 root root 1829 Oct 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 Oct 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 Oct 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 Oct 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 Oct 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 Oct 24 16:06 5.gif
-rw-r--r-- 1 root root    6 Oct 26 00:52 out.0.txt

View the recognized text

[root@docker01 test01]# cat out.0.txt
[54v

That is slightly different from the I54v shown in the image.

Batch-recognize the remaining images

[root@docker01 test01]# for i in {1..5};do tesseract $i.gif out.$i -l eng;done
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory

View the recognized text

[root@docker01 test01]# ll
total 48
-rw-r--r-- 1 root root 1829 Oct 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 Oct 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 Oct 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 Oct 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 Oct 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 Oct 24 16:06 5.gif
-rw-r--r-- 1 root root    6 Oct 26 00:52 out.0.txt
-rw-r--r-- 1 root root    9 Oct 26 01:00 out.1.txt
-rw-r--r-- 1 root root    5 Oct 26 01:00 out.2.txt
-rw-r--r-- 1 root root    6 Oct 26 01:00 out.3.txt
-rw-r--r-- 1 root root    7 Oct 26 01:00 out.4.txt
-rw-r--r-- 1 root root    5 Oct 26 01:00 out.5.txt
[root@docker01 test01]# cat *.txt
[54v
ikhb‘
ymm
7y28
nl 9c
mzb

Compared with the images above, only 3.gif was recognized correctly.
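The loop above hard-codes the numbers 1 to 5, so 0.gif has to be handled separately. A variant that walks every .gif in a directory might look like the sketch below. The function name ocr_dir is hypothetical, and the out.NAME output naming simply mirrors the commands above.

```shell
# Hypothetical helper: run tesseract on every .gif in a directory.
ocr_dir() (
    dir=$1            # directory holding the images
    lang=${2:-eng}    # language pack to use, eng by default
    for img in "$dir"/*.gif; do
        [ -e "$img" ] || continue            # no .gif files present
        name=$(basename "$img" .gif)         # 0.gif -> 0
        tesseract "$img" "$dir/out.$name" -l "$lang"
    done
)
```

For the directory used here, the call would be: ocr_dir /path/to/test01 eng (the path is a placeholder).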

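Training

Merging the images

Back on Windows, run jTessBoxEditor and merge all the images into a single .tif file

Open all the images to be merged

Name the merged image

Note: the usual advice is that the name must follow this format.
tif naming convention:
[lang].[fontname].exp[num].tif
Here lang is the language name, fontname is the font name, and num is a sequence number; all three can be chosen freely.
In my tests, though, other names worked as well; a plain name such as mytest.tif is fine.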
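For anyone who prefers to stick to the conventional pattern, the name can be assembled mechanically. This is just a sketch; training_tif_name is a made-up helper, and eng, myfont, and 0 in the usage line are placeholder values, not names taken from this walkthrough.

```shell
# Hypothetical helper: build a file name of the form [lang].[fontname].exp[num].tif
training_tif_name() {
    # $1 = lang, $2 = fontname, $3 = num
    printf '%s.%s.exp%s.tif\n' "$1" "$2" "$3"
}
```

For example, training_tif_name eng myfont 0 prints eng.myfont.exp0.tif.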

A success message appears, and a file named mytest.tif is created in the image directory

Generating the box file

Upload mytest.tif to the CentOS 7 system

[root@docker01 04test]# ll
total 100
-rw-r--r-- 1 root root 99212 Oct 26 15:42 mytest.tif

Open a command line in the directory containing mytest.tif and generate the corresponding box file (*.box).
The box file records every character tesseract recognized together with its position coordinates.
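In the Tesseract 3.x box format, each line holds a symbol, then the left, bottom, right, and top pixel coordinates, then a page index. A quick way to spot pages where few or no characters were found is to count boxes per page. The sketch below assumes that six-field layout; boxes_per_page is a hypothetical name, and it skips any line that does not have exactly six fields.

```shell
# Hypothetical helper: print "page count" for each page in a box file.
# Assumes lines of the form: symbol left bottom right top page
boxes_per_page() {
    awk 'NF == 6 { n[$6]++ } END { for (p in n) print p, n[p] }' "$1" | sort -n
}
```

Running boxes_per_page mytest.box after the makebox step would show which pages need the most manual work in jTessBoxEditor; pages that produced no boxes at all do not appear in the output.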

[root@docker01 04test]# tesseract mytest.tif mytest batch.nochop makebox
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Page 2
Page 3
Empty page!!
Empty page!!
Empty page!!
Page 4
Page 5
Page 6
Page 7
Empty page!!
Empty page!!
Empty page!!
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
Page 14
Page 15
Page 16
Page 17
Empty page!!
Empty page!!
Empty page!!
Page 18
Page 19
Page 20
Page 21
Empty page!!
Empty page!!
Empty page!!
Warning in pixReadMemTiff: tiff page 21 not found

The directory now contains two new files, mytest.box and mytest.txt

[root@docker01 04test]# ll
total 108
-rw-r--r-- 1 root root  1005 Oct 26 23:52 mytest.box
-rw-r--r-- 1 root root 99212 Oct 26 15:42 mytest.tif
-rw-r--r-- 1 root root   101 Oct 26 23:52 mytest.txt

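Correcting the text

Download mytest.box to the Windows system and put it in the same directory as mytest.tif.

Use jTessBoxEditor to correct the text

Situations you may run into while correcting

Normal case

Here the first recognized value is 6, but the character in the image is e, so edit it by hand

After editing, press Enter, then click Save

Then move on to the next image

If the recognized text already matches the image, go straight to the next one
Empty table

Some images cannot be recognized at all, possibly because of the background color; skip them and continue with the next image.
Partially recognized
In the image below, for example, the four characters were split into only two boxes

In this case, use the box-splitting feature and adjust the box positions

The boxes after adjustment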

Run Tesseract for Training

Generate the character feature file (*.tr)

Transfer the corrected box file back to the CentOS 7 system, deleting the old box file there first

[root@docker01 03test]# rm 200test.box
rm: remove regular file "200test.box"? y
[root@docker01 03test]# rz -by
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring 200test.box...
100% 9 KB 9 KB/sec 00:00:01 0 Errors
[root@docker01 03test]# tesseract 200test.tif 200test nobatch box.train

A .tr file now appears in the directory

[root@docker01 03test]# ll
total 1756
-rw-r--r-- 1 root root  10210 Oct 26 16:53 200test.box
-rw-r--r-- 1 root root 949532 Oct 26 15:13 200test.tif
-rw-r--r-- 1 root root 830214 Oct 27 00:58 200test.tr
-rw-r--r-- 1 root root    325 Oct 27 00:58 200test.txt

Compute the Character Set

Generate the character set (unicharset)

[root@docker01 03test]# unicharset_extractor 200test.box
Extracting unicharset from 200test.box
Wrote unicharset file ./unicharset.

Define the font properties file and cluster the character features

Create a file named font_properties in the working directory containing the line shown after the note.
Note: the name 200test must match the name used for the training files. Every flag is set to 0, meaning the font is not bold, italic, and so on.

200test 0 0 0 0 0

[root@docker01 03test]# ll
total 1764
-rw-r--r-- 1 root root  10210 Oct 26 16:53 200test.box
-rw-r--r-- 1 root root 949532 Oct 26 15:13 200test.tif
-rw-r--r-- 1 root root 830214 Oct 27 00:58 200test.tr
-rw-r--r-- 1 root root    325 Oct 27 00:58 200test.txt
-rw-r--r-- 1 root root     18 Oct 27 01:02 font_properties
-rw-r--r-- 1 root root   2301 Oct 27 01:00 unicharset
[root@docker01 03test]# cat font_properties
200test 0 0 0 0 0
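In the Tesseract 3.x training format, the five flags after the font name are italic, bold, fixed, serif, and fraktur, in that order. Writing the file from the shell avoids stray characters from an editor. The helper below is a sketch with a made-up name, write_font_properties, and it only covers the all-zero case used in this walkthrough.

```shell
# Hypothetical helper: write a one-line font_properties file with all flags 0.
write_font_properties() (
    font=$1                     # must match the training name, e.g. 200test
    out=${2:-font_properties}   # output path, defaults to ./font_properties
    # fields: fontname italic bold fixed serif fraktur
    printf '%s 0 0 0 0 0\n' "$font" > "$out"
)
```

Running write_font_properties 200test produces the same 18-byte file that appears in the listing above.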

Run the command:

[root@docker01 03test]# mftraining -F font_properties -U unicharset 200test.tr
Warning: No shape table file present: shapetable
Reading 200test.tr ...
Flat shape table summary: Number of shapes = 43 max unichars = 1 number with multiple unichars = 0
Warning: no protos/configs for Joined in CreateIntTemplates()
Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates()
Done!

Run the command:

[root@docker01 03test]# cntraining 200test.tr
Reading 200test.tr ...
Clustering ...
Writing normproto ...

The directory should now contain several new files. Add the prefix "test200." (the language name used in the commands that follow) to unicharset, inttemp, normproto, and pffmtable; the listing taken after the rename shows shapetable prefixed as well. Then combine the training files.

[root@docker01 03test]# ll
total 2100
-rw-r--r-- 1 root root  10210 Oct 26 16:53 200test.box
-rw-r--r-- 1 root root 949532 Oct 26 15:13 200test.tif
-rw-r--r-- 1 root root 830214 Oct 27 00:58 200test.tr
-rw-r--r-- 1 root root    325 Oct 27 00:58 200test.txt
-rw-r--r-- 1 root root     18 Oct 27 01:02 font_properties
-rw-r--r-- 1 root root 323869 Oct 27 01:03 inttemp
-rw-r--r-- 1 root root   5342 Oct 27 01:04 normproto
-rw-r--r-- 1 root root    341 Oct 27 01:03 pffmtable
-rw-r--r-- 1 root root    778 Oct 27 01:03 shapetable
-rw-r--r-- 1 root root   2301 Oct 27 01:00 unicharset

Rename the files before combining them

[root@docker01 03test]# ll
total 2100
-rw-r--r-- 1 root root  10210 Oct 26 16:53 200test.box
-rw-r--r-- 1 root root 949532 Oct 26 15:13 200test.tif
-rw-r--r-- 1 root root 830214 Oct 27 00:58 200test.tr
-rw-r--r-- 1 root root    325 Oct 27 00:58 200test.txt
-rw-r--r-- 1 root root     18 Oct 27 01:02 font_properties
-rw-r--r-- 1 root root 323869 Oct 27 01:03 test200.inttemp
-rw-r--r-- 1 root root   5342 Oct 27 01:04 test200.normproto
-rw-r--r-- 1 root root    341 Oct 27 01:03 test200.pffmtable
-rw-r--r-- 1 root root    778 Oct 27 01:03 test200.shapetable
-rw-r--r-- 1 root root   2301 Oct 27 01:00 test200.unicharset
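The rename itself is not shown in the transcript. One possible way to do it, sketched here under the assumption that the five generated files sit in the current directory, is a small loop. The function name add_lang_prefix is hypothetical; the file list mirrors the renamed files in the listing above.

```shell
# Hypothetical helper: prefix the generated training files with "<lang>."
add_lang_prefix() (
    lang=$1        # language pack name, e.g. test200
    dir=${2:-.}    # directory holding the generated files
    for f in unicharset inttemp normproto pffmtable shapetable; do
        if [ -f "$dir/$f" ]; then
            mv "$dir/$f" "$dir/$lang.$f"   # unicharset -> test200.unicharset
        fi
    done
)
```

In this walkthrough the call would be add_lang_prefix test200, run from the training directory. Files such as font_properties and the .tr file are left untouched.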

Combine the files

[root@docker01 03test]# combine_tessdata test200.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 (test200.config ) is -1
Offset for type 1 (test200.unicharset ) is 140
Offset for type 2 (test200.unicharambigs ) is -1
Offset for type 3 (test200.inttemp ) is 2441
Offset for type 4 (test200.pffmtable ) is 326310
Offset for type 5 (test200.normproto ) is 326651
Offset for type 6 (test200.punc-dawg ) is -1
Offset for type 7 (test200.word-dawg ) is -1
Offset for type 8 (test200.number-dawg ) is -1
Offset for type 9 (test200.freq-dawg ) is -1
Offset for type 10 (test200.fixed-length-dawgs ) is -1
Offset for type 11 (test200.cube-unicharset ) is -1
Offset for type 12 (test200.cube-word-dawg ) is -1
Offset for type 13 (test200.shapetable ) is 331993
Offset for type 14 (test200.bigram-dawg ) is -1
Offset for type 15 (test200.unambig-dawg ) is -1
Offset for type 16 (test200.params-model ) is -1
Output test200.traineddata created sucessfully.

Copy the resulting test200.traineddata into the tessdata directory of the tesseract installation.

[root@docker01 03test]# cp test200.traineddata /usr/share/tesseract/tessdata
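The training steps from box.train through the final copy can be collected in one place. The sketch below only prints the commands for review and executes nothing, so it is safe to run anywhere. The function name training_commands is hypothetical, and it assumes font_properties already exists and that the corrected box file is in place.

```shell
# Hypothetical helper: print the training commands from this walkthrough.
# Nothing is executed; review the output before running any of it.
training_commands() (
    base=$1   # training image/box base name, e.g. 200test
    lang=$2   # language pack name, e.g. test200
    echo "tesseract $base.tif $base nobatch box.train"
    echo "unicharset_extractor $base.box"
    echo "mftraining -F font_properties -U unicharset $base.tr"
    echo "cntraining $base.tr"
    for f in unicharset inttemp normproto pffmtable shapetable; do
        echo "mv $f $lang.$f"
    done
    echo "combine_tessdata $lang."
    echo "cp $lang.traineddata /usr/share/tesseract/tessdata"
)
```

training_commands 200test test200 reproduces the sequence used above, in order.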

Check which language packs are available now

[root@docker01 tesseract_test]# tesseract --list-langs
List of available languages (4):
eng
normal
myfont
test200

The new language pack is now trained. The next step is to use it to recognize text in images.

Recognizing again

Use the same images as at the start (the listing shows six, 0.gif through 5.gif)

[root@docker01 test01]# ll
total 44
-rw-r--r-- 1 root root 1829 Oct 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 Oct 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 Oct 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 Oct 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 Oct 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 Oct 24 16:06 5.gif

Batch-recognize with a loop

[root@docker01 test01]# for i in {1..5};do tesseract $i.gif out.$i -l test200;done
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory

The output files after recognition

[root@docker01 test01]# ll
total 48
-rw-r--r-- 1 root root 1829 Oct 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 Oct 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 Oct 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 Oct 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 Oct 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 Oct 24 16:06 5.gif
-rw-r--r-- 1 root root    6 Oct 27 01:18 out.0.txt
-rw-r--r-- 1 root root    6 Oct 27 01:18 out.1.txt
-rw-r--r-- 1 root root    6 Oct 27 01:18 out.2.txt
-rw-r--r-- 1 root root    6 Oct 27 01:18 out.3.txt
-rw-r--r-- 1 root root    7 Oct 27 01:18 out.4.txt
-rw-r--r-- 1 root root    6 Oct 27 01:18 out.5.txt

View the file contents and compare them with the images

[root@docker01 test01]# cat out.*
l54v
ikh6
ynxn
7y28
nl 9c
w4zb

[Image: the original captcha images, for comparison]

Compared with the first attempt, the recognition rate is much higher.
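To put a number on the improvement, the OCR output can be compared line by line against hand-typed labels. The sketch below assumes two text files with one entry per line in the same order; the function name count_exact_matches and both file names in the usage line are hypothetical, and the true labels are not part of this post, so no figures are given here.

```shell
# Hypothetical helper: count lines where the OCR result equals the label.
count_exact_matches() (
    expected=$1   # hand-typed labels, one per line
    actual=$2     # OCR results, one per line, same order
    paste "$expected" "$actual" |
        awk -F '\t' '{ total++; if ($1 == $2) ok++ } END { printf "%d/%d\n", ok, total }'
)
```

Usage: count_exact_matches labels.txt results.txt prints something like 5/6. Tesseract writes trailing blank lines into its .txt files, so strip those (for example with grep -v '^$') before building results.txt.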
