Train Tesseract LSTM with tesstrain.sh on Windows
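tesstrain.sh is the training script used in the examples of How to use the tools provided to train Tesseract 4.00. It is mainly used to train new fonts for a given language and comes from the Tesseract source tree (Tesseract/src/training). I have verified that tesstrain.sh works on Windows 10. This article walks through training a new font with tesstrain.sh on Windows; the training files used here have been uploaded to tesstrainsh-win.
Environment
1. Win10 X64
Everything in this article was verified on Win10 x64.
2. Tesseract 4.1 release built from source
This article uses a release build of Tesseract 4.1 compiled from source. Building 4.1 is much like building 4.0, so you can follow the guide Tesseract4.0+VS2017+win10源码编译攻略 (Tesseract 4.0 + VS2017 + Win10 source build guide). Once the build is compiled and installed, the training tools are installed with it and no further steps are needed.
Here is the download link for my compiled Tesseract 4.1 build with the training tools included, for anyone who needs it: https://drive.google.com/file/d/1ALfBsy5C2l9vJkJ_treAqCvcwS7cfoLX/view?usp=sharing
3. Git for Windows
After installing Git for Windows, add the directory that contains Git/bin to the Path environment variable.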
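Before going further, it can help to confirm that sh and the training tools are actually reachable from PATH. The function below is a minimal sketch for that check; the name check_tools is hypothetical and not part of tesstrainsh-win.

```shell
# Report which of the named commands can be found on PATH.
# Prints one line per tool and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call from the Git shell would be: check_tools sh tesseract text2image lstmtraining combine_tessdata combine_lang_model lstmeval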
References
How to use the tools provided to train Tesseract 4.00
Training/Fine Tuning Tesseract OCR LSTM for New Fonts (https://www.youtube.com/watch?v=TpD76k2HYms)
Thanks to the original authors for sharing.
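Steps for using tesstrainsh-win
In How to use the tools provided to train Tesseract 4.00, tesstrain.sh is combined with training tools such as combine_tessdata and lstmtraining, and each command is typed separately on the command line to complete the LSTM training. The tesstrainsh-win project described here puts all the required training commands into a single .sh file: adjust the parameters in tesstrainDone.sh to your needs and run it. I recommend first running a test with the files that ship with tesstrainsh-win. If that training succeeds, the training environment on your machine works, and you can move on to your own fonts and files.
1. Preparation
1.1 Copy the font files to be trained into tesstrainsh-win/fonts. tesstrain.sh can train several fonts at once. This article uses Impact.ttf as the example, so the font file is tesstrainsh-win/fonts/Impact.ttf.
1.2 Copy the language files for the font's language into tesstrainsh-win/langdata_lstm. The font trained here is English, so download everything in the langdata_lstm/eng folder and put it in tesstrainsh-win/langdata_lstm/eng. To train Simplified Chinese, create a chi_sim folder under tesstrainsh-win/langdata_lstm, then download everything in langdata_lstm/chi_sim and put it in tesstrainsh-win/langdata_lstm/chi_sim.
1.3 Copy the base traineddata for the font's language into tesstrainsh-win/tessdata. In this article that file is tesstrainsh-win/tessdata/eng.traineddata. The .traineddata file must be downloaded from the github/tesseract-ocr/tessdata_best project.
1.4 Copy lstm.train into tesstrainsh-win/tessdata/configs. The file is in the Tesseract installation directory under C:/Program Files/tesseract/tessdata/configs. tesstrainsh-win already includes it, but mine comes from Tesseract 4.1. If you use training tools from another version, keep this file at the same version: find it in your installation directory and copy it over the existing file in tesstrainsh-win/tessdata/configs.
1.5 The three files tesstrain.sh, tesstrain_utils.sh and language-specific.sh in the tesstrainsh-win project were copied from the Tesseract source (Tesseract/src/training) and are the Tesseract 4.1 release versions. If you use training tools from another version, keep these three files at the same version as well: find them in the source tree and copy them over the existing files in tesstrainsh-win.
1.6 Download radical-stroke.txt from langdata_lstm and copy it into tesstrainsh-win/langdata_lstm. tesstrainsh-win already includes this file. radical-stroke.txt has not been updated for two years, so it should work with every Tesseract version that has LSTM.
The file structure of the tesstrainsh-win project is shown in the figure below: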
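The layout implied by steps 1.1 to 1.6 can also be checked from the shell. This is a minimal sketch: the function name check_layout is hypothetical, and the path list simply restates the locations named in the steps above.

```shell
# Check that the files described in steps 1.1 to 1.6 exist under ROOT.
# LANG_CODE is the language folder name, for example eng or chi_sim.
# Prints each missing path and returns non-zero if anything is missing.
check_layout() {
    root=$1
    lang=$2
    status=0
    for path in \
        "fonts" \
        "langdata_lstm/$lang" \
        "langdata_lstm/radical-stroke.txt" \
        "tessdata/$lang.traineddata" \
        "tessdata/configs/lstm.train" \
        "tesstrain.sh" \
        "tesstrain_utils.sh" \
        "language-specific.sh"
    do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}
```

From inside the project directory, check_layout . eng prints nothing when every expected path is present.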
2. Open Command Prompt (cmd.exe) as administrator and change to the tesstrainsh-win directory.
3. sh tesstrainDone.sh
Running this command may produce the following error:
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
ERROR: Program Program failed. Abort.
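Cause of the error:
Unix and Mac OS use LF as the line ending, while Windows uses CRLF. To handle this, Git on Windows automatically converts CRLF to LF when committing, and converts LF to CRLF when checking out or cloning to the local machine.
However, running .sh files with sh requires LF line endings, hence the error.
Solution:
Change the line endings of the files under tesstrainsh-win/langdata_lstm/lang to LF. There are many ways to batch-convert line endings on Windows, so use whichever you prefer.
My own method is rather clumsy:
Open each file in Notepad++, then Edit -> EOL Conversion -> Unix (LF) -> Save.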
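As an alternative to converting files one by one, the conversion can be done in bulk from the Git shell. This is a minimal sketch, not the method used in this article; the function name to_lf is hypothetical, and it should only be pointed at text files, since it strips every carriage return.

```shell
# Rewrite every regular file directly inside DIR without carriage returns,
# which turns CRLF line endings into LF. Subdirectories are skipped.
to_lf() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        tmp="$f.lf.tmp"
        tr -d '\r' < "$f" > "$tmp" && mv "$tmp" "$f"
    done
}
```

For the English files used here the call would be: to_lf langdata_lstm/eng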
When the command finishes, it prints the following:
Finished! Error rate = 0.864
Loaded file output/impact_checkpoint, unpacking...
During the run, two folders are created under tesstrainsh-win: train and output.
train: holds all intermediate files produced during training, such as the .box, .tif and .lstmf files.
output: holds the periodic checkpoints, the best checkpoint, and the final Impact.traineddata.
4. sh eval.sh
This step uses lstmeval to evaluate the training result. It is not required for training a new font; you can skip it or check the result some other way.
lstmeval takes the following parameters:
lstmeval \
  --model lang.lstm|modelname_checkpoint|modelname_N.NN_NN_NN.checkpoint|lang.traineddata \
  [--traineddata lang/lang.traineddata] \
  --eval_listfile lang.eval_files.txt
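--model: the model file to evaluate. It can be a .lstm, .checkpoint or .traineddata file.
If it is a .lstm or .checkpoint file, the optional --traineddata parameter must be set, and it must be the traineddata file that was given to lstmtraining.
If it is a .traineddata file, --traineddata does not need to be set.
--eval_listfile: the list of files to evaluate.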
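The rule about when --traineddata is needed can be written down as a small helper. This is only an illustrative sketch: lstmeval_cmd is a hypothetical name, and the function prints the command line for inspection rather than running lstmeval. Paths containing spaces would need extra quoting.

```shell
# Print the lstmeval command line for MODEL, adding --traineddata only
# when MODEL is not itself a .traineddata file.
lstmeval_cmd() {
    model=$1
    traineddata=$2
    listfile=$3
    case "$model" in
        *.traineddata)
            echo "lstmeval --model $model --eval_listfile $listfile" ;;
        *)
            echo "lstmeval --model $model --traineddata $traineddata --eval_listfile $listfile" ;;
    esac
}
```

For example, lstmeval_cmd output/Impact.traineddata tessdata/eng.traineddata train/eng.training_files.txt prints a command without --traineddata, while passing train/eng.lstm as the model keeps it.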
Code in eval.sh for evaluating the eng.traineddata downloaded from tessdata_best:
lstmeval \
  --model train/eng.lstm \
  --traineddata tessdata/eng.traineddata \
  --eval_listfile train/eng.training_files.txt
Evaluation result:
At iteration 0, stage 0, Eval Char error rate=2.4431753, Word error rate=7.782621
Code in eval.sh for evaluating the newly trained model:
lstmeval \
  --model output/Impact_checkpoint \
  --traineddata tessdata/eng.traineddata \
  --eval_listfile train/eng.training_files.txt
Evaluation result:
At iteration 0, stage 0, Eval Char error rate=0.29844455, Word error rate=0.85348778
The results show a clear drop in both the char error rate and the word error rate for the trained model. tesstrainsh-win currently sets the number of training iterations to 500; raising it should lower the error rate further.
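To put a number on the improvement, the relative reduction between two error rates can be computed with awk. This is a small sketch; rel_reduction is a hypothetical helper name.

```shell
# Print the relative reduction from BEFORE to AFTER as a percentage
# with one decimal place: (before - after) / before * 100.
rel_reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}
```

With the figures above, the calls would be rel_reduction 2.4431753 0.29844455 for the char error rate and rel_reduction 7.782621 0.85348778 for the word error rate.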
tesstrainsh-win parameters
Next, we go through the commands used in tesstrainDone.sh of tesstrainsh-win and their parameters.
mkdir -p train
rm -rf train/*

# Generate the training data (.box/.tif/.lstmf) for the font
sh tesstrain.sh \
  --fonts_dir fonts \
  --fontlist 'Impact Condensed' \
  --lang eng \
  --linedata_only \
  --langdata_dir langdata_lstm \
  --tessdata_dir tessdata \
  --save_box_tiff \
  --maxpages 10 \
  --output_dir train

mkdir -p output
rm -rf output/*

# Extract the LSTM model from the base traineddata
combine_tessdata -e tessdata/eng.traineddata train/eng.lstm

# Continue training the extracted model on the generated data
lstmtraining \
  --continue_from train/eng.lstm \
  --model_output output/Impact \
  --traineddata tessdata/eng.traineddata \
  --train_listfile train/eng.training_files.txt \
  --debug_interval -1 \
  --max_iterations 500

# Pack the checkpoint into the final traineddata
lstmtraining --stop_training \
  --continue_from output/impact_checkpoint \
  --traineddata tessdata/eng.traineddata \
  --model_output output/Impact.traineddata
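When switching to another font or language, the same values have to be changed in several places of the script. The sketch below prints (but does not run) the four commands for a given font name, language code, output model name and iteration count, so the substitutions can be reviewed first. print_plan and its MODEL argument are hypothetical; the model name is deliberately kept free of spaces, because an unquoted name with spaces would be split into several arguments by the shell.

```shell
# Print the commands of tesstrainDone.sh for FONT, LANG, MODEL and ITERS.
# Nothing is executed; the output is meant for review or copy and paste.
print_plan() {
    font=$1; lang=$2; model=$3; iters=$4
    cat <<EOF
sh tesstrain.sh --fonts_dir fonts --fontlist '$font' --lang $lang --linedata_only --langdata_dir langdata_lstm --tessdata_dir tessdata --save_box_tiff --maxpages 10 --output_dir train
combine_tessdata -e tessdata/$lang.traineddata train/$lang.lstm
lstmtraining --continue_from train/$lang.lstm --model_output output/$model --traineddata tessdata/$lang.traineddata --train_listfile train/$lang.training_files.txt --debug_interval -1 --max_iterations $iters
lstmtraining --stop_training --continue_from output/${model}_checkpoint --traineddata tessdata/$lang.traineddata --model_output output/$model.traineddata
EOF
}
```

For the example in this article the call would be: print_plan 'Impact Condensed' eng Impact 500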
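The meaning of each parameter used in tesstrainDone.sh is explained below.
tesstrain.sh parameters
--fonts_dir: Path to font files
The directory that holds the font files.
--fontlist: A list of fontnames to train on.
The names of the fonts to train. tesstrain.sh can train several fonts at once. On Windows 10 the available fonts can be listed with:
text2image --list_available_fonts --fonts_dir C:/windows/fonts
--lang: ISO 639 code.
The language code of the font being trained.
--linedata_only: Only generate training data for lstmtraining.
Generates only the training data used by lstmtraining. Without this flag, tesstrain.sh generates data for non-LSTM training.
--langdata_dir: Path to tesseract/training/langdata directory.
If --lang is eng, set this to the local directory that holds the langdata_lstm/eng training files; in tesstrainDone.sh this is langdata_lstm.
--tessdata_dir: Path to tesseract/tessdata directory.
The directory of the existing traineddata files, which are used for feature extraction during training. If it is not set, the path defined in TESSDATA_PREFIX is used.
--save_box_tiff: Save box/tiff pairs along with lstmf files.
Keeps the .box and .tif files generated during training.
--maxpages: Specify maximum pages to output (default:0=all).
Limits how much of the training_text file is rendered, which keeps memory use down during training.
--output_dir: Location of output traineddata file.
Where the files generated during training are written.
The parameters above are those set for tesstrain.sh in tesstrainDone.sh. For the full parameter list, open tesstrainsh-win/tesstrain.sh.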
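Since --fontlist must name a font that text2image can actually find, it is worth checking the name against the font listing before starting a run. The helper below is a minimal sketch: font_in_list is a hypothetical name, and it only assumes that the listing contains one font name per line. It does a case-insensitive substring match, so a short name such as Impact also matches Impact Condensed.

```shell
# Succeed if NAME appears anywhere in the font listing read from
# standard input (case-insensitive, fixed-string match).
font_in_list() {
    grep -qiF -- "$1"
}
```

One possible use, redirecting stderr as well in case the listing is printed there: text2image --list_available_fonts --fonts_dir fonts 2>&1 | font_in_list 'Impact Condensed' && echo "font is available"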
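tesstrain.sh has other parameters that tesstrainDone.sh does not use. They are noted here so they can be set later as needed.
--my_boxtiff_dir: Location of user specified box/tiff files.
You can point this at box/tiff files you have prepared yourself, but the file names must follow ${LANG_CODE}.${fontname}.exp${EXPOSURE}.box/tif. Judging from the code, once this parameter is set, your own .box/.tif files are trained together with the .box/.tif files generated by text2image; they are not trained on their own. To train only on your own .box/.tif files, you need to modify tesstrain.sh or enter the individual commands by hand.
--training_text: Text to render and use for training.
The path of the file containing the text or characters to train. If it is not set, the file is looked up in the --langdata_dir directory. The value is passed to text2image, where it corresponds to the --text parameter used in Win10 Tesseract4.1 LSTM training.
In short, you can prepare and specify your own character set for training.
--wordlist: Word list for the language ordered by decreasing frequency.
A word list for the language sorted by decreasing frequency; you can prepare your own as needed.
--xsize: Specify width of output image (default:3600).
Sets the width of the image produced by text2image; the default is 3600.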
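The naming pattern required by --my_boxtiff_dir can be generated rather than typed. This sketch assumes that spaces in the font name are replaced by underscores, which matches file names such as eng.Headliner_No._45.exp0.tif that tesstrain.sh produces itself; boxtiff_base is a hypothetical helper name.

```shell
# Build a base name in the ${LANG_CODE}.${fontname}.exp${EXPOSURE} pattern,
# replacing spaces in the font name with underscores.
boxtiff_base() {
    lang=$1
    font=$(printf '%s' "$2" | tr ' ' '_')
    exposure=$3
    printf '%s.%s.exp%s\n' "$lang" "$font" "$exposure"
}
```

Appending .box or .tif to the result of boxtiff_base eng 'Impact Condensed' 0 gives the two file names for one page pair.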
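combine_tessdata
For the full usage of combine_tessdata, see its documentation: combine_tessdata. In tesstrainDone.sh it extracts eng.lstm from eng.traineddata for lstmtraining to use.
lstmtraining
For the full usage of lstmtraining, see its documentation: lstmtraining. Its role in tesstrainDone.sh is much the same as in Win10 Tesseract4.1 LSTM training, so it is not repeated here.
That is the end of this article. Thanks for reading and for your support.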
Hi…Thanks for the amazing article:
However i get the below error when i run the tesstrainsh-win.
============================================================
kswapnil@INEL2WK17TX022 MINGW64 /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesstrainsh-win-master
$ sh tesstrainDone.sh
=== Starting training for language ‘eng’
which: no text2image in (/g//bin:/mingw64/bin:/usr/local/bin:/usr/bin:/bin:/mingw64/bin:/usr/bin:/g/bin:/c/Program Files (x86)/Microsoft SDKs/Azure/CLI2/wbin:/c/Program Files/Microsoft MPI/Bin:/c/Program Files (x86)/Common Files/Oracle/Java/javapath:/c/WINDOWS/system32:/c/WINDOWS:/c/WINDOWS/System32/Wbem:/c/WINDOWS/System32/WindowsPowerShell/v1.0:/c/Program Files (x86)/ATI Technologies/ATI.ACE/Core-Static:/c/Program Files/TortoiseSVN/bin:/c/Program Files/Microsoft SQL Server/130/Tools/Binn:/c/Program Files/Microsoft SQL Server/Client SDK/ODBC/170/Tools/Binn:/c/Program Files/helm/windows-amd64:/c/Program Files/nodejs:/c/Program Files/dotnet:/cmd:/c/Python27/Scripts:/c/Program Files/poppler-0.68.0_x86/poppler-0.68.0/bin:/c/Program Files/PuTTY:/c/Users/kswapnil/AppData/Local/Programs/Python/Python38-32/Scripts:/c/Users/kswapnil/AppData/Local/Programs/Python/Python38-32:/c/Users/kswapnil/AppData/Local/Microsoft/WindowsApps:/c/Users/kswapnil/.dotnet/tools:/c/Users/kswapnil/AppData/Roaming/npm:/usr/bin/vendor_perl:/usr/bin/core_perl)
which: no text2image in (./api)
which: no text2image in (./training)
ERROR: ‘text2image’ not found
tesstrainDone.sh: line 21: combine_tessdata: command not found
tesstrainDone.sh: line 26: lstmtraining: command not found
tesstrainDone.sh: line 35: lstmtraining: command not found
=========================================
I just followed the instructions mentioned and didnt happen to change anything in the tesstrainsh-win file.
The Tesseract in my machine is located in “C:\Users\kswapnil\AppData\Local\Tesseract-OCR”
I placed all the files in the same location and ran again..However i get the same error.I checked and i do have text2image file in the tesseract directory.
Please let me know what im missing out on.
Thanks in Advance
Regards,
Swapnil
Hi, could you enter the following command in the command prompt:
tesseract --version
and then tell me what it displays, please?
Hi,
It says :
tesseract v5.0.0-alpha.20200328
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0
==========================================
I guess i have version 5..would u want me to use version 4 and give it a try?
/g//bin:/mingw64/bin:/usr/local/bin:/usr/bin:/bin:/mingw64/bin:/usr/bin:/g/bin:/c/Program Files (x86)/Microsoft SDKs/Azure/CLI2/wbin:/c/Program Files/Microsoft MPI/Bin:/c/Program Files (x86)/Common Files/Oracle/Java/javapath:/c/WINDOWS/system32:/c/WINDOWS:/c/WINDOWS/System32/Wbem:/c/WINDOWS/System32/WindowsPowerShell/v1.0:/c/Program Files (x86)/ATI Technologies/ATI.ACE/Core-Static:/c/Program Files/TortoiseSVN/bin:/c/Program Files/Microsoft SQL Server/130/Tools/Binn:/c/Program Files/Microsoft SQL Server/Client SDK/ODBC/170/Tools/Binn:/c/Program Files/helm/windows-amd64:/c/Program Files/nodejs:/c/Program Files/dotnet:/cmd:/c/Python27/Scripts:/c/Program Files/poppler-0.68.0_x86/poppler-0.68.0/bin:/c/Program Files/PuTTY:/c/Users/kswapnil/AppData/Local/Programs/Python/Python38-32/Scripts:/c/Users/kswapnil/AppData/Local/Programs/Python/Python38-32:/c/Users/kswapnil/AppData/Local/Microsoft/WindowsApps:/c/Users/kswapnil/.dotnet/tools:/c/Users/kswapnil/AppData/Roaming/npm:/usr/bin/vendor_perl:/usr/bin/core_perl
==========================================
The above should be what is in your PATH. Is that right? Try adding the installation path of Tesseract to the PATH environment variable.
This is probably because the installation path of Tesseract is not in the PATH environment variable, so the .sh file could not find the Tesseract tools.
Okay! Do you know where do i put the Tesseract Path ?
Do it the same way you set “c/Users/kswapnil/.dotnet/tools”.
1. Right-click the Computer icon and choose Properties, or in Windows Control Panel, choose System.
2. Choose Advanced system settings.
3. On the Advanced tab, click Environment Variables.
4. Add “C:\Users\kswapnil\AppData\Local\Tesseract-OCR\bin” to your Path.
Make sure that text2image.exe, lstmtraining.exe, combine_tessdata.exe, etc. are in your “C:\Users\kswapnil\AppData\Local\Tesseract-OCR\bin”.
Gotcha!!I will add the path to the environment variable.Strangely there is no bin folder in the Tesseract OCR installation Directory.I will manually create the Bin Folder and place those files instead.
Is your Tesseract compiled by yourself, or from the installed version of UB-Mannheim?
If it is the UB-Mannheim version, please make sure that the text2image.exe, lstmtraining.exe, combine_tessdata.exe, combine_lang_model.exe, tesseract.exe, wordlist2dawg.exe and dawg2wordlist.exe files are in your Tesseract installation.
I installed Tesseract from UB-Mannheim.
I added the Tesseract Path
Below are my path variables
===================================================
Microsoft Windows [Version 10.0.17134.1425]
(c) 2018 Microsoft Corporation. All rights reserved.
G:\>SET
ALLUSERSPROFILE=C:\ProgramData
APPDATA=C:\Users\kswapnil\AppData\Roaming
asl.log=Destination=file
CLIENTNAME=INEL2LPHXVGBX1
CommonProgramFiles=C:\Program Files\Common Files
CommonProgramFiles(x86)=C:\Program Files (x86)\Common Files
CommonProgramW6432=C:\Program Files\Common Files
COMPUTERNAME=INEL2WK17TX022
ComSpec=C:\WINDOWS\system32\cmd.exe
DriverData=C:\Windows\System32\Drivers\DriverData
HOMEDRIVE=G:
HOMEPATH=\
HOMESHARE=\\***********\Users\Home1\kswapnil
LOCALAPPDATA=C:\Users\kswapnil\AppData\Local
LOGONSERVER=\\********
MSMPI_BIN=C:\Program Files\Microsoft MPI\Bin\
NUMBER_OF_PROCESSORS=8
OneDrive=C:\Users\kswapnil\OneDrive – *******
OneDriveCommercial=C:\Users\kswapnil\OneDrive – ********
OPENCV_DIR=C:\OpenCV\opencv\build
OS=Windows_NT
Path=C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin;C:\Program Files\Microsoft MPI\Bin\;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\ATI Technologies\ATI.ACE\Core-Static;C:\Program Files\TortoiseSVN\bin;C:\Program Files\Microsoft SQL Server\130\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files\helm\windows-amd64;C:\Program Files\nodejs\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Python27\Scripts;C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin;C:\Program Files\PuTTY\;C:\Users\kswapnil\AppData\Local\Tesseract-OCR\bin\;C:\Users\kswapnil\AppData\Local\Programs\Python\Python38-32\Scripts\;C:\Users\kswapnil\AppData\Local\Programs\Python\Python38-32\;C:\Users\kswapnil\AppData\Local\Microsoft\WindowsApps;C:\Users\kswapnil\.dotnet\tools;C:\Users\kswapnil\AppData\Roaming\npm
PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
PROCESSOR_ARCHITECTURE=AMD64
PROCESSOR_IDENTIFIER=Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
PROCESSOR_LEVEL=6
PROCESSOR_REVISION=3c03
ProgramData=C:\ProgramData
ProgramFiles=C:\Program Files
ProgramFiles(x86)=C:\Program Files (x86)
ProgramW6432=C:\Program Files
PROMPT=$P$G
PSDistrict_BootImageVersion=US200364
PSDistrict_DeploymentID=US200348
PSDistrict_InstallationDate=20190211-22:25:51
PSDistrict_InstallationMethod=PXE
PSDistrict_SiteCode=US2
PSDistrict_TattooScriptVersion=1.4.3
PSDistrict_TSID=US200348
PSDistrict_TSName=Windows 10 Enterprise (1803)
PSModulePath=C:\Program Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules
PUBLIC=C:\Users\Public
SESSIONNAME=RDP-Tcp#8
snow_agent=C:\Program Files\Snow Software\Inventory\Agent
SNOW_INVENTORY_HOME=C:\Program Files\INVENTORYCLIENT\
SystemDrive=C:
SystemRoot=C:\WINDOWS
TEMP=C:\Users\kswapnil\AppData\Local\Temp
TMP=C:\Users\kswapnil\AppData\Local\Temp
UATDATA=C:\WINDOWS\CCM\UATData\D9F8C395-CAB8-491d-B8AC-179A1FE1BE77
UIPATH_LANGUAGE=en
UIPATH_USER_SERVICE_PATH=C:\Users\kswapnil\AppData\Local\UiPath\app-20.4.0-beta0472\UiPath.Service.UserHost.exe
USERDNSDOMAIN=IN.MOOG.COM
USERDOMAIN=IN
USERDOMAIN_ROAMINGPROFILE=IN
USERNAME=kswapnil
USERPROFILE=C:\Users\kswapnil
windir=C:\WINDOWS
==================================
I’m on a windows 10 machine hence,im trying to run tesstrainDone.sh directly
Below is the error i get now when i run the same.
===================================
kswapnil@INEL2WK17TX022 MINGW64 /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesstrainsh-win-master
$ sh tesstrainDone.sh
=== Starting training for language ‘eng’
[Wed, May 6, 2020 8:42:55 AM] /c/Users/kswapnil/AppData/Local/Tesseract-OCR/bin/text2image --fonts_dir=fonts --font=Impact Condensed --outputbase=/tmp/font_tmp.r3xnM3fy7v/sample_text.txt --text=/tmp/font_tmp.r3xnM3fy7v/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.r3xnM3fy7v
ERROR: Program text2image failed. Abort.
C:/Users/kswapnil/AppData/Local/Tesseract-OCR/bin/combine_tessdata.exe: error while loading shared libraries: libtesseract-5.dll: cannot open shared object file: No such file or directory
C:/Users/kswapnil/AppData/Local/Tesseract-OCR/bin/lstmtraining.exe: error while loading shared libraries: liblept-5.dll: cannot open shared object file: No such file or directory
C:/Users/kswapnil/AppData/Local/Tesseract-OCR/bin/lstmtraining.exe: error while loading shared libraries: liblept-5.dll: cannot open shared object file: No such file or directory
=======================================================
If i run tesstrain.sh then i get the below error:
kswapnil@INEL2WK17TX022 MINGW64 /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesstrainsh-win-master
$ sh tesstrain.sh
tesstrain.sh: line 19: LANG_CODE: unbound variable
================================================================
I’m not sure what i’ m missing on…
I think the Tesseract build from UB-Mannheim may not contain the complete training tools. The error messages show that the necessary DLL files could not be found.
Could you give me your email? I will send you the compiled version 4.1 so you can try again.
Sure…My Email is swapnil.kadam950@gmail.com
Please check your email. And tell me the test results in here please. Thanks.
Received the email..will follow the instructions and will update you..Thank you so much for the help
Hi…I followed the instructions and ran the steps accordingly.
Below is the output that i receive after i ran sh tesstrain.sh
At iteration 128/500/500, Mean rms=0.386%, delta=0.253%, char train=0.806%, word train=2.389%, skip ratio=0%, New best char error = 0.806 wrote best model:output/Impact0.806_128.checkpoint wrote checkpoint.
Finished! Error rate = 0.806
Loaded file output/impact_checkpoint, unpacking…
===================================
What is the accuracy % after i ran the file? I’m assuming the error rate is 0.8%…
===================================
Also i ran eval.sh….below is the final output after running it…
At iteration 0, stage 0, Eval Char error rate=2.4381503, Word error rate=7.7839609
===========================
The Word error Rate seems to be high..
Can you please tell me how do i increase the accuracy of the OCR?
I'd be very happy if you sent it to me too.
Can’t download from your link, it needs registering on WeChat. Other possibilities to download (authorize via github or smth else) are not working atm
have same issue 🙁
1. The eval.sh in tesstrainsh-win tests eng.traineddata, not the new traineddata. To test the new traineddata, you should update the code in eval.sh according to my blog or the Readme on GitHub.
The test results for the new .traineddata should be:
At iteration 0, stage 0, Eval Char error rate=0.29844455, Word error rate=0.85348778
2. If you want to lower the error rate, you can increase the “max_iterations” parameter in tesstrainDone.sh.
--max_iterations 500
For how the training result changes with this parameter, see: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html
Hi..I followed the instructions and performed around 800 iterations.
Below is the output received after i run eval.sh:
Truth:Login STOCKS ON that manufacturers Oldham’s Lucia Damaging 07MIN The =>
OCR :Login STOCKS ON that manufacturers Oldham’s Lucia Damaging O7MIN The =>
At iteration 0, stage 0, Eval Char error rate=0.10777742, Word error rate=0.27715935
=========================================================
I guess the error rate has decreased a lot:
1)So I believe if i need to train on another font …i have to place the font file in the Fonts folder.
2)Change this parameter " --fontlist 'Impact Condensed' \" in tesstrainDone.sh
Are my steps correct?
Also with the latest trained data i have problems with 1,i,l,] the OCR doesnt recognize it correctly..do u have any suggestions?
Regards,
Swapnil
1. Your steps to train another font are right. You should modify the parameters as described in the comments in tesstrainDone.sh on my GitHub.
2. About 1,i,l,] , maybe you could find an answer in the Tesseract issues or Google group. I have no good suggestions.
But when you run OCR with the traineddata, you could set a whitelist or blacklist to get good results, if conditions permit.
Hi i tried training the OCR on the PUBG Font…I followed the instructions but when i run the tesstrainDone.sh file,it gave me an error telling “Please correct --font arg”.I followed all the steps mentioned.
==============================================
Contents in the fonts folder:
kswapnil@INEL2WK17TX022 MINGW64 /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesstrainsh-win-master/fonts
$ ls
Pubg.ttf
=============================================================================
Contents of tesstraindone.sh
mkdir -p train
rm -rf train/*
#Update the value of fontlist according to training fonts name
#Update the value of lang according to training language
sh tesstrain.sh \
--fonts_dir fonts\
--fontlist 'Pubg' \
--lang eng \
--linedata_only \
--langdata_dir langdata_lstm \
--tessdata_dir tessdata \
--save_box_tiff \
--maxpages 10\
--output_dir train
mkdir -p output
rm -rf output/*
#Update eng.traineddata and eng.lstm according to training language to lang.traineddata and lang.lstm
combine_tessdata -e tessdata/eng.traineddata train/eng.lstm
#Update eng.lstm/eng.traineddata/eng.training_files.txt according to training language
#Update output/Impact according to training font name
lstmtraining \
--continue_from train/eng.lstm \
--model_output output/Pubg \
--traineddata tessdata/eng.traineddata \
--train_listfile train/eng.training_files.txt \
--debug_interval -1\
--max_iterations 800
#Update eng.traineddata according to training language
#Update impact_checkpoint and Impact.traineddata according to training font name
lstmtraining --stop_training \
--continue_from output/Pubg_checkpoint \
--traineddata tessdata/eng.traineddata \
--model_output output/Pubg.traineddata
==============================================================
Below is the error i receive when i run the tesstrainDone.sh:
kswapnil@INEL2WK17TX022 MINGW64 /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesstrainsh-win-master
$ sh tesstrainDone.sh
=== Starting training for language ‘eng’
[Thu, May 7, 2020 12:31:22 PM] /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesseract/bin/text2image --fonts_dir=fonts --font=Pubg --outputbase=/tmp/font_tmp.1kjYnEGfls/sample_text.txt --text=/tmp/font_tmp.1kjYnEGfls/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.1kjYnEGfls
Could not find font named ‘Pubg’.
Pango suggested font ‘Headliner No. 45’.
Please correct --font arg.
ERROR: Program text2image failed. Abort.
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
Extracting tessdata components from tessdata/eng.traineddata
Wrote train/eng.lstm
Failed to load list of training filenames from train/eng.training_files.txt
Failed to read continue from: output/Pubg_checkpoint
======================================================================
Is there something that I’m missing out?
1. It says that:
Could not find font named ‘Pubg’.
Pango suggested font ‘Headliner No. 45’.
Please correct --font arg.
2. https://tesseract-ocr.github.io/tessdoc/Fonts says that:
Tesseract training can use images made from text which was rendered with a list of fonts. Those fonts must be available on the host where the training process is running.
3. So I think the Pubg font is not available on your PC. You can check the fonts available on your PC with the following command:
text2image --list_available_fonts --fonts_dir C:/windows/fonts
4. If you just want to test other fonts, you could use its suggestion:
Pango suggested font ‘Headliner No. 45’.
Thank you so much.
We have to place the new font in our font directory as well..that is C:/Windows/Fonts
I placed the new font and tried running tesstrainDone.sh
I receive the below error:
=== Moving lstmf files for training data ===
Moving /tmp/eng-2020-05-08.gWF/eng.Headliner_No._45.exp0.lstmf to train
Created starter traineddata for LSTM training of language ‘eng’
Run ‘lstmtraining’ command to continue LSTM training for language ‘eng’
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
Extracting tessdata components from tessdata/eng.traineddata
Wrote train/eng.lstm
Must provide a --traineddata see training wiki
Must provide a --traineddata see training wiki
======================
do u know what trained data i need to provide?
or Am i missing out on something ?
Regards,
Swapnil M Kadam
You are welcome.
Please post the tesstrainDone.sh that produced the error here.
Below is the error report:
==============================
kswapnil@INEL2WK17TX022 MINGW64 /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesstrainsh-win-master
$ sh tesstrainDone.sh
=== Starting training for language ‘eng’
[Fri, May 8, 2020 4:32:49 PM] /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesseract/bin/text2image --fonts_dir=fonts --font=Headliner No. 45 --outputbase=/tmp/font_tmp.iZrAoqWpZO/sample_text.txt --text=/tmp/font_tmp.iZrAoqWpZO/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.iZrAoqWpZO
Rendered page 0 to file C:/Users/kswapnil/AppData/Local/Temp/font_tmp.iZrAoqWpZO/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Headliner No. 45
[Fri, May 8, 2020 4:33:02 PM] /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesseract/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.iZrAoqWpZO --fonts_dir=fonts --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0 --max_pages=10 --font=Headliner No. 45 --text=langdata_lstm/eng/eng.training_text
Stripped 25 unrenderable words
Rendered page 0 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 36 unrenderable words
Rendered page 1 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 41 unrenderable words
Rendered page 2 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 30 unrenderable words
Rendered page 3 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 26 unrenderable words
Rendered page 4 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 36 unrenderable words
Rendered page 5 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 29 unrenderable words
Rendered page 6 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 19 unrenderable words
Rendered page 7 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 27 unrenderable words
Rendered page 8 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
Stripped 30 unrenderable words
Rendered page 9 to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[Fri, May 8, 2020 4:33:21 PM] /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesseract/bin/unicharset_extractor --output_unicharset /tmp/eng-2020-05-08.BVn/eng.unicharset --norm_mode 1 /tmp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.box
Extracting unicharset from box file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.box
Other case É of é is not in unicharset
Wrote unicharset file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.unicharset
[Fri, May 8, 2020 4:33:24 PM] /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesseract/bin/set_unicharset_properties -U /tmp/eng-2020-05-08.BVn/eng.unicharset -O /tmp/eng-2020-05-08.BVn/eng.unicharset -X /tmp/eng-2020-05-08.BVn/eng.xheights --script_dir=langdata_lstm
Loaded unicharset of size 101 from file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 69 = ~
Writing unicharset to file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=tessdata
[Fri, May 8, 2020 4:33:25 PM] /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesseract/bin/tesseract /tmp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif /tmp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Page 1
Page 2
Loaded 55/55 lines (1-55) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
Page 3
Loaded 110/110 lines (1-110) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
Page 4
Loaded 165/165 lines (1-165) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
Page 5
Loaded 220/220 lines (1-220) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
Page 6
Loaded 275/275 lines (1-275) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
Page 7
Loaded 330/330 lines (1-330) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
Page 8
Loaded 385/385 lines (1-385) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
Page 9
Loaded 440/440 lines (1-440) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
Page 10
Loaded 495/495 lines (1-495) of document C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf
=== Constructing LSTM training data ===
[Fri, May 8, 2020 4:33:50 PM] /c/Users/kswapnil/AppData/Local/Tesseract-OCR/tesseract/bin/combine_lang_model --input_unicharset /tmp/eng-2020-05-08.BVn/eng.unicharset --script_dir langdata_lstm --words langdata_lstm/eng/eng.wordlist --numbers langdata_lstm/eng/eng.numbers --puncs langdata_lstm/eng/eng.punc --output_dir train --lang eng
Loaded unicharset of size 101 from file C:/Users/kswapnil/AppData/Local/Temp/eng-2020-05-08.BVn/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 69 = ~
Config file is optional, continuing…
Failed to read data from: langdata_lstm/eng/eng.config
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
=== Saving box/tiff pairs for training data ===
Moving /tmp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.box to train
Moving /tmp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.tif to train
=== Moving lstmf files for training data ===
Moving /tmp/eng-2020-05-08.BVn/eng.Headliner_No._45.exp0.lstmf to train
Created starter traineddata for LSTM training of language 'eng'
Run 'lstmtraining' command to continue LSTM training for language 'eng'
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
Extracting tessdata components from tessdata/eng.traineddata
Wrote train/eng.lstm
Must provide a --traineddata see training wiki
Must provide a --traineddata see training wiki
================================
Contents of tesstrainDone.sh:
mkdir -p train
rm -rf train/*
#Update the value of fontlist according to training fonts name
#Update the value of lang according to training language
sh tesstrain.sh \
--fonts_dir fonts \
--fontlist 'Headliner No. 45' \
--lang eng \
--linedata_only \
--langdata_dir langdata_lstm \
--tessdata_dir tessdata \
--save_box_tiff \
--maxpages 10 \
--output_dir train
mkdir -p output
rm -rf output/*
#Update eng.traineddata and eng.lstm according to training language to lang.traineddata and lang.lstm
combine_tessdata -e tessdata/eng.traineddata train/eng.lstm
#Update eng.lstm/eng.traineddata/eng.training_files.txt according to training language
#Update output/Impact according to training font name
lstmtraining \
--continue_from train/eng.lstm \
--model_output output/Headliner No. 45 \
--traineddata tessdata/eng.traineddata \
--train_listfile train/eng.training_files.txt \
--debug_interval -1 \
--max_iterations 800
#Update eng.traineddata according to training language
#Update impact_checkpoint and Impact.traineddata according to training font name
lstmtraining --stop_training \
--continue_from output/Headliner No. 45_checkpoint \
--traineddata tessdata/eng.traineddata \
--model_output output/Headliner No. 45.traineddata
==================
Regards,
Swapnil
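The spaces in the font name split the output path into several separate arguments, so lstmtraining never sees the intended path.
1. Change
--model_output output/Headliner No. 45 \
to
--model_output output/Headliner \
2. Change
--continue_from output/Headliner No. 45_checkpoint \
to
--continue_from output/Headliner_checkpoint \
Then please try again.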
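The renaming above can also be done mechanically. The following is a minimal sketch, assuming a hypothetical helper `safe_model_name` that is not part of tesstrainsh-win; it replaces spaces with underscores, the same style that shows up in the file names in the log above (for example eng.Headliner_No._45.exp0).

```shell
#!/bin/sh
# Hypothetical helper (not part of tesstrainsh-win): turn a font name into a
# name without spaces so it can be used in --model_output / --continue_from.
safe_model_name() {
  printf '%s\n' "$1" | tr ' ' '_'
}

FONT='Headliner No. 45'
MODEL=$(safe_model_name "$FONT")

# Print the two arguments that would go into tesstrainDone.sh.
printf '%s\n' "--model_output output/$MODEL"
printf '%s\n' "--continue_from output/${MODEL}_checkpoint"
```

Any name without spaces works equally well; the shorter `Headliner` suggested above is just as valid.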
Thank you for your input. It worked. I'm currently training on the Headliner font; I will finish the training and see how it works out.
Regards,
Swapnil
You are welcome.
I would be very happy if you could send it to me too.
My mail: dilvish.john@gmail.com
I can't download from your link; it requires registering on WeChat. The other options, such as authorizing via GitHub, are not working at the moment.
I have the same issue 🙁
Do you mean the compiled Tesseract 4.1 build? Here is the Google Drive link:
https://drive.google.com/file/d/1ALfBsy5C2l9vJkJ_treAqCvcwS7cfoLX/view?usp=sharing
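When training chi_sim, the result does not contain the characters "勐" and "箐", and I found that langdata_lstm/chi_sim does not contain them either. How can I add these characters? Please advise.
Hello. Add the missing characters to chi_sim/chi_sim.unicharset and to the chi_sim.training_text file that you prepare yourself.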
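First of all, thank you very much for your kind answer.
Because of network restrictions, I cannot find much useful material through Baidu.
1. I need compatibility with Windows (XP, 2008) and C++, and I am currently on Windows 7. If I train on CentOS Linux, can the result be ported to Windows?
2. Below is the file list of langdata_lstm/chi_sim. Is it enough to modify only chi_sim.unicharset and chi_sim.training_text? The set of Chinese characters, letters and symbols involved is fairly small, so can I simply trim chi_sim.unicharset down?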
chi_sim.config
chi_sim.numbers
chi_sim.punc
chi_sim.singles_text
chi_sim.training_text
chi_sim.unicharambigs
chi_sim.unicharset
chi_sim.wordlist
desired_characters
forbidden_characters
okfonts.txt
3. Where can I find a description of the chi_sim.training_text format shown below?
髋 1 0,255,0,255,0,0,0,0,0,0 Han 4020 0 4020 髋 # 髋 [9acb ]x
% 10 0,255,0,255,0,0,0,0,0,0 Common 4021 4 4021 % # % [25 ]p
Hello,
Sorry, one more point. With tessdata_best/chi_sim.traineddata downloaded from GitHub, the training command below fails with the error shown. Can the training method described above solve this problem?
lstmtraining --traineddata="C:\tmp\zy\chi_sim.traineddata" --model_output="C:\tmp\zy\output" --continue_from="C:\tmp\zy\chi_sim.lstm" --train_listfile="C:\tmp\zy\chi_sim.training_files.txt" --eval_listfile="C:\tmp\zy\chi_sim.training_files.txt" --net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]' --debug_interval -1 --max_iterations 3000
Error:
Can't encode transcription: '勐' in language ''
Can't encode transcription: '箐' in language ''
@愤怒的小马:
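1. I have not tried it, so I cannot give a definite answer, but I think it should work.
2. You only need to modify chi_sim.unicharset (the encodings of the characters to train) and chi_sim.training_text (the text to train on). I am not sure whether chi_sim.unicharset can be trimmed down; you could try it.
3. What you listed looks like the content of chi_sim.unicharset. I do not know where it is documented in detail; you could look in the Tesseract source code or the related issues.
Also, looking at the content of chi_sim.unicharset, you probably do not need to understand the meaning of every field. The main thing is to confirm the encoding of the Chinese characters you want to add.
chi_sim.unicharset most likely uses the Unicode encoding of the characters. Pick a few Chinese characters in chi_sim.unicharset as samples, confirm whether they are written as UTF-8 or UTF-32, and then add the characters you need.
4. This error is reported because your chi_sim.unicharset does not contain the encodings of these characters. Similar errors are discussed in the Tesseract issues.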
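In the unicharset lines quoted in this thread, the bracketed value at the end of each entry (for example `[25 ]` for `%` and `[9acb ]` for 髋) appears to be the Unicode code point in hexadecimal. The following is a small sketch for computing that value for a new character from Git Bash or a Linux shell. The helper name `codepoint_hex` is made up for illustration, and the sketch assumes `iconv` and `od` are available and that the input is a single UTF-8 character.

```shell
#!/bin/sh
# Hypothetical helper: print the Unicode code point of ONE character as
# lowercase hex. The input is assumed to be UTF-8.
codepoint_hex() {
  printf '%s' "$1" \
    | iconv -f UTF-8 -t UTF-32BE \
    | od -An -tx1 \
    | tr -d ' \n' \
    | sed 's/^0*//'
}

codepoint_hex '%'; echo   # prints 25, matching "[25 ]" in the entry above
```

Running it on a character you plan to add and comparing the result with the bracketed value you typed is a quick way to catch a mistyped entry.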
Hello!
I trained with the procedure below and still get the same kind of error.
I modified chi_sim.training_text and chi_sim.unicharset.
My guess is that the problem is caused by running combine_tessdata -e /root/chi_sim/tessdata/chi_sim.traineddata /root/chi_sim/lstmf/chi_sim.lstm on the chi_sim.traineddata from tessdata_best.
Training procedure:
./tesstrain.sh --fonts_dir /usr/share/fonts/chinese --fontlist 'SimHei' --lang chi_sim --linedata_only --save_box_tiff --noextract_font_properties --langdata_dir /root/chi_sim/langdata_lstm --tessdata_dir /root/chi_sim/tessdata --output_dir /root/chi_sim/lstmf
combine_tessdata -e /root/chi_sim/tessdata/chi_sim.traineddata /root/chi_sim/lstmf/chi_sim.lstm
lstmtraining --continue_from /root/chi_sim/lstmf/chi_sim.lstm --model_output /root/chi_sim/lstmf/SimHei --traineddata /root/chi_sim/tessdata/chi_sim.traineddata --train_listfile /root/chi_sim/lstmf/chi_sim.training_files.txt --debug_interval -1 --max_iterations 500
Error:
Can't encode transcription: '调10.02.2 220KV元江变10.0.2.4 伊萨河一级水电站10.0.2.22 35KV勐仰变10.0.2.14 伊萨河二级水电站10.0.2.23 南溪河电站10.0.2.15 110KV澧江变' in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffff9e ffffffa4 ffffffe5 ffffff8f ffffff98 31 20 30 2e 30 32 2e 31 31 20 ffffffe5 ffffff85 ffffff83 ffffffe6 ffffffb1 ffffff9f ffffffe5 ffffff8e ffffffbf ffffffe8 ffffffb0 ffffff83 31 30 2e 30 2e 32 2e 32 20 33 35 4b 56 ffffffe7 ffffffbe ffffff8a ffffffe8 ffffffa1 ffffff97 ffffffe5 ffffff8f ffffff98 31 30 2e 30 2e 32 2e 31 30 20 32 32 30 4b 56 ffffffe5 ffffff85 ffffff83 ffffffe6 ffffffb1 ffffff9f ffffffe5 ffffff8f ffffff98 31 30 2e 30 2e 32 2e 34 20 ffffffe5 ffffffb0 ffffff8f ffffffe6 ffffffb2 ffffffb3 ffffffe5 ffffffba ffffff95 ffffffe4 ffffffb8 ffffff80 ffffffe7 ffffffba ffffffa7
@愤怒的小马
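Judging from the error message you posted, the error is still related to character encoding. I suggest troubleshooting as follows:
1. Use the content downloaded from tessdata_best/chi_sim without any changes and first train a different font, to confirm that this training completes normally. If it does, the tools and the environment work and you can move on to the next step. If it does not, first find a way to make this training complete normally.
2. Once step 1 completes normally, confirm the encoding used in chi_sim.unicharset. Then add only one character, train that character alone, and confirm whether the training completes. If it does, the encoding is correct, and you can add more characters to train using the same encoding.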
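Before starting a training run, it can help to check which of the new characters are actually present in the unicharset file. The following is a minimal sketch with two hypothetical helpers, `char_in_unicharset` and `missing_chars`. It assumes each entry starts with the character followed by a space, as in the lines quoted in this thread, and that the first line of the file holds the entry count.

```shell
#!/bin/sh
# Hypothetical helpers, assuming each unicharset entry starts with the
# character followed by a space, as in the lines quoted in this thread.

# Succeeds if character $1 is the first field of an entry in file $2.
# Line 1 is skipped because it is assumed to hold the entry count.
# Note: awk interprets backslash escapes in -v values, so do not pass '\'.
char_in_unicharset() {
  awk -v c="$1" 'NR > 1 && ($1 "") == (c "") { found = 1 } END { exit !found }' "$2"
}

# Prints, one per line, each character (arguments after the file name)
# that has no entry in the unicharset file $1.
missing_chars() {
  file=$1
  shift
  for ch in "$@"; do
    char_in_unicharset "$ch" "$file" || printf '%s\n' "$ch"
  done
}

# Example call, using the path and characters discussed in this thread:
#   missing_chars langdata_lstm/chi_sim/chi_sim.unicharset 勐 箐 垤
```

An empty result only shows that the entries exist as the first field of a line; it does not verify the remaining fields of each entry.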
Thank you for the answer. The previous problem is solved (it was caused by the --old_traineddata and --traineddata arguments of lstmtraining).
The current problem: I trained with Microsoft YaHei and the training reports Error rate = 0, but OCR returns nothing for an image of a single character, and for an image made up of several characters every recognized character is wrong.
Do chi_sim.singles_text, chi_sim.wordlist and chi_sim.training_text all need to be adjusted?
The training steps are as follows.
tesstrain.sh --fonts_dir /usr/share/fonts/chinese \
--fontlist 'SimHei' \
--lang chi_sim \
--linedata_only \
--save_box_tiff \
--noextract_font_properties \
--langdata_dir /root/chi_sim/langdata_lstm \
--tessdata_dir /root/chi_sim/tessdata \
--output_dir /root/chi_sim/lstmf
combine_tessdata -e /root/chi_sim/tessdata/chi_sim.traineddata /root/chi_sim/lstmf/chi_sim.lstm
lstmtraining --continue_from /root/chi_sim/lstmf/chi_sim.lstm \
--model_output /root/chi_sim/lstmf/SimHei \
--old_traineddata /root/chi_sim/tessdata/chi_sim.traineddata \
--traineddata /root/chi_sim/lstmf/chi_sim/chi_sim.traineddata \
--train_listfile /root/chi_sim/chi_sim.training_files.txt \
--max_iterations 3000
lstmtraining --stop_training --traineddata /root/chi_sim/lstmf/chi_sim/chi_sim.traineddata \
--continue_from /root/chi_sim/lstmf/SimHei_checkpoint \
--model_output \root\zy.traineddata
@愤怒的小马
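I am not entirely clear about your current training process and where the problem lies. From your comments, my guess is the following:
1. Goal: chi_sim.traineddata cannot recognize certain special characters, so you want to train a new traineddata, based on chi_sim.traineddata, that can recognize them.
2. Training steps:
2.1 Add the special characters and their encodings to chi_sim.unicharset and chi_sim.training_text.
2.2 Modify tesstrainDone.sh as needed and then run that .sh file.
2.3 Training completes, but the trained file cannot recognize images correctly. (This is the current problem.)
Please confirm whether this guess is correct.
My questions:
1. In the lstmtraining command:
--traineddata should be the file generated by combine_tessdata
--old_traineddata is the unmodified data file downloaded from GitHub
Is it set up as above?
2. Which characters did you add to chi_sim.unicharset? How did you confirm that their encodings are correct?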
The problem: images containing these characters are recognized as complete nonsense.
For example 35KV勐仰变10.0.2.14 稳门大身给你一个
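1. In the lstmtraining command:
--traineddata should be the file generated by combine_tessdata
This file was generated when tesstrain.sh ran;
combine_tessdata only generated chi_sim.lstm:
combine_tessdata -e /root/chi_sim/tessdata/chi_sim.traineddata /root/chi_sim/lstmf/chi_sim.lstm
----------
--old_traineddata is the unmodified data file downloaded from GitHub
Is it set up as above?
This one was obtained from tessdata_best on GitHub.
----------
2. Which characters did you add to chi_sim.unicharset? (垤, 箐, 勐) How did you confirm that their encodings are correct? (They are correct.)
The langdata_lstm directory contains a han.unicharset that includes the encodings of these characters, and they match.
Contents of han.unicharset: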
垤 1 64,76,244,255,187,205,4,12,205,224 Han 3637 0 3637 垤 # 垤 [57a4 ]x
箐 1 56,64,255,255,166,204,3,28,205,224 Han 12283 0 12283 箐 # 箐 [7b90 ]x
勐 1 61,74,253,255,181,197,4,15,205,224 Han 2470 0 2470 勐 # 勐 [52d0 ]x
Contents of chi_sim.unicharset:
1 0,255,0,255,0,0,0,0,0,0 Han 4022 0 4022 箐 # 箐 [7b90 ]x
勐 1 0,255,0,255,0,0,0,0,0,0 Han 4023 0 4023 勐 # 勐 [52d0 ]x
垤 1 0,255,0,255,0,0,0,0,0,0 Han 4024 0 4024 垤 # 垤 [57a4 ]x
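Part of my reply above was hidden:
The problem: images containing these characters are recognized as complete nonsense.
For example, 35KV勐仰变10.0.2.14 is recognized as 稳门大身给你一个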
@愤怒的小马
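If it is convenient, please pack all the files used for training (ideally with a note describing each file) and send them to my mailbox (livezingy@163.com). I will try them when I have time and we can discuss once I have a result.
I have sent the attachment to your mailbox! I have tried many approaches and none of them work. I cannot figure out why. ^_^
@愤怒的小马 Thank you for your trust, but I am very busy at the moment and do not have time to verify this problem yet.
When I do find the time, my verification process will be roughly as follows:
1. Run the training once with the material you sent, to reproduce the behavior you describe.
Expected result: since you have already tried many approaches, my result will very likely match your description.
2. Search github/tesseract/issues, stackoverflow and the tesseract google group with various related keywords (in English). If nothing turns up, ask in several of these channels.
Expected result: some existing descriptions or analyses can probably be found. If not, the chance of getting a useful reply to a new question is not high, so I cannot rely on that alone.
3. Read the tesseract doc and the relevant source code and try to solve the problem.
Steps 2 and 3 have to proceed in parallel, and this will certainly be a long and difficult process. If I make any progress I will reply in this thread promptly.
The above is for reference only. I hope your problem is solved soon.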
@愤怒的小马
Today I tested your training files with Tesseract 4.1 on Win10. The steps were:
1. sh training/tesstrain.sh --fonts_dir "C:\Windows\Fonts" --fontlist 'SimHei' --lang chi_sim --linedata_only --noextract_font_properties --langdata_dir langdata --training_text langdata/chi_sim/chi_sim.training_text --tessdata_dir tessdata --output_dir train
Completed normally, with no warnings.
2. combine_tessdata -e tessdata/chi_sim.traineddata train/chi_sim.lstm
Completed normally, with no warnings.
3. lstmtraining --model_output train --continue_from train/chi_sim.lstm --traineddata train/chi_sim/chi_sim.traineddata --old_traineddata tessdata/chi_sim.traineddata --train_listfile train/chi_sim.training_files.txt --debug_interval -1 --max_iterations 3600
Part of the debug output printed during step 3:
Iteration 0: GROUND TRUTH : 10KV瑗块棬寮€闂墍 10.0.2.8
File train/chi_sim.SimHei.exp0.lstmf line 0 (Perfect):
Mean rms=0.373%, delta=0%, train=0%(0%), skip ratio=0%
Iteration 1: GROUND TRUTH : 娓呮按娌崇數绔?Iteration 1: BEST OCR TEXT : 娓呮按娌崇數绔?File train/chi_sim.SimHei.exp0.lstmf line 1 :
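In other words, the GROUND TRUTH that lstmtraining reads during training appears as the ANSI-encoded form of the corresponding content in chi_sim.training_text, so the training result is probably unusable.
However, I confirmed that chi_sim.training_text is indeed stored as UTF-8. Do you see similar behavior in your training runs?
I also followed the steps of Fine Tuning for ± a few characters at https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html#fine-tuning-for–a-few-characters to train the plus-minus sign (±), and the same thing happened: the GROUND TRUTH of the symbols (including ±, % and so on) all appeared in ANSI form.
For the plus-minus sign (±) training I tried both Tesseract 4.1 and the latest Tesseract 5 (the installer version downloaded from GitHub). With Tesseract 4.1, lstmtraining could not read the GROUND TRUTH of the training content correctly; Tesseract 5 exited abnormally when reading the first GROUND TRUTH.
I searched GitHub and the Tesseract google group for these symptoms and did not find a solution. Sorry, with my current knowledge I cannot solve this problem for you for now.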
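The check that the training text is stored as UTF-8 can also be done from the shell. The following is a rough sketch with two hypothetical helpers, assuming `iconv`, `head -c` and `od` are available (as in Git Bash). The BOM check is an added assumption about something worth ruling out; this thread does not establish whether a BOM affects the training tools.

```shell
#!/bin/sh
# Succeeds if file $1 decodes cleanly as UTF-8 (iconv fails on invalid bytes).
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Succeeds if file $1 starts with the UTF-8 byte order mark EF BB BF.
has_utf8_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Example call, using the path from the steps above:
#   is_valid_utf8 langdata/chi_sim/chi_sim.training_text && echo "valid UTF-8"
#   has_utf8_bom  langdata/chi_sim/chi_sim.training_text && echo "has BOM"
```

This only inspects the file itself. It cannot tell a file-encoding problem apart from a console that merely displays UTF-8 bytes in another code page; the thread leaves that question open.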
This is the error I get when I run this on Windows:
$ sh tesstrainDone.sh
=== Starting training for language 'ben'
[Sun Dec 27 22:55:15 BST 2020] /c/Program Files/Tesseract-OCR/text2image --fonts_dir=fonts --ptsize 12 --font=Siyam Rupali ANSI --outputbase=/tmp/font_tmp.OLhrwsK9tU/sample_text.txt --text=/tmp/font_tmp.OLhrwsK9tU/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.OLhrwsK9tU
Rendered page 0 to file C:/Users/MASWOO~1/AppData/Local/Temp/font_tmp.OLhrwsK9tU/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Siyam Rupali ANSI
[Sun Dec 27 22:55:17 BST 2020] /c/Program Files/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.OLhrwsK9tU --fonts_dir=fonts --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0 --max_pages=10 --font=Siyam Rupali ANSI --ptsize 12 --text=langdata_lstm/ben/ben.training_text
Stripped 407 unrenderable words
Rendered page 0 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 436 unrenderable words
Rendered page 1 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 418 unrenderable words
Rendered page 2 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 424 unrenderable words
Rendered page 3 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 423 unrenderable words
Rendered page 4 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 418 unrenderable words
Error in boxCreate: x < 0 and box off +quad
Rendered page 5 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 424 unrenderable words
Rendered page 6 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 419 unrenderable words
Rendered page 7 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 422 unrenderable words
Rendered page 8 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Stripped 426 unrenderable words
Rendered page 9 to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
=== Phase UP: Generating unicharset and unichar properties files ===
[Sun Dec 27 22:55:22 BST 2020] /c/Program Files/Tesseract-OCR/unicharset_extractor --output_unicharset /tmp/ben-2020-12-27.q7g/ben.unicharset --norm_mode 2 /tmp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.box
Failed to read data from: C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.box
Wrote unicharset file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.unicharset
[Sun Dec 27 22:55:22 BST 2020] /c/Program Files/Tesseract-OCR/set_unicharset_properties -U /tmp/ben-2020-12-27.q7g/ben.unicharset -O /tmp/ben-2020-12-27.q7g/ben.unicharset -X /tmp/ben-2020-12-27.q7g/ben.xheights --script_dir=langdata_lstm
Loaded unicharset of size 3 from file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file C:/Users/MASWOO~1/AppData/Local/Temp/ben-2020-12-27.q7g/ben.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=tessdata
[Sun Dec 27 22:55:22 BST 2020] /c/Program Files/Tesseract-OCR/tesseract /tmp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0.tif /tmp/ben-2020-12-27.q7g/ben.Siyam_Rupali_ANSI.exp0 --psm 6 lstm.train langdata_lstm/ben/ben.config
Error opening data file tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
ERROR: Program Program failed. Abort.
Extracting tessdata components from tessdata/ben.traineddata
Wrote train/ben.lstm
Version string:4.00.00alpha:ben:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx64Lrx64Lfx512O1c1]
0:config:size=377, offset=192
17:lstm:size=10605707, offset=569
18:lstm-punc-dawg:size=3154, offset=10606276
19:lstm-word-dawg:size=427618, offset=10609430
20:lstm-number-dawg:size=426, offset=11037048
21:lstm-unicharset:size=6866, offset=11037474
22:lstm-recoder:size=1003, offset=11044340
23:version:size=80, offset=11045343
Failed to load list of training filenames from train/ben.training_files.txt
Failed to read continue from: output/mohanonda_checkpoint
This is the error that I get. Please help.
Were you able to train the bundled font [tesstrain/fonts/Impact.ttf] successfully?
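Before rerunning, it may be worth confirming that the input files the script expects are in place. The following is a minimal pre-flight sketch with a hypothetical helper `missing_inputs`. The path list is an assumption assembled from the preparation steps in the article and from the log above, which shows Tesseract trying to open tessdata/eng.traineddata even though the training language is ben.

```shell
#!/bin/sh
# Hypothetical pre-flight check: prints every expected path that is missing
# under project root $2 (default: current directory) for language $1.
# The path list is an assumption based on the article and the log above.
missing_inputs() {
  lang=$1
  root=${2:-.}
  for p in \
    "fonts" \
    "langdata_lstm/$lang/$lang.training_text" \
    "tessdata/$lang.traineddata" \
    "tessdata/eng.traineddata" \
    "tessdata/configs/lstm.train"
  do
    [ -e "$root/$p" ] || printf '%s\n' "$p"
  done | sort -u
}

# Example call from the tesstrainsh-win directory:
#   missing_inputs ben
```

An empty result only means the files exist. It says nothing about whether the chosen font can render the training text, which the "Stripped ... unrenderable words" lines in the log suggest is also worth checking.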
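I have a question. Following your procedure, I got as far as
lstmtraining --model_output="output\output" --continue_from="train\chi_sim.lstm" --train_listfile="train\sgs.training_files.txt" --traineddata="train/chi_sim.traineddata" --old_trainddata="tessdata/chi_sim.traineddata" --debug_interval -1 --max_iterations 10000 --target_error_rate 0.01
ERROR: Non-existent flag --old_trainddata=tessdata/chi_sim.traineddata
and this error appeared. What is the problem?
Please see https://github.com/tesseract-ocr/tesseract/issues/1666 and check the versions of the tools and files used for training. Also note that the flag in your command is spelled old_trainddata, whereas the lstmtraining command earlier in this thread spells it --old_traineddata.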