Train Tesseract LSTM methods Comparison.
Train Tesseract LSTM with tesstrain.sh on Windows.
How the makefile in tesstrain-win work
Win10 Tesseract4.1 LSTM training.
The tesseract-ocr/tesstrain repository on GitHub implements LSTM training for Tesseract. It is powerful, simple, and easy to use, but it only works on Linux. To make the project work on Windows, I made some changes to the makefile and the file structure; the modified project is tesstrain-win. This article mainly records how to use tesstrain-win and how it was adapted.
1. Windows 10 x64.
2. Tesseract 4.1 compiled from source.
A Tesseract 4.0+ build compiled from source contains all the training tools tesstrain needs; other installed versions may not include them. This article uses the source-compiled Tesseract 4.1 release.
For how to compile Tesseract 4.0+ from source on Windows, please refer to Tesseract 4.0 + VS2017 + Win10 compilation strategy.
Download link for the compiled version 4.1, with training tools, used in this article.
The makefile uses Python's Pillow library for image processing, so confirm that Pillow is installed successfully.
Running the makefile on Windows requires Cygwin; for installation, refer to Install Cygwin on Win10 for makefile.
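The prerequisites above can be checked before starting a long training run. The following is a minimal sketch, not part of tesstrain-win: the list of tool names is an assumption (typical Tesseract training executables plus the utilities mentioned in this article), so adjust it to what your makefile actually calls.

```python
import importlib.util
import shutil

# Assumed tool list: typical Tesseract training executables plus the
# utilities this article mentions. Adjust it to your own makefile.
REQUIRED_TOOLS = [
    "tesseract", "lstmtraining", "combine_tessdata",
    "unicharset_extractor", "make", "wget", "bc", "python",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def has_module(name):
    """Return True if a top-level module can be imported (Pillow is 'PIL')."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Missing tools:", missing_tools(REQUIRED_TOOLS) or "none")
    print("Pillow installed:", has_module("PIL"))
```

Run it from the same Cygwin or command prompt session you will use for make, because the result depends on that session's PATH.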
Description of the tesstrain-win files
The tesstrain-win project has two folders, data and old-ocrd, as well as several files, which are roughly structured as shown in the figure:
The ground-truth sample data (ocrd-testset.zip) can be unzipped in its current folder to test training; to train a custom traineddata, replace it with your own ground truth.
These two files are used in training and are the same regardless of language. They can be downloaded from tesseract-ocr/langdata_lstm. In most cases there is no need to change them.
The above three files are specific to the training language. For example, if the training language is English, the files can be downloaded from tesseract-ocr/langdata_lstm/tree/master/eng. Then rename them as needed.
old-ocrd: This folder is the predecessor of tesseract-ocr/tesstrain, formerly known as OCR-D/ocrd-train. Its makefile is very different from the current tesseract-ocr/tesstrain makefile. It can help us understand how the makefile works.
makefile: Modified so that it runs on Windows.
The other files are the same as in tesseract-ocr/tesstrain and are not described here.
Before you train on your own data, it is recommended to test with the ocrd-testset.zip from GitHub. If ocrd-testset.zip trains successfully, all the training tools on your computer are ready to use. So let's start with ocrd-testset.zip as an example of how to use tesstrain-win. If you encounter an error during training, look for it in the errors and solutions section near the end of this article.
1. Unzip ocrd-testset.zip.
After downloading and decompressing tesstrain-win, unzip ./data/foo-ground-truth/ocrd-testset.zip to ./data/foo-ground-truth.
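If you prefer to script this step, the archive can also be extracted with Python's standard zipfile module. This is only a sketch; the path in the example is the one given above, and the function name is made up for illustration.

```python
import zipfile
from pathlib import Path


def unzip_ground_truth(zip_path, target_dir):
    """Extract zip_path into target_dir and return the number of files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        # Directory entries end with "/", so they are not counted as files.
        return len([n for n in archive.namelist() if not n.endswith("/")])


if __name__ == "__main__":
    zip_file = Path("data/foo-ground-truth/ocrd-testset.zip")
    if zip_file.exists():
        count = unzip_ground_truth(zip_file, zip_file.parent)
        print(count, "files extracted")
    else:
        print("Not found:", zip_file)
```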
2. Run the command prompt as an administrator and enter the path where tesstrain-win is located.
3. Run make training.
Running make training --trace is recommended: --trace prints every command the makefile executes, which makes it easy to locate the position and cause of an error if one occurs.
4. When the training completes
If there are no errors, training on ocrd-testset.zip takes about 8 hours on a 64-bit Win10 PC with 8 GB of memory.
2 Percent improvement time=1059, best error was 3.1 @ 5121
At iteration 6180/10000/10000, Mean rms=0.487%, delta=0.265%, char train=0.924%, word train=3.421%, skip ratio=0%, New best char error = 0.924 wrote best model:data/checkpoints/foo0.924_6180.checkpoint wrote checkpoint.
Finished! Error rate = 0.924
Makefile:158: update target 'data/foo.traineddata' due to: data/checkpoints/foo_checkpoint
--continue_from data/checkpoints/foo_checkpoint \
--traineddata data/foo/foo.traineddata \
Loaded file data/checkpoints/foo_checkpoint, unpacking...
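The progress lines in the output above carry the iteration number and the error rates, which are handy to extract when you save the output to a file and want to follow a long run. The sketch below is written against the single progress line shown in this article; it is an assumption that other Tesseract versions print the same format.

```python
import re

# Pattern derived from the progress line quoted in this article.
PROGRESS = re.compile(
    r"At iteration (?P<iteration>\d+)/\d+/\d+, "
    r"Mean rms=(?P<rms>[\d.]+)%, delta=(?P<delta>[\d.]+)%, "
    r"char train=(?P<char_error>[\d.]+)%, "
    r"word train=(?P<word_error>[\d.]+)%"
)


def parse_progress(line):
    """Return iteration and error rates from a progress line, or None."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return {
        "iteration": int(match["iteration"]),
        "char_error": float(match["char_error"]),
        "word_error": float(match["word_error"]),
    }


if __name__ == "__main__":
    line = ("At iteration 6180/10000/10000, Mean rms=0.487%, delta=0.265%, "
            "char train=0.924%, word train=3.421%, skip ratio=0%")
    print(parse_progress(line))
```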
Train a custom traineddata
1. Name the traineddata
You can set the model name directly by changing MODEL_NAME on the 11th line of the makefile.
Or set it on the command line when you run make training:
make training MODEL_NAME=foo
2. Prepare the starter traineddata
Training from scratch: START_MODEL in the makefile has no value. There is no need to prepare a starter traineddata.
Fine-tuning: for example, if START_MODEL=eng in the makefile, download eng.traineddata from tessdata_best and put it in the folder data/tessdata.
3. Update the foo.numbers/foo.punc/foo.wordlist in the data folder.
If the starter traineddata is eng, download the .numbers/.punc/.wordlist files from langdata_lstm/eng, then rename them to foo.numbers/foo.punc/foo.wordlist.
4. Prepare ground-truth.
The ground truth consists of image files and their corresponding text files.
The image must be in TIFF or PNG format, with the suffix .tif, .png, .bin.png, or .nrm.png.
The text file has the same name as the image with the suffix .gt.txt, and its content is the text shown in the image.
Each image, and its corresponding text file, must contain a single line of text.
The ground-truth path is: data/foo-ground-truth.
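The ground-truth rules above are easy to violate when the data set is large, so a quick check before training can save hours. The following is a minimal sketch that only verifies file pairing and the single-line rule for the text files; it does not inspect the images, and the function names are made up for illustration.

```python
from pathlib import Path

# Longer suffixes come first so "x.bin.png" is not treated as "x.bin" + ".png".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".tif", ".png")


def base_name(file_name):
    """Return the name without its image suffix, or None if not an image."""
    for suffix in IMAGE_SUFFIXES:
        if file_name.endswith(suffix):
            return file_name[: -len(suffix)]
    return None


def check_ground_truth(directory):
    """Return a list of problems found in a ground-truth folder."""
    problems = []
    for path in sorted(Path(directory).iterdir()):
        stem = base_name(path.name)
        if stem is None:
            continue  # not an image file
        text_file = path.with_name(stem + ".gt.txt")
        if not text_file.exists():
            problems.append(f"{path.name}: missing {text_file.name}")
            continue
        lines = text_file.read_text(encoding="utf-8").splitlines()
        if len([line for line in lines if line.strip()]) != 1:
            problems.append(f"{text_file.name}: expected exactly one line of text")
    return problems


if __name__ == "__main__":
    folder = Path("data/foo-ground-truth")
    if folder.is_dir():
        for problem in check_ground_truth(folder):
            print(problem)
```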
5. Run the command prompt as an administrator and enter the path where tesstrain-win is located.
6. Run make training.
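The steps above end in a single make invocation, and it can be convenient to assemble it in a script. This sketch assumes that MODEL_NAME and START_MODEL can be overridden on the make command line, which is standard GNU make behaviour for variables assigned in a makefile; the helper names are made up for illustration.

```python
import subprocess


def build_make_command(model_name, start_model=None, trace=False):
    """Build the argument list for a 'make training' run."""
    command = ["make", "training", f"MODEL_NAME={model_name}"]
    if start_model:
        # Leave START_MODEL out entirely to train from scratch.
        command.append(f"START_MODEL={start_model}")
    if trace:
        command.append("--trace")
    return command


def start_training(command):
    """Run the command in the current folder and raise if make fails."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    cmd = build_make_command("foo", start_model="eng", trace=True)
    print(" ".join(cmd))
    # Call start_training(cmd) from the tesstrain-win folder to begin training.
```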
Errors and the appropriate solutions that you may encounter
You may encounter all kinds of errors while training a traineddata. If you do, check whether your error is listed here:
1. /bin/bash: python3: command not found.
If you are running tesseract-ocr/tesstrain, you may encounter this error. Change every python3 in your makefile to python.
Reason for modification:
On Win10 the Python 3 executable is named python.exe for console programs and pythonw.exe for GUI programs, so there is no python3 command. Please refer to bash-python3-command-not-found-windows-discord-py.
2. UnicodeEncodeError: 'gbk' codec can't encode character.
When you run the OCR-D/ocrd-train makefile, the above error may occur. The relevant line in the older version of the makefile looks like this:
python generate_line_box.py -i "$(GROUND_TRUTH_DIR)/$*.tif" -t "$(GROUND_TRUTH_DIR)/$*.gt.txt" > "$@"
Solution: add PYTHONIOENCODING=utf-8 in front of the python command:
PYTHONIOENCODING=utf-8 python $(GENERATE_BOX_SCRIPT) -i "$*.tif" -t "$*.gt.txt" > "$@"
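The cause is that the script's output is redirected to a file, and Python encodes that output with the console code page (gbk on a Chinese-language Windows) unless told otherwise. The sketch below reproduces the failure in isolation; the sample word is an illustrative assumption, chosen because it contains the long s (U+017F), which gbk cannot represent.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    sample = "Wa\u017f\u017fer"  # illustrative text containing the long s
    for codec in ("gbk", "utf-8"):
        print(codec, can_encode(sample, codec))
```

Setting PYTHONIOENCODING=utf-8 makes Python use UTF-8 for standard output, so the same characters can be written to the box file.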
3. /bin/bash: …/Microsoft/WindowsApps/python3: Permission denied.
Reference links for solving this error:
“Permission Denied” trying to run Python on Windows 10.
How to manage app execution aliases on Win10.
3.1 In the user variables --> Path, make sure that
[……\AppData\Local\Programs\Python\Python36\] is above [%USERPROFILE%\AppData\Local\Microsoft\WindowsApps], as in the following image:
3.2 Disable the Python aliases under the error path through “Manage app execution aliases”:
Right-click the Start menu --> Settings --> Apps --> Apps & features --> App execution aliases, then turn off python.exe and python3.exe, as shown in the following image:
Some computers only need step 3.1; others need both 3.1 and 3.2. My computer needed both.
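The ordering required in step 3.1 can also be checked from a script. This is a rough sketch under a simplifying assumption: any PATH entry containing "python" counts as the real Python installation, and any entry containing "windowsapps" counts as the alias folder.

```python
import os


def python_before_windowsapps(path_value, separator=";"):
    """Return True if a Python folder appears before WindowsApps in PATH."""
    entries = [e.strip().lower() for e in path_value.split(separator) if e.strip()]
    python_index = next(
        (i for i, e in enumerate(entries)
         if "python" in e and "windowsapps" not in e),
        None,
    )
    apps_index = next(
        (i for i, e in enumerate(entries) if "windowsapps" in e), None
    )
    if python_index is None:
        return False  # no Python folder on PATH at all
    return apps_index is None or python_index < apps_index


if __name__ == "__main__":
    ok = python_before_windowsapps(os.environ.get("PATH", ""), os.pathsep)
    print("Python precedes WindowsApps:", ok)
```

Note that this reads the combined PATH of the current process, so open a new command prompt after changing the environment variables.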
4. command not found.
/bin/bash: wget: command not found.
/bin/bash: line 1: bc: command not found.
Please refer to Install Cygwin on Win10 for makefile to confirm that wget and bc are installed.
5. Failed to load any lstm-specific dictionaries for lang…
This error does not occur during training with tesseract-ocr/tesstrain, and the finished traineddata works normally, but a warning is reported when it is used.
This warning relates to three files in the data folder with the suffixes .wordlist/.numbers/.punc:
[.wordlist]: The system word list for the language model.
[.numbers]: The allowed number patterns.
[.punc]: The punctuation patterns allowed around words.
In tesseract-ocr/tesstrain, the default path for the three files above is $(OUTPUT_DIR) = data/$(MODEL_NAME), and all the files under this path are generated automatically during training.
If START_MODEL in the makefile has no value, the makefile does not generate any related files under that path.
If START_MODEL has a value, the following statement in the makefile is executed:
combine_tessdata -u $(TESSDATA)/$(START_MODEL).traineddata \
This statement unpacks the traineddata into several files such as foo.lstm-number-dawg, foo.lstm-punc-dawg, foo.lstm-word-dawg, etc. But what training needs are the .wordlist/.numbers/.punc files. Therefore, if you train with tesseract-ocr/tesstrain, the warning is reported when you use the traineddata you trained.
The makefile in tesstrain-win does not fix this issue directly; it uses another solution: the paths of the relevant variables are changed, and you prepare the .wordlist/.numbers/.punc files under the corresponding path. A traineddata trained with tesstrain-win then loads without this warning.
If you want to train with tesseract-ocr/tesstrain and solve this problem, search for WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE in its makefile and modify them by referring to the makefile in tesstrain-win.
Related link for this problem: failed to load any lstm-specifics for lang xxx.
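Since the warning comes from missing .wordlist/.numbers/.punc files, it helps to confirm they exist before training. The sketch below assumes the files live directly in the data folder and are named after the model, as described in the custom training steps of this article; the function name is made up for illustration.

```python
from pathlib import Path

DICTIONARY_SUFFIXES = (".wordlist", ".numbers", ".punc")


def missing_dictionary_files(data_dir, model_name):
    """Return the expected dictionary source files that do not exist."""
    folder = Path(data_dir)
    return [
        f"{model_name}{suffix}"
        for suffix in DICTIONARY_SUFFIXES
        if not (folder / f"{model_name}{suffix}").is_file()
    ]


if __name__ == "__main__":
    missing = missing_dictionary_files("data", "foo")
    print("Missing:", ", ".join(missing) if missing else "none")
```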
6. /bin/bash: merge_unicharsets: command not found.
If the START_MODEL in the makefile has a value, the following statement is executed:
If you train with the source-compiled Tesseract 4.1, the following error may occur:
/bin/bash: merge_unicharsets: command not found.
The reason is that the compiled Tesseract 4.1 has no merge_unicharsets executable; the specific reason and a proper solution are not yet known.
merge_unicharsets combines the existing character set with the new characters added during fine-tuning. Changing the statement to the following one lets training run to completion without the merge_unicharsets command, although the resulting traineddata may perform differently.
The modification uses the new character set as the complete character set for training:
cp "$(OUTPUT_DIR)/my.unicharset" "data/unicharset"
Alternatively, you can use the character set of the starter traineddata as the complete character set for training:
cp "/$(START_MODEL)/$(MODEL_NAME).lstm-unicharset" "data/unicharset"
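To see why the result may differ between these choices, it helps to look at the set logic only. The sketch below is conceptual: it treats a character set as a plain list of characters, whereas a real unicharset file stores additional properties per character, and it is not how merge_unicharsets is implemented.

```python
def merge_charsets(existing, new):
    """Combine two character lists, keeping first-seen order, no duplicates."""
    return list(dict.fromkeys([*existing, *new]))


def uncovered(text, charset):
    """Return the characters in text that the charset cannot represent."""
    return sorted(set(text) - set(charset))


if __name__ == "__main__":
    starter = list("abc")   # characters known to the starter model
    new = list("cd\u00e9")  # characters found in the new ground truth
    sample = "abcd\u00e9"
    for label, charset in (
        ("merged", merge_charsets(starter, new)),
        ("new only", new),
        ("starter only", starter),
    ):
        print(label, "cannot represent:", uncovered(sample, charset))
```

With the new character set alone, characters that only the starter model knew are dropped; with the starter character set alone, characters that appear only in the new ground truth cannot be learned.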
This is the end of the article. Thank you for reading.