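Tesseract 4.0 APIExamples: Verification Notes and Function Walkthrough
This post records the results of running three examples from the Tesseract 4.0 APIExamples page and walks through the functions they use: "Result iterator example", "Example of iterator over the classifier choices for a single symbol", and "Example to get confidence for alternative symbol choices per character for LSTM". All tests were run with Tesseract 4.0 + VS2017 + Win10. If you have questions about installation, see: Tesseract 4.0 + VS2017 + Win10 source build guide
1. Preface
To get the best performance out of Tesseract in an application, you need to follow some basic usage rules. In my view the two most basic ones are "black text on white paper" and "burn after reading".
Black text on white paper
The wiki in Tesseract's GitHub repository recommends that, for version 4.x and later, the image to be recognized should be dark text on a light background. This is stated explicitly on its ImproveQuality page: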
While tesseract version 3.05 (and older) handle inverted image (dark background and light text) without problem, for 4.x version use dark text on light background.
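If an input arrives as light text on a dark background, one option is to invert it before handing it to Tesseract. The sketch below works on a raw 8-bit grayscale buffer using only the standard library. The function names and the mean-brightness threshold of 128 are assumptions made for illustration; they are not part of Tesseract or Leptonica.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: average brightness of a non-empty 8-bit grayscale buffer.
double MeanBrightness(const std::vector<std::uint8_t>& pixels) {
  assert(!pixels.empty());
  const double sum = std::accumulate(pixels.begin(), pixels.end(), 0.0);
  return sum / static_cast<double>(pixels.size());
}

// Hypothetical helper: if the image looks mostly dark (assumed threshold: 128),
// invert every pixel so that text becomes dark on a light background.
// Returns true when the buffer was inverted.
bool EnsureDarkTextOnLightBackground(std::vector<std::uint8_t>& pixels) {
  if (pixels.empty() || MeanBrightness(pixels) >= 128.0) return false;
  for (std::uint8_t& p : pixels) {
    p = static_cast<std::uint8_t>(255 - p);
  }
  return true;
}
```

A mean-brightness test is a crude heuristic; it only suits pages where the background clearly dominates the text.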
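Burn after reading
Most of Tesseract's commonly used APIs require the caller to delete the returned pointer once it is no longer needed. You can see this in the APIExamples source code, and the function comments in tesseract/src/api/baseapi.cpp contain many statements like this one: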
The returned iterator must be deleted after use.
For example, the pointer that receives the recognition result from GetUTF8Text must also be freed. Tesseract's tesseract/unittest/baseapi_test.cc contains a good pattern for this that is worth borrowing:
    std::string GetCleanedTextResult(tesseract::TessBaseAPI* tess, Pix* pix) {
      tess->SetImage(pix);
      char* result = tess->GetUTF8Text();
      std::string ocr_result = result;
      delete[] result;
      absl::StripAsciiWhitespace(&ocr_result);
      return ocr_result;
    }
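The same "copy, then delete[]" idea can also be expressed with std::unique_ptr<char[]>, so the buffer is released even on an early return. This is only a sketch: FakeGetUTF8Text is a hypothetical stand-in for any API that returns a new[]-allocated string, and TakeOwnedText is a name invented here.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical stand-in for an API that, like GetUTF8Text, returns a buffer
// allocated with new[] which the caller must release with delete[].
char* FakeGetUTF8Text(const char* text) {
  assert(text != nullptr);
  char* buffer = new char[std::strlen(text) + 1];
  std::strcpy(buffer, text);
  return buffer;
}

// Takes ownership of a new[]-allocated C string and returns a std::string copy.
// std::unique_ptr<char[]> calls delete[] when it goes out of scope.
std::string TakeOwnedText(char* raw) {
  std::unique_ptr<char[]> owned(raw);
  return owned ? std::string(owned.get()) : std::string();
}
```

With a real API object, the call would look like TakeOwnedText(tess->GetUTF8Text()).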
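2. APIExamples source code and verification results
To make comparison easier, all three examples verified in this post use the same test image, the one named phototest.tif in the source code.
【1】 "Result iterator example" prints each word in the recognition result, the confidence of that word (0 is lowest, 100 is highest), and the coordinates of the word's bounding rectangle in the original image.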
    Pix *image = pixRead("/usr/src/tesseract/testing/phototest.tif");
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    api->Init(NULL, "eng");
    api->SetImage(image);
    api->Recognize(0);
    tesseract::ResultIterator* ri = api->GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
    if (ri != 0) {
      do {
        const char* word = ri->GetUTF8Text(level);
        float conf = ri->Confidence(level);
        int x1, y1, x2, y2;
        ri->BoundingBox(level, &x1, &y1, &x2, &y2);
        printf("word: '%s'; \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
               word, conf, x1, y1, x2, y2);
        delete[] word;
      } while (ri->Next(level));
    }
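【2】 "Example of iterator over the classifier choices for a single symbol" restricts recognition to the fifth line of text in the original image, that is, "lazy fox. The quick brown dog jumped". It prints the recognition result for each character together with that character's confidence.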
    Pix *image = pixRead("/usr/src/tesseract/testing/phototest.tif");
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    api->Init(NULL, "eng");
    api->SetImage(image);
    api->SetVariable("save_blob_choices", "T");
    api->SetRectangle(37, 228, 548, 31);
    api->Recognize(NULL);
    tesseract::ResultIterator* ri = api->GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_SYMBOL;
    if(ri != 0) {
      do {
        const char* symbol = ri->GetUTF8Text(level);
        float conf = ri->Confidence(level);
        if(symbol != 0) {
          printf("symbol %s, conf: %f", symbol, conf);
          bool indent = false;
          tesseract::ChoiceIterator ci(*ri);
          do {
            if (indent) printf("\t\t ");
            printf("\t- ");
            const char* choice = ci.GetUTF8Text();
            printf("%s conf: %f\n", choice, ci.Confidence());
            indent = true;
          } while(ci.Next());
        }
        printf("---------------------------------------------\n");
        delete[] symbol;
      } while((ri->Next(level)));
    }
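【3】 "Example to get confidence for alternative symbol choices per character for LSTM" prints the characters Tesseract considers possible in the current image, each character's position within its word, the confidence of each candidate, and a bounding rectangle (reported at word level in this example).
The screenshot shows the recognition result for "the lazy". The "t" that Tesseract recognized is the first character of the word "the": it is 99% likely to be "t" and 0% likely to be "h". For the first letter "l" of the word "lazy", Tesseract considers it 95% likely to be "l", 2% likely to be "a", 1% likely to be "I", and 0% likely to be "j".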
    #include <cstdio>
    #include <cstdlib>
    #include <utility>
    #include <vector>
    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

    int main()
    {
      tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
      // Initialize tesseract-ocr with English, without specifying tessdata path
      if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
      }
      // Open input image with leptonica library
      Pix *image = pixRead("/home/ubuntu/tesseract/test/testing/trainingital.tif");
      api->SetImage(image);
      // Set lstm_choice_mode to alternative symbol choices per character, bbox is at word level.
      api->SetVariable("lstm_choice_mode", "2");
      api->Recognize(0);
      tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
      tesseract::ResultIterator* res_it = api->GetIterator();
      // Get confidence level for alternative symbol choices. Code is based on
      // https://github.com/tesseract-ocr/tesseract/blob/master/src/api/hocrrenderer.cpp#L325-L344
      std::vector<std::vector<std::pair<const char*, float>>>* choiceMap = nullptr;
      if (res_it != 0) {
        do {
          const char* word;
          float conf;
          int x1, y1, x2, y2, tcnt = 1, gcnt = 1, wcnt = 0;
          res_it->BoundingBox(level, &x1, &y1, &x2, &y2);
          choiceMap = res_it->GetBestLSTMSymbolChoices();
          for (auto timestep : *choiceMap) {
            if (timestep.size() > 0) {
              for (auto & j : timestep) {
                conf = int(j.second * 100);
                word = j.first;
                printf("%d symbol: '%s'; \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
                       wcnt, word, conf, x1, y1, x2, y2);
                gcnt++;
              }
              tcnt++;
            }
            wcnt++;
            printf("\n");
          }
        } while (res_it->Next(level));
      }
      // Destroy used object and release memory
      api->End();
      pixDestroy(&image);
      return 0;
    }
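One detail worth noting in the listing above: the confidence is computed as int(j.second * 100), which truncates toward zero. A candidate with a small but non-zero probability is therefore printed as 0, which is probably why some alternatives show up as "0%". The sketch below isolates that conversion; the function names are invented here, and the rounding variant is just an alternative for comparison, not something the example does.

```cpp
#include <cassert>
#include <cmath>

// Mirrors the conversion `conf = int(j.second * 100)` from the example:
// the fractional part of the percentage is discarded.
float TruncatedPercent(float probability) {
  assert(probability >= 0.0f && probability <= 1.0f);
  return static_cast<float>(static_cast<int>(probability * 100));
}

// Alternative for comparison (not used by the example): round to the
// nearest whole percent instead of truncating.
float RoundedPercent(float probability) {
  assert(probability >= 0.0f && probability <= 1.0f);
  return std::round(probability * 100.0f);
}
```

For instance, a probability of 0.999 becomes 99 when truncated but 100 when rounded, while 0.004 becomes 0 either way.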
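Looking at the source of the three examples above, we can see that their initialization and overall structure are very similar. The basic sequence is: read the input image -> initialize Tesseract -> SetImage -> Recognize -> fetch different data through ResultIterator and PageIteratorLevel and print it.
Example 【2】 also calls SetRectangle, which makes Tesseract recognize only a specified rectangular region. Examples 【2】 and 【3】 both call SetVariable, but each sets a different parameter.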
    // Open the input image with the leptonica library
    Pix *image = pixRead("/usr/src/tesseract/testing/phototest.tif");
    // Initialize tesseract-ocr with English, without specifying the tessdata path
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    api->Init(NULL, "eng");
    api->SetImage(image);
    // Recognize the image previously set on the API; the results are kept in
    // Tesseract's internal data structures
    api->Recognize(0);
    tesseract::ResultIterator* ri = api->GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
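A note on the SetRectangle call in example 【2】: its arguments are (left, top, width, height), whereas BoundingBox reports two corners (x1, y1, x2, y2). The sketch below converts between the two forms, which makes it easier to check whether a reported box lies inside the requested region. The Box struct and both function names are hypothetical, and the containment check assumes boxes are reported in full-image coordinates.

```cpp
#include <cassert>

// Hypothetical corner-form rectangle, matching the order of the values
// written by BoundingBox: left, top, right, bottom.
struct Box {
  int x1;
  int y1;
  int x2;
  int y2;
};

// Converts SetRectangle-style arguments (left, top, width, height)
// into corner form.
Box RectangleToBox(int left, int top, int width, int height) {
  assert(width >= 0 && height >= 0);
  return Box{left, top, left + width, top + height};
}

// True when `inner` lies entirely within `outer` (edges may touch).
bool Contains(const Box& outer, const Box& inner) {
  return inner.x1 >= outer.x1 && inner.y1 >= outer.y1 &&
         inner.x2 <= outer.x2 && inner.y2 <= outer.y2;
}
```

For the region used in example 【2】, RectangleToBox(37, 228, 548, 31) gives the corners (37, 228) and (585, 259).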
First, let's look at PageIteratorLevel and ResultIterator.
PageIteratorLevel is an enum type, defined in tesseract/include/tesseract/publictypes.h.
    /**
     * enum of the elements of the page hierarchy, used in ResultIterator
     * to provide functions that operate on each level without having to
     * have 5x as many functions.
     */
    enum PageIteratorLevel {
      RIL_BLOCK,     // Block of text/image/separator line.
      RIL_PARA,      // Paragraph within a block.
      RIL_TEXTLINE,  // Line within a paragraph.
      RIL_WORD,      // Word within a textline.
      RIL_SYMBOL     // Symbol/character within a word.
    };
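Because the enumerators are declared from coarsest to finest, their integer values can be compared to tell which level is more fine-grained. The sketch below uses a local copy of the enum so that it compiles without Tesseract; LevelName and IsFinerThan are helper names invented for this post, not library functions.

```cpp
#include <cassert>

// Local copy of the enum shown above, so this sketch builds without Tesseract.
enum PageIteratorLevel {
  RIL_BLOCK,
  RIL_PARA,
  RIL_TEXTLINE,
  RIL_WORD,
  RIL_SYMBOL
};

// Hypothetical helper: human-readable name for logging.
const char* LevelName(PageIteratorLevel level) {
  switch (level) {
    case RIL_BLOCK:    return "block";
    case RIL_PARA:     return "paragraph";
    case RIL_TEXTLINE: return "textline";
    case RIL_WORD:     return "word";
    case RIL_SYMBOL:   return "symbol";
  }
  assert(false && "unknown PageIteratorLevel");
  return "unknown";
}

// Hypothetical helper: a larger enumerator value means a finer level,
// because the enum is declared from block down to symbol.
bool IsFinerThan(PageIteratorLevel a, PageIteratorLevel b) {
  return static_cast<int>(a) > static_cast<int>(b);
}
```

This is why example 【1】 iterates with RIL_WORD while example 【2】 switches to RIL_SYMBOL to go one level deeper.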
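ResultIterator is a separate class used to access Tesseract's recognition results. Before using a ResultIterator, Tesseract must have been initialized correctly and recognition must already have been run on the image; recognition can be triggered by functions such as Recognize, GetUTF8Text, or TesseractRect. The class provides functions for retrieving the recognition results, including characters, character positions, words, word positions, and the confidence of characters and words. For details, see its header file, include/tesseract/resultiterator.h: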
    // File:        resultiterator.h
    // Description: Iterator for tesseract results that is capable of
    //              iterating in proper reading order over Bi Directional
    //              (e.g. mixed Hebrew and English) text.
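This makes the overall picture clear: once recognition finishes, Tesseract exposes the results through ResultIterator, and PageIteratorLevel selects the unit in which results are fetched, such as line, paragraph, word, or character. By choosing the appropriate level, you can read the values you need from ResultIterator.
The SetVariable function used in examples 【2】 and 【3】 is remarkably versatile. src/ccmain/tesseractclass.h lists some of the variables it can set, along with explanations of their meaning. However, "save_blob_choices" and "lstm_choice_mode", which appear in APIExamples, are not in that file, and unfortunately I could not find a complete list of settable variables and their meanings in the Tesseract wiki either. If anyone knows where the complete definitions are, please let me know.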
    INT_VAR_H(tessedit_ocr_engine_mode, tesseract::OEM_DEFAULT,
              "Which OCR engine(s) to run (Tesseract, LSTM, both). Defaults"
              " to loading and running the most accurate available.");
    STRING_VAR_H(tessedit_char_blacklist, "",
                 "Blacklist of chars not to recognize");
    STRING_VAR_H(tessedit_char_whitelist, "",
                 "Whitelist of chars to recognize");
    STRING_VAR_H(tessedit_char_unblacklist, "",
                 "List of chars to override tessedit_char_blacklist");
That's all for this post. Thanks for reading.