Settings That Affect pytesseract Recognition Results
Environment
    import pytesseract

    # If you don't have the tesseract executable in your PATH, include the following:
    pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
    # Example: tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

    # Example config: r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
    # It's important to add double quotes around the dir path.
    tessdata_dir_config = r'--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
    # `image` is an image loaded elsewhere (for example with cv2.imread).
    pytesseract.image_to_data(image, lang='chi_sim', config=tessdata_dir_config)
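Before hard-coding a path, it may be worth checking whether the executable is already on the PATH. The following is only a sketch using the standard library; `find_tesseract` and its `fallback` parameter are hypothetical names, not part of pytesseract.

```python
import shutil
from typing import Optional


def find_tesseract(fallback: Optional[str] = None) -> Optional[str]:
    """Return the tesseract executable found on PATH, else the fallback path."""
    return shutil.which("tesseract") or fallback


if __name__ == "__main__":
    # The fallback below is the example install location quoted above.
    path = find_tesseract(r"C:\Program Files (x86)\Tesseract-OCR\tesseract")
    print("tesseract executable:", path)
    # With pytesseract installed, the result could then be assigned:
    # pytesseract.pytesseract.tesseract_cmd = path
```

If the function returns None, neither the PATH lookup nor the fallback produced a location, and the path has to be set by hand.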
PSM in pytesseract
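PSM stands for Page Segmentation Mode; the Tesseract source describes the values as "Possible modes for page layout analysis". In the C++ API the default is PSM_SINGLE_BLOCK (assume a single uniform block of text), as the header comment says, whereas the tesseract command line, which pytesseract invokes, defaults to --psm 3 (PSM_AUTO). The modes are defined in include/tesseract/publictypes.h, as shown here: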
    /**
     * Possible modes for page layout analysis. These *must* be kept in order
     * of decreasing amount of layout analysis to be done, except for OSD_ONLY,
     * so that the inequality test macros below work.
     */
    enum PageSegMode {
      PSM_OSD_ONLY = 0,      ///< Orientation and script detection only.
      PSM_AUTO_OSD = 1,      ///< Automatic page segmentation with orientation and
                             ///< script detection. (OSD)
      PSM_AUTO_ONLY = 2,     ///< Automatic page segmentation, but no OSD, or OCR.
      PSM_AUTO = 3,          ///< Fully automatic page segmentation, but no OSD.
      PSM_SINGLE_COLUMN = 4, ///< Assume a single column of text of variable sizes.
      PSM_SINGLE_BLOCK_VERT_TEXT = 5, ///< Assume a single uniform block of
                                      ///< vertically aligned text.
      PSM_SINGLE_BLOCK = 6,  ///< Assume a single uniform block of text. (Default.)
      PSM_SINGLE_LINE = 7,   ///< Treat the image as a single text line.
      PSM_SINGLE_WORD = 8,   ///< Treat the image as a single word.
      PSM_CIRCLE_WORD = 9,   ///< Treat the image as a single word in a circle.
      PSM_SINGLE_CHAR = 10,  ///< Treat the image as a single character.
      PSM_SPARSE_TEXT = 11,  ///< Find as much text as possible in no particular order.
      PSM_SPARSE_TEXT_OSD = 12, ///< Sparse text with orientation and script det.
      PSM_RAW_LINE = 13,     ///< Treat the image as a single text line, bypassing
                             ///< hacks that are Tesseract-specific.
      PSM_COUNT              ///< Number of enum entries.
    };
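When switching modes often from Python, it can be convenient to keep the enum values next to the code. The following is a small sketch rather than anything pytesseract provides: the `PSM_MODES` table and the `psm_flag` helper are hypothetical names, and the descriptions are paraphrased from the enum comments above.

```python
# Hypothetical lookup table mirroring the PageSegMode enum above.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD or OCR",
    3: "Fully automatic page segmentation, but no OSD",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
    11: "Sparse text, in no particular order",
    12: "Sparse text with OSD",
    13: "Raw line, bypassing Tesseract-specific hacks",
}


def psm_flag(mode: int) -> str:
    """Return the '--psm N' fragment for a pytesseract config string."""
    if mode not in PSM_MODES:
        raise ValueError(f"unknown PSM value: {mode!r}")
    return f"--psm {mode}"


if __name__ == "__main__":
    for mode in (11, 12):
        print(psm_flag(mode), "->", PSM_MODES[mode])
```

Rejecting unknown values early avoids passing a mistyped mode number on to Tesseract.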
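Next, we test with a photo of warehouse shelving, using Tesseract to locate and read the item numbers on the shelves. The code is below; during testing we change the PSM value to observe the differences: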
    from pytesseract import Output
    import pytesseract
    import cv2
    import imutils

    image = cv2.imread("ocr.jpg")
    # Swap color channel ordering from BGR (OpenCV's default) to RGB (compatible with Tesseract and pytesseract).
    # By default OpenCV stores images in BGR format and since pytesseract assumes RGB format,
    # we need to convert from BGR to RGB format/mode:
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    results = pytesseract.image_to_data(rgb, output_type=Output.DICT, lang='eng', config='-c tessedit_char_whitelist=0123456789 --psm 12')

    for i in range(0, len(results["text"])):
        # extract the bounding box coordinates of the text region from the current result
        tmp_tl_x = results["left"][i]
        tmp_tl_y = results["top"][i]
        tmp_br_x = tmp_tl_x + results["width"][i]
        tmp_br_y = tmp_tl_y + results["height"][i]
        tmp_level = results["level"][i]
        conf = results["conf"][i]
        text = results["text"][i]

        if tmp_level == 5:
            cv2.putText(image, text, (tmp_tl_x, tmp_tl_y - 10), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 3)
            cv2.rectangle(image, (tmp_tl_x, tmp_tl_y), (tmp_br_x, tmp_br_y), (0, 0, 255), 2)

    showImg = imutils.resize(image, width=1600)
    cv2.imshow("image", showImg)
    cv2.waitKey(0)
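Lines 1 to 4 import the required libraries.
Line 6 reads the sample image, which sits in the same directory as the .py file.
Line 12 calls pytesseract's image_to_data() function to get the recognition results. The output type is Output.DICT, which lets us draw the text boxes and the recognized text from the returned data.
The config argument is used here: it is where the PSM is set, and the string after tessedit_char_whitelist restricts recognition to the listed characters. Tesseract also has a blacklist (tessedit_char_blacklist) which, as the name suggests, excludes the listed characters; it is set the same way.
Lines 14 to 26 pull the information for each text box out of the results, draw a rectangle around the text in the original image, and draw the recognized text onto the image. Only rows with level 5 (single words) are drawn.
Line 28 uses imutils.resize, which scales the image to a given width while keeping the aspect ratio. The sample image is too wide, so shrinking it makes it easier to inspect.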
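The filtering done in the loop can also be separated from the drawing. The sketch below works on the dictionary returned by image_to_data with Output.DICT; the function name `extract_words` and the `min_conf` parameter are hypothetical, and the sample data is hand-made for illustration rather than real OCR output.

```python
from typing import Any, Dict, List, Tuple

WORD_LEVEL = 5  # in image_to_data output, level 5 rows describe single words

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def extract_words(results: Dict[str, List[Any]],
                  min_conf: float = -1.0) -> List[Tuple[str, Box]]:
    """Return (text, box) for each non-empty word row with conf > min_conf."""
    words = []
    for i, text in enumerate(results["text"]):
        if results["level"][i] != WORD_LEVEL:
            continue
        # conf may arrive as int, float or str depending on the pytesseract
        # version (an assumption worth checking), so coerce before comparing.
        conf = float(results["conf"][i])
        if conf <= min_conf or not text.strip():
            continue
        x, y = results["left"][i], results["top"][i]
        w, h = results["width"][i], results["height"][i]
        words.append((text, (x, y, x + w, y + h)))
    return words


if __name__ == "__main__":
    # Hand-made sample shaped like Output.DICT; real data would come from
    # pytesseract.image_to_data(rgb, output_type=Output.DICT).
    sample = {
        "level": [1, 5, 5],
        "left": [0, 10, 60],
        "top": [0, 20, 20],
        "width": [200, 40, 30],
        "height": [100, 15, 15],
        "conf": ["-1", "96", "31"],
        "text": ["", "1024", "7"],
    }
    print(extract_words(sample, min_conf=50))
```

Each returned box can be passed to cv2.rectangle as its top-left and bottom-right corners, as in the loop above.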
By changing the config passed to pytesseract.image_to_data, I tested the following combinations:
RGB, PSM=12, no whitelist, conf>50: config='--psm 12', with conf > 50 added to the if condition on line 24 of the code.
RGB, PSM=12, with whitelist: config='-c tessedit_char_whitelist=0123456789 --psm 12'.
RGB, PSM=11, with whitelist: config='-c tessedit_char_whitelist=0123456789 --psm 11'.
GRAY, PSM=12, with whitelist: config='-c tessedit_char_whitelist=0123456789 --psm 12'. This test requires changing cv2.COLOR_BGR2RGB on line 10 of the code to cv2.COLOR_BGR2GRAY.
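To avoid retyping the config strings for each run, they can be assembled from their parts. This is a minimal sketch: `build_config` and `TEST_RUNS` are hypothetical names, and the helper assumes the whitelist or blacklist contains no spaces or quote characters, since those would need extra quoting inside the config string.

```python
from typing import Optional


def build_config(psm: int,
                 whitelist: Optional[str] = None,
                 blacklist: Optional[str] = None) -> str:
    """Assemble a pytesseract config string from the options discussed above."""
    parts = []
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    if blacklist:
        parts.append(f"-c tessedit_char_blacklist={blacklist}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


# The four runs listed above: (input color mode, PSM, whitelist, conf threshold).
TEST_RUNS = [
    ("RGB", 12, None, 50),
    ("RGB", 12, "0123456789", None),
    ("RGB", 11, "0123456789", None),
    ("GRAY", 12, "0123456789", None),
]

if __name__ == "__main__":
    for color_mode, psm, whitelist, min_conf in TEST_RUNS:
        print(color_mode, repr(build_config(psm, whitelist)), "min conf:", min_conf)
```

The resulting string is what would be passed as the config argument of pytesseract.image_to_data.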
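The test results are shown in the animation below:
As the examples show, the PSM setting has a clear effect on overall recognition results, so it needs to be tuned to the actual use case.
The original image from the animation and the image with the best Tesseract recognition result are shown below:
Although this subsection is titled "PSM in pytesseract", the same applies to Tesseract in general: whether you use it from C++ or run tesseract.exe on the command line, PSM has a similar effect.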
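For comparison, the same options can be handed to the tesseract executable directly. The sketch below only builds the argument list; `tesseract_cli_args` is a hypothetical helper, and it assumes Tesseract 4 or later, where the options are spelled `--psm`, `-l` and `-c`, and where `stdout` as the output base prints the text to standard output.

```python
import shlex
from typing import List


def tesseract_cli_args(image_path: str, psm: int, lang: str = "eng",
                       whitelist: str = "") -> List[str]:
    """Build an argument list for the tesseract executable (text to stdout)."""
    args = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    if whitelist:
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args


if __name__ == "__main__":
    args = tesseract_cli_args("ocr.jpg", psm=12, whitelist="0123456789")
    # Show the equivalent shell command.
    print(shlex.join(args))
    # To actually run it (requires tesseract on PATH):
    # import subprocess
    # text = subprocess.run(args, capture_output=True, text=True, check=True).stdout
```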
Does Tesseract's input image need to be binarized?
    import cv2

    image = cv2.imread("pytesseract.png")
    # OpenCV stores images in BGR order by default, while pytesseract assumes RGB,
    # so swap the channel order before passing the image to pytesseract:
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
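Which input suits Tesseract best: a binarized image, an RGB color image, or a grayscale image? Tesseract has no formal documentation on this. However, several Tesseract developers discuss the question in the issue "RFC: allow flexible or better binarization #3083".
Here is my personal understanding of that discussion (applies only to Tesseract 4.0 and later):
3. Tesseract still uses a binarized image for layout analysis and image segmentation. If the input is an RGB color image or a grayscale image, Tesseract binarizes it with its own internal algorithm for segmentation and layout analysis, and that algorithm gives only mediocre results.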
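Following that discussion, I tried converting the BGR image to grayscale with OpenCV and using the result as Tesseract's input. The results are as follows:
For the test image in this article, judging from the animation in the first section, feeding in the RGB color image and letting Tesseract do the grayscale conversion internally works better.
Overall, for Tesseract 4.0 there is no single best input image; choose or combine inputs according to your needs and the discussion above. For example, if the layout is complex, you can feed in a high-quality binarized image to segment the page, crop the grayscale image using the segmentation coordinates, and then feed the grayscale crops in for text recognition. Hopefully the next version of Tesseract will offer a better solution to this problem.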
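The two-stage idea could be sketched roughly as follows. This is an illustration under several assumptions, not a tested recipe: Otsu thresholding stands in for "a high-quality binarization", line-level rows (level 4) of image_to_data serve as the segmentation result, each crop is recognized with `--psm 7`, and the names `pad_and_clamp` and `two_stage_ocr` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

LINE_LEVEL = 4  # image_to_data rows with level 4 describe text lines


def pad_and_clamp(box: Box, img_w: int, img_h: int, pad: int = 4) -> Box:
    """Grow a box by `pad` pixels on each side without leaving the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(img_w, x2 + pad), min(img_h, y2 + pad))


def two_stage_ocr(path: str, lang: str = "eng") -> List[Tuple[str, Box]]:
    """Segment on a binarized image, then recognize grayscale crops."""
    # Imported here so pad_and_clamp stays usable without OpenCV/pytesseract.
    import cv2
    import pytesseract
    from pytesseract import Output

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)
    # Stage 1: Otsu binarization (an assumed stand-in for a better binarizer),
    # used only to locate text lines.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    layout = pytesseract.image_to_data(binary, lang=lang, config="--psm 3",
                                       output_type=Output.DICT)
    img_h, img_w = gray.shape
    texts = []
    for i, level in enumerate(layout["level"]):
        if level != LINE_LEVEL:
            continue
        x, y = layout["left"][i], layout["top"][i]
        box = (x, y, x + layout["width"][i], y + layout["height"][i])
        x1, y1, x2, y2 = pad_and_clamp(box, img_w, img_h)
        # Stage 2: recognize the grayscale crop as a single text line.
        crop = gray[y1:y2, x1:x2]
        text = pytesseract.image_to_string(crop, lang=lang, config="--psm 7")
        texts.append((text.strip(), (x1, y1, x2, y2)))
    return texts


if __name__ == "__main__":
    print(pad_and_clamp((2, 3, 98, 50), img_w=100, img_h=60))
```

Whether this beats a single pass on the RGB image depends on the image, so it is worth comparing both on your own data.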
That concludes this article. Thanks for reading, and feel free to follow.